sailfish-keyboard / presage

Fork of Presage (http://presage.sourceforge.net/)
GNU General Public License v2.0

german presage #26

Closed matzgewinn closed 4 years ago

matzgewinn commented 4 years ago

I would like to have a German Presage. On OpenRepos I read that it needs a text corpus for that. I will try to get one. Is there anything special I have to take care of? Best regards

rinigus commented 4 years ago

Very good, thanks for looking into it! The corpus has to be a plain text file that will be processed by the software. For reference, the text files used so far ranged from 200 MB to 2 GB.

matzgewinn commented 4 years ago

I'm a bit confused by all the XML and other formats of the German corpora. So far I have two word lists; they are much smaller than the sizes you mentioned and come from here: https://github.com/solariz/german_stopwords and here: https://sourceforge.net/projects/germandict/ They are plain text files - is this enough, and if yes, how do I proceed?

rinigus commented 4 years ago

No, we are looking for text with words in sentences, such as a collection of articles, texts of speeches, chatroom logs, or movie subtitles. Something from which you could calculate the chance of seeing a given word after a combination of other words. A dictionary alone will not help in this case.
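To illustrate why running text is needed, here is a minimal sketch of estimating how likely a word is to follow another one. It is a toy bigram counter, not Presage's actual implementation; the function names and the three sample sentences are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair every word with the word that comes right after it.
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following


def next_word_probability(following, prev, candidate):
    """Estimate P(candidate | prev) from the raw counts."""
    counts = following.get(prev)
    if not counts:
        return 0.0  # prev was never seen with a following word
    return counts[candidate] / sum(counts.values())


# Hypothetical sample sentences; a real corpus would be far larger.
corpus = [
    "ich gehe nach hause",
    "ich gehe nach berlin",
    "ich gehe jetzt",
]
model = train_bigrams(corpus)
print(next_word_probability(model, "gehe", "nach"))
```

A plain word list has no word order, so counts like these cannot be derived from it.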

matzgewinn commented 4 years ago

OK, understood. Now I have found four files here: https://wortschatz.uni-leipzig.de/de/download They are from articles, the web, and Wikipedia. The files are 55 MB, 118 MB, 114 MB and 131 MB. Unfortunately the lines in the files are numbered, so there is a number before each sentence.

rinigus commented 4 years ago

It should be simple to write a script that removes the line numbers and then process the result with Presage. So, that's already good.
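A minimal sketch of such a script, assuming each line has the form number, whitespace, sentence (the exact layout of the downloaded files is an assumption, as are the function names and file paths). Only the leading number is removed, so digits inside a sentence stay intact.

```python
import re

# Assumption: every line starts with a running number followed by
# whitespace (for example a tab), and the sentence comes after it.
LEADING_NUMBER = re.compile(r"^\s*\d+\s+")


def strip_line_number(line):
    """Remove only the number at the start of the line."""
    return LEADING_NUMBER.sub("", line, count=1)


def clean_corpus(src_path, dst_path):
    """Copy src_path to dst_path with the leading numbers removed."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # Stream line by line so large corpus files are not loaded into memory.
        for line in src:
            dst.write(strip_line_number(line))
```

Usage would look like `clean_corpus("sentences.txt", "corpus.txt")`, where both file names are placeholders.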

matzgewinn commented 4 years ago

Maybe I could simply delete all digit characters in the files - but that would probably corrupt some of the sentences? I don't think I have the skills to write a script :-/ Anyway - should I upload the files here, or what is the further procedure? Best regards!

rinigus commented 4 years ago

Thanks for finding the corpus files!

No, you cannot upload files over here. The best you could do is to make a list with specific links for the files we are expected to import.

As you don't know how to write a script, we'll have to wait until either I or someone else has time to look into it. I can't promise any specific time on my side, but if nobody volunteers, I'll look into it.

Let's just use this issue to coordinate the effort and warn others if someone starts working on it.

So, please make a list of direct links to the files we need to download to get the texts. If they use that format with a number in front of each sentence, so be it. We will take that into account.

matzgewinn commented 4 years ago

Thanks for your patience! I could at least delete the numbers (using the "sed" command); I hope the files are usable for you. Here is the link, it is valid for seven days from now: https://send.firefox.com/download/f9cfb13f4adb6d6a/#X7qFjSYId7PpfVkRcnmWGw

Let me know if there is something else I can do. I know it's a little hard for someone experienced like you to handle "unprofessionals" like me, but I would really like to learn and contribute - so sorry for any inconvenience.

rinigus commented 4 years ago

I am sorry, I completely forgot about it. Would you mind resending it?

matzgewinn commented 4 years ago

Made one file out of it and deleted all numbers. Hope you can use it. Best regards! https://send.firefox.com/download/1ef75eda68a7e53d/#oZpe7VFklKPXPdtJs52y1w

rinigus commented 4 years ago

Hmm, strange - link has expired again.

matzgewinn commented 4 years ago

OK, next try. If this fails too, I will use a different service :-) https://send.firefox.com/download/4bf05f4b2c3e44e8/#NKsjqPx2kLK0R51xFo-S5w

rinigus commented 4 years ago

I have just pushed German packages to OpenRepos. Please test and report back.

matzgewinn commented 4 years ago

Thank you so much. Tested it for "normal" purposes like mails, SMS, notes and so on, and it's working just fine, I think. A friend of mine who is used to Android's text prediction was totally OK with it, too! So let's hear what other people say...

rinigus commented 4 years ago

Excellent! I am closing it here. Please feel free to comment later in this closed issue or open a new one if needed.

matzgewinn commented 4 years ago

Thank you again for your excellent work - if I can buy you a cup of coffee, let me know.

rinigus commented 4 years ago

No worries, the main thing was to refresh my memory on how it was done. If you wish to donate, there is a link on GitHub that is easy to find by searching for "donate rinigus", as far as I can see. But please feel free to skip it ...

Glad it worked out rather easily.