sailfish-keyboard / presage

Fork of Presage (http://presage.sourceforge.net/)
GNU General Public License v2.0

Using text2ngram with huge corpus files #24

Open maidis opened 6 years ago

maidis commented 6 years ago

I created a 2.7 GB corpus file for Turkish, but it seems text2ngram can't handle such a big file. Could the program be optimized to work with large files?

On my system [1] the second iteration never finishes:

for i in 1 2 3; do text2ngram -n $i -l -f sqlite -o database_aa.db mytext.filtered; done
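The usual reason a counting tool dies on a multi-GB corpus is that it keeps the entire n-gram table in RAM before writing it out. One workaround is to flush partial counts to the database whenever the in-memory table grows past a cap, merging them with an upsert. The sketch below is not text2ngram itself; the table name and schema (`_1_gram(word, count)`) are assumptions loosely modeled on a presage-style sqlite database, and the flush threshold is arbitrary.

```python
# Sketch: memory-bounded unigram counting into sqlite.
# Assumptions (not text2ngram's real behavior): schema _1_gram(word, count),
# whitespace tokenization, lowercasing. FLUSH_AT caps distinct words in RAM.
import sqlite3
from collections import Counter

FLUSH_AT = 500_000  # flush to disk once this many distinct words are in RAM

def _flush(con, counts):
    # Merge partial counts into the table instead of holding them in RAM.
    con.executemany(
        "INSERT INTO _1_gram(word, count) VALUES(?, ?) "
        "ON CONFLICT(word) DO UPDATE SET count = count + excluded.count",
        counts.items())
    con.commit()

def count_unigrams(corpus_path, db_path):
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS _1_gram (word TEXT PRIMARY KEY, count INTEGER)")
    counts = Counter()
    with open(corpus_path, encoding="utf-8", errors="replace") as f:
        for line in f:  # stream line by line; never load the whole file
            counts.update(line.lower().split())
            if len(counts) >= FLUSH_AT:
                _flush(con, counts)
                counts.clear()
    _flush(con, counts)  # write out whatever remains
    con.close()
```

Peak memory is then bounded by FLUSH_AT rather than by corpus size, at the cost of extra database writes; the same flush-and-upsert pattern extends to 2- and 3-gram tables.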

By the way, thanks for the open-source alternative to XT9 and for the good documentation on how to use it :) I've already started testing it with a small corpus [2].

[1] 5950HQ + 16 GB RAM [2] https://pbs.twimg.com/media/DY_ftChXUAAQP3t.jpg:large

rinigus commented 6 years ago

We can take a look into it, but for now I would suggest using as large a corpus as your system can handle.

In general, NLP does require a lot of RAM, and we may end up not being able to reduce the memory requirements.