Open nemobis opened 2 years ago
Interesting @nemobis! Have you compared the results of each approach at all?
I tried, but I was doing Italian (which is more difficult), and the input I used was too dirty for the simplistic BigramCollocationFinder above, so I cut my losses and threw everything away. :) The suggestions weren't outrageously bad, but it may need some tweaking of the tokenization.
Perhaps it's overkill to change the ranking method, but when I tested this on a 180 MB text file (probably not a good idea anyway), it was still not done after some 20 hours of CPU time.
For comparison, an off-the-shelf `BigramCollocationFinder.from_words(tokens).nbest(BigramAssocMeasures().pmi, 10000)` takes about 5 minutes on the same machine and corpus, and it's probably easy to do better. (I cobbled together an example at https://framagit.org/nemobis/bots/-/blob/master/ngram_tlds.py .)
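For anyone curious what that NLTK one-liner actually computes, here is a minimal pure-Python sketch of PMI ranking over adjacent bigrams. The helper name `pmi_bigrams` and its `min_count` parameter are my own invention for illustration, not part of NLTK; in NLTK the equivalent of `min_count` is `finder.apply_freq_filter(n)`, which matters because plain PMI wildly overrates bigrams seen only once.

```python
from collections import Counter
from math import log2

def pmi_bigrams(tokens, top_n=10, min_count=1):
    """Rank adjacent-token bigrams by pointwise mutual information:
    PMI(a, b) = log2(P(a, b) / (P(a) * P(b)))."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scored = [
        ((a, b), log2((c / n) / ((unigrams[a] / n) * (unigrams[b] / n))))
        for (a, b), c in bigrams.items()
        if c >= min_count  # drop rare pairs, which PMI overrates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [bigram for bigram, _ in scored[:top_n]]

tokens = "new york is a big city and new york never sleeps".split()
print(pmi_bigrams(tokens, min_count=2))  # → [('new', 'york')]
```

This is a toy, of course: it makes one pass with a couple of Counters, which is why the off-the-shelf approach finishes in minutes rather than hours on the same corpus.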