thisisparker / public_domains

Find possible host names in a source text
Creative Commons Zero v1.0 Universal

Consider improving performance #2

Open nemobis opened 2 years ago

nemobis commented 2 years ago

Changing the ranking method may be overkill, but I tested this on a 180 MB text file (probably not a good idea anyway) and it still had not finished after about 20 hours of CPU time.

For comparison, NLTK's off-the-shelf BigramCollocationFinder.from_words(tokens).nbest(BigramAssocMeasures().pmi, 10000) takes about 5 minutes on the same machine and corpus, and it's probably easy to do better. (I cobbled together an example at https://framagit.org/nemobis/bots/-/blob/master/ngram_tlds.py .)
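
To make the PMI ranking mentioned above concrete, here is a minimal standard-library sketch of scoring adjacent token pairs by pointwise mutual information in a single counting pass. It is not this project's ranking method, nor NLTK's implementation, whose exact counting details may differ. The function name top_pmi_bigrams, the min_count parameter, and the sample tokens are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def top_pmi_bigrams(tokens, n=10, min_count=1):
    """Rank adjacent token pairs by pointwise mutual information (PMI).

    Hypothetical helper for illustration; not part of public_domains or NLTK.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scored = []
    for (left, right), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs get inflated PMI scores, so allow filtering
        # PMI = log2( P(left, right) / (P(left) * P(right)) ), with
        # probabilities estimated from raw counts over `total` tokens.
        pmi = math.log2(count * total / (unigrams[left] * unigrams[right]))
        scored.append(((left, right), pmi))
    # Highest PMI first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return [pair for pair, _ in scored[:n]]


# Made-up sample input, whitespace-tokenized.
tokens = "visit example dot com or example dot org or visit us".split()
print(top_pmi_bigrams(tokens, n=3, min_count=2))
```

Counting takes one pass over the tokens and sorting is proportional to the number of distinct bigrams, which is consistent with the minutes-scale runtime reported for the NLTK one-liner. Real input would need a proper tokenizer rather than str.split().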

edsu commented 1 year ago

Interesting @nemobis! Have you compared the results of each approach at all?

nemobis commented 1 year ago

I tried, but I was working with Italian text (more difficult), and the input I used was too dirty for the simplistic BigramCollocationFinder approach above, so I cut my losses and threw everything away. :) The suggestions weren't outrageously bad, but the tokenization may need some tweaking.