rspeer / wordfreq

Access a database of word frequencies, in various natural languages.

Is there a way to use custom word lists? #52

Closed: HatScripts closed this issue 6 years ago

HatScripts commented 6 years ago

Is there a way to use custom word lists? Say if I wanted to know the frequency of the word "whale" in the text of Moby Dick. I would have thought such a task would be within the scope of this library, yet I can't find anything in the documentation about it.

I realise that I could use the tokenize method combined with something like collections.Counter, but that would seem to somewhat defeat the purpose.

I've tried the following but to no avail:

from pathlib import Path

from wordfreq import tokenize, word_frequency

# Attempt 1: pass the tokenized text as the wordlist
with Path("moby_dick.txt").open() as f:
    moby_dick = f.read()
    tokenized = tokenize(moby_dick, "en")
    whale_freq = word_frequency("whale", "en", wordlist=tokenized)
    print("whale_freq:", whale_freq)

# Attempt 2: pass the raw text as the wordlist
with Path("moby_dick.txt").open() as f:
    moby_dick = f.read()
    whale_freq = word_frequency("whale", "en", wordlist=moby_dick)
    print("whale_freq:", whale_freq)

rspeer commented 6 years ago

I think you should use tokenize combined with collections.Counter. That's essentially what we do in the first place in the package that generates these word lists, https://github.com/LuminosoInsight/exquisite-corpus. wordfreq is just for making the results easily available.
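
For anyone landing here later, the suggestion above can be sketched roughly as follows. This is only a minimal illustration, not part of wordfreq: the helper name `corpus_word_frequency` and its `tokenizer` parameter are made up for this example. It counts tokens with `collections.Counter` and returns the share of tokens equal to the query word, which should be roughly comparable to the proportion that `word_frequency` reports for the built-in lists.

```python
from collections import Counter
from typing import Callable, List, Optional


def corpus_word_frequency(
    word: str,
    text: str,
    lang: str = "en",
    tokenizer: Optional[Callable[[str], List[str]]] = None,
) -> float:
    """Return the share of tokens in `text` equal to `word` (0.0 to 1.0).

    Hypothetical helper for illustration; it is not a wordfreq function.
    """
    if tokenizer is None:
        # Default to wordfreq's own tokenizer. Imported lazily so that a
        # custom tokenizer can be supplied without wordfreq installed.
        from wordfreq import tokenize

        def tok(s: str) -> List[str]:
            return tokenize(s, lang)
    else:
        tok = tokenizer

    # Count every token in the custom corpus.
    counts = Counter(tok(text))
    total = sum(counts.values())
    if total == 0:
        return 0.0

    # Normalize the query with the same tokenizer so that casing and
    # punctuation are handled the same way as in the corpus.
    query = tok(word)
    if len(query) != 1:
        raise ValueError(f"expected a single token, got {query!r}")

    return counts[query[0]] / total
```

With wordfreq installed, reading the book via `Path("moby_dick.txt").read_text()` and calling `corpus_word_frequency("whale", moby_dick)` would give the relative frequency of "whale" in that text. Passing a `tokenizer` callable is just a convenience assumed here for swapping in a different tokenization scheme.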