English words containing this character "’" are not in the data base

rspeer / wordfreq

Access a database of word frequencies, in various natural languages.

English words containing this character "’" are not in the data base #94

Closed theRealProHacker closed 3 years ago

theRealProHacker commented 3 years ago

I tokenized an English text that contained contractions like it’ll or you’ve and then looked up the word frequency of each token. However, for these contractions, the zipf_frequency() function returned a frequency of zero.

Is this a problem with the character, the tokenizer or the data?

Windows 10, Python 3.9, wordfreq 2.5.0

rspeer commented 3 years ago

You're right. The word frequency database only uses straight quotes ('), so curly quotes (’) appear with zero frequency. I hadn't noticed this because I was running wordfreq on text that had passed through ftfy.

It seems reasonable for the lossy_tokenize function to uncurl quotes as one of its lossy steps.
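
For anyone on an older version, a minimal sketch of uncurling apostrophes before looking tokens up might look like the following. The helper name `uncurl_quotes` and the mapping table are illustrative assumptions, not wordfreq's actual implementation, and the sketch only covers the two curly single quotes.

```python
# Hypothetical helper for illustration; not taken from wordfreq's source.
# Maps the curly single quotes to the straight apostrophe that the
# frequency data uses.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (the usual curly apostrophe)
})


def uncurl_quotes(text: str) -> str:
    """Return text with curly single quotes replaced by straight ones."""
    return text.translate(CURLY_TO_STRAIGHT)


tokens = ["it\u2019ll", "you\u2019ve", "hello"]
normalized = [uncurl_quotes(token) for token in tokens]
print(normalized)  # ["it'll", "you've", 'hello']
```

The normalized tokens can then be passed to the frequency lookup in place of the originals.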

rspeer commented 3 years ago

Fixed in v2.5.1.