Closed theRealProHacker closed 3 years ago
You're right. The word frequency database only uses straight quotes ('), so curly quotes (’) appear with zero frequency. I hadn't noticed this because I was running wordfreq on text that had passed through ftfy.
It seems reasonable that one of the lossy things the lossy_tokenize function should do is uncurl quotes.
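For anyone still on 2.5.0, uncurling can be done by hand before the lookup. This is only a rough sketch of the idea, not the code that went into wordfreq: the helper name uncurl_quotes and the mapping table are made up for illustration, and the table only covers the four most common curly quote characters.

```python
# Hypothetical helper for illustration; not wordfreq's actual implementation.
# Maps the common curly quotation marks to their straight ASCII equivalents.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK (the usual curly apostrophe)
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
})


def uncurl_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight quotes."""
    return text.translate(CURLY_TO_STRAIGHT)


if __name__ == "__main__":
    for token in ["it\u2019ll", "you\u2019ve"]:
        print(token, "->", uncurl_quotes(token))
```

The normalized token can then be passed to the frequency lookup instead of the original one.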
Fixed in v2.5.1.
I tokenized an English text that contained short forms like it’ll or you’ve and then got the word frequency for each token. However, for these short forms, the zipf_freq() function gave me a frequency of zero. Is this a problem with the character, the tokenizer or the data?
Windows 10, Python 3.9, wordfreq 2.5.0
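To check whether the character is the culprit, it can help to print the code points of the non-alphanumeric characters in a token. A small sketch using only the standard library; describe_chars is a made-up name, not part of wordfreq.

```python
import unicodedata


def describe_chars(token: str) -> list[tuple[str, str, str]]:
    """List (char, code point, Unicode name) for each non-alphanumeric character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in token
        if not ch.isalnum()
    ]


if __name__ == "__main__":
    print(describe_chars("it\u2019ll"))  # token with a curly apostrophe
    print(describe_chars("it'll"))       # token with a straight apostrophe
```

A token typed with a curly apostrophe reports U+2019 (RIGHT SINGLE QUOTATION MARK), while the straight form reports U+0027 (APOSTROPHE).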