Open mkmoisen opened 6 years ago
Really interesting @mkmoisen. Is this related to your PR #51 ?
I'll try and dig into this to confirm your calculations. It does sound right. I'll get back soon here.
Hi @nieldlr ,
One thing worth mentioning is that my calculation here only takes simplified characters into account. I noticed from the source code that you are combining traditional and simplified, so you may get somewhat different results.
I do not believe PR #51 is related, since I could still replicate the above error after attempting to simulate the "boundary issue" in the above code.
It seems that there are several characters over at HanziCraft that do not have any High Frequency words, but only Medium Frequency words.
For example, the page for 幌 states that 幌子 and 札幌 are both Medium Frequency words, not High Frequency words.
The source code that determines frequencies defines a High Frequency word for a character as a word whose frequency is more than one standard deviation above the mean frequency of all words that share this character.
I've calculated the mean and standard deviation of the frequencies of words in Weibo containing 幌 to be 7.8 and 24.8, respectively, which puts the High Frequency threshold at 7.8 + 24.8 = 32.6. 幌子 and 札幌, however, have frequencies of 114 and 44, respectively, and should thus both be considered High Frequency words for the 幌 character.
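To make the rule concrete, here is a minimal sketch of the check as I understand it. The function names are my own and hypothetical, not HanziCraft's, and I am assuming the threshold is the mean plus one population standard deviation; the actual source may use the sample standard deviation or a different comparison. The mean, standard deviation, and word frequencies are the figures quoted above.

```python
from statistics import mean, pstdev


def frequency_stats(counts):
    """Return (mean, population standard deviation) of a list of word counts.

    Assumption: population std (pstdev). The real code may differ.
    """
    return mean(counts), pstdev(counts)


def is_high_frequency(count, mean_freq, std_freq):
    """True if count is more than one standard deviation above the mean."""
    return count > mean_freq + std_freq


# Figures quoted above for words containing 幌 in the Weibo corpus.
MEAN_FREQ, STD_FREQ = 7.8, 24.8

for word, count in (("幌子", 114), ("札幌", 44)):
    print(word, count, is_high_frequency(count, MEAN_FREQ, STD_FREQ))
```

With these figures both words come out above the threshold, which is why I would expect them to be listed as High Frequency rather than Medium Frequency.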
Would you please take a look into my calculations below to see if this makes sense or if it is an incorrect conclusion?
I performed the following in Python 3.6, using the LWC-words/words_types.txt file downloaded from the Weibo corpus open access page here:
Which outputs the following:
Thanks and best regards,
Matthew Moisen
PS HanziCraft is awesome!