nieldlr / hanzi

HanziJS is a Chinese character and NLP module for Chinese language processing for Node.js
http://hanzijs.com
MIT License

Check for simplified version first while calculating frequency #26

Closed: raylillywhite closed this 7 years ago

raylillywhite commented 7 years ago

I don't know if this will be desired behavior for everyone, since I've only recently started learning Chinese, but I noticed that some of the characters I've been learning were rated as suspiciously infrequent, given that they are among the first ~600 characters I've encountered in learning materials. I realized that this might be because I'm learning traditional characters. For example, 認識 (rèn shi) shows frequency ranks of 5792 and 6345 out of 9933, but if I convert them to the simplified characters 认识, the ranks are 213 and 340 respectively.

It seems like there should ultimately be separate frequency lists for traditional and simplified characters, but I assumed that would be a much more difficult task, and I thought that simply changing the priority of how traditional characters are looked up might be an improvement. This PR first checks whether a character has a simplified version and looks up the frequency of that version first, instead of only falling back to the simplified version when the direct lookup fails. This assumes that traditional characters with simplified versions are not regularly used in simplified texts. If that assumption is wrong, another simple solution might be to let hanzicraft.com users choose a character set preference and change the priority here based on that, or to show both frequencies on hanzicraft.com when a simplified version of a character exists.
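To make the change in lookup priority concrete, here is a minimal sketch of the two strategies. It is not the actual HanziJS code: the function names, the `frequencyRank` table, and the `toSimplified` mapping are hypothetical, and the only real data in it are the four ranks quoted above.

```javascript
// Hypothetical frequency table (character -> rank, lower means more frequent).
// The ranks are the ones quoted in this PR; the real HanziJS data layout may differ.
const frequencyRank = new Map([
  ['认', 213],
  ['识', 340],
  ['認', 5792],
  ['識', 6345],
]);

// Hypothetical traditional -> simplified mapping for the example characters.
const toSimplified = new Map([
  ['認', '认'],
  ['識', '识'],
]);

// Previous behavior (as described): look up the character itself and
// only fall back to its simplified form when it is missing from the table.
function rankWithFallback(ch) {
  if (frequencyRank.has(ch)) return frequencyRank.get(ch);
  const simplified = toSimplified.get(ch);
  return simplified !== undefined ? frequencyRank.get(simplified) : undefined;
}

// Proposed behavior: prefer the simplified form when one exists,
// and look up the character itself only otherwise.
function rankSimplifiedFirst(ch) {
  const simplified = toSimplified.get(ch);
  if (simplified !== undefined && frequencyRank.has(simplified)) {
    return frequencyRank.get(simplified);
  }
  return frequencyRank.get(ch);
}
```

With this sketch, `rankWithFallback('認')` returns the traditional character's own rank, while `rankSimplifiedFirst('認')` returns the rank of 认. A character set preference, as suggested above, could be modeled by choosing between the two functions.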

nieldlr commented 7 years ago

Heya @raylillywhite,

so sorry for missing this PR. This is a great addition! You're completely right in your thinking here. The frequency list data I have is based on simplified characters, so it should definitely be looking at the simplified version first when determining the frequency.

I'm going to pull this down and add an additional test case for catching this and then merge + publish!

Thanks again for this. This is great!

nieldlr commented 7 years ago

Continued here: https://github.com/nieldlr/Hanzi/pull/28