Closed garfieldnate closed 3 years ago
Hi @garfieldnate,
sorry for the late reply. Totally missed this. The phonetic regularity data was all produced by me using Hanzi itself. To answer more specifically:
The difference between phonetic regularity one and two is the degree of similarity. Check this page for a more detailed explanation. But basically, one means exactly the same pronunciation, including tone, and two means the same syllable but a different tone.
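The two rules above can be sketched in a few lines. This is an illustrative sketch only, not hanzi's actual code, and it assumes tone-numbered pinyin strings (e.g. `"yang2"`):

```javascript
// Classify how a character's reading matches its phonetic component's reading.
// Regularity 1: identical pronunciation, tone included.
// Regularity 2: same syllable, different tone.
// 0: neither rule applies.
function classifyRegularity(charPinyin, componentPinyin) {
  if (charPinyin === componentPinyin) return 1;
  const stripTone = (p) => p.replace(/[1-5]$/, "");
  if (stripTone(charPinyin) === stripTone(componentPinyin)) return 2;
  return 0;
}

classifyRegularity("yang2", "yang2"); // 1 — e.g. 洋 yáng vs. its phonetic 羊 yáng
classifyRegularity("yang4", "yang2"); // 2 — e.g. 样 yàng vs. 羊 yáng
classifyRegularity("mu4", "yang2");   // 0 — no match
```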
This one is interesting. Over time I noticed that some components weren't in the cedict dictionary (because they're not words on their own), so I couldn't gather their pinyin/pronunciation from it. So I slowly started filling in this data whenever I ran across missing entries I needed. You'll see this file is still incomplete. I merely looked these up in more complete dictionaries online or looked at their unbound component forms. Let me know if you have more questions. It's been a while since I touched that file.
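The fallback idea described above could look something like this. The table names and data shapes here are assumptions for illustration, not hanzi's actual internals:

```javascript
// Stand-ins for the real data: the main CEDICT-derived dictionary, plus a
// hand-curated override table for components that aren't standalone words.
const cedict = { "羊": "yang2" };
const irregularPhonetics = { "⺶": "yang2" };

// Look up a component's pinyin: main dictionary first, then the curated file.
function getComponentPinyin(component) {
  if (cedict[component]) return cedict[component];
  if (irregularPhonetics[component]) return irregularPhonetics[component];
  return null; // still missing — the curated file is incomplete
}
```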
Your hunch is correct. It does not load the phonetic regularity datasets. Those were merely generated data anyone can use/apply where they need to. The determinePhoneticRegularity
function computes the regularity on the fly using the dictionary files & the character decomposition. Here is where the files get loaded: https://github.com/nieldlr/hanzi/blob/8b80e6d85130c9412117dbc41802cf6add3a74a3/lib/data/index.js
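A minimal self-contained sketch of that on-the-fly computation: look up the character's reading, decompose it, and rate each component. The tiny tables below are stand-ins for the real dictionary and decomposition data, and the result shape is an assumption, not hanzi's actual return format:

```javascript
// Stand-in data for illustration only.
const pinyinOf = { "样": "yang4", "羊": "yang2", "木": "mu4" };
const decomposition = { "样": ["木", "羊"] };

// Rate each component of `character` against the character's own reading.
function determineRegularity(character) {
  const reading = pinyinOf[character];
  const stripTone = (p) => p.replace(/[1-5]$/, "");
  return (decomposition[character] || []).map((component) => {
    const compReading = pinyinOf[component];
    let regularity = 0;
    if (compReading === reading) regularity = 1; // exact match, tone included
    else if (compReading && stripTone(compReading) === stripTone(reading)) {
      regularity = 2; // same syllable, different tone
    }
    return { component, regularity };
  });
}

determineRegularity("样");
// [ { component: "木", regularity: 0 }, { component: "羊", regularity: 2 } ]
```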
I haven't seen anyone use the phonetic regularity functions in the wild, as it's a very specific use case. Let me know if you have any more questions!
Thanks! This clears it all up pretty well! I wasn't expecting the data to be manually curated.
I don't want to disappoint you, but I ended up not using hanzi
for my application 😅 . I instead group characters by their original phonetic component (using data from ytenx) and then classify the groups by regularity, following Heisig's Remembering the Kanji, Volume II (the classification code is over here but will probably move at some point). In the end this seemed the most consistent (I tried some very messy things before settling on this!).
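The grouping approach described above could be sketched roughly like this. The entry format is a made-up stand-in (the actual ytenx data and the author's classification code differ), and the "pure"/"mixed" labels are a simplification of Heisig's categories:

```javascript
// Hypothetical input: characters annotated with their phonetic component.
const entries = [
  { char: "洋", phonetic: "羊", pinyin: "yang2" },
  { char: "样", phonetic: "羊", pinyin: "yang4" },
  { char: "氧", phonetic: "羊", pinyin: "yang3" },
];

// Bucket characters by their phonetic component.
function groupByPhonetic(list) {
  const groups = new Map();
  for (const e of list) {
    if (!groups.has(e.phonetic)) groups.set(e.phonetic, []);
    groups.get(e.phonetic).push(e);
  }
  return groups;
}

// Label a group by how uniform its readings are, ignoring tone.
function classifyGroup(group) {
  const stripTone = (p) => p.replace(/[1-5]$/, "");
  const syllables = new Set(group.map((e) => stripTone(e.pinyin)));
  return syllables.size === 1 ? "pure" : "mixed";
}
```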
@garfieldnate heh, no worries. All good. Glad you ended up finding a solution!
Hi,
Thanks for the wonderful project! I'm looking to replicate the phonetic regularity calculations for sinoxenic languages, but there isn't any documentation on how the data was created for this project.
1. What is the difference between `phonetic_regularity_one` and `phonetic_regularity_two`, and how were these datasets created? `getPhoneticSet` in `dictionary.js` differentiates between them, but I don't see this method being used anywhere.
2. How was `irregularphonetics.txt.js` created?
3. How does `determinePhoneticRegularity()` work? I don't see references to the phonetic regularity datasets, and `getPinyin()` uses `dictionarytraditional` and `dictionarySimplified`, neither of which I can find the loading code for. I'm guessing that a general dictionary is used, and the returned data indicates how well the character's pronunciations match with those of each of its component characters?