Closed garfieldnate closed 3 years ago
Hi @garfieldnate,
sorry for the late reply. Totally missed this. The phonetic regularity data was all produced by me using Hanzi itself. To answer more specifically:
The difference between phonetic regularity one and two is the degree of similarity. Check this page for a more detailed explanation. But basically, one means exactly the same pronunciation, including tone, and two means the same syllable but a different tone.
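The two rules above can be sketched in a few lines. This is an illustrative sketch only, not hanzi's actual code, and it assumes tone-numbered pinyin strings (e.g. `"yang2"`):

```javascript
// Classify how a character's reading matches its phonetic component's reading.
// Regularity 1: identical pronunciation, tone included.
// Regularity 2: same syllable, different tone.
// 0: neither rule applies.
function classifyRegularity(charPinyin, componentPinyin) {
  if (charPinyin === componentPinyin) return 1;
  const stripTone = (p) => p.replace(/[1-5]$/, "");
  if (stripTone(charPinyin) === stripTone(componentPinyin)) return 2;
  return 0;
}

classifyRegularity("yang2", "yang2"); // 1 — e.g. 洋 yáng vs. its phonetic 羊 yáng
classifyRegularity("yang4", "yang2"); // 2 — e.g. 样 yàng vs. 羊 yáng
classifyRegularity("mu4", "yang2");   // 0 — no match
```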
This one is interesting. Over time I noticed that some components weren't in the cedict dictionary (because they're not words on their own), so I couldn't gather their pinyin/pronunciation from it. So I slowly started filling in this data whenever I ran across missing entries I needed. You'll see this file is still incomplete. I merely looked these up in more complete dictionaries online or looked at their unbound component forms. Let me know if you have more questions. It's been a while since I touched that file.
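The fallback idea described above could look something like this. The table names and data shapes here are assumptions for illustration, not hanzi's actual internals:

```javascript
// Stand-ins for the real data: the main CEDICT-derived dictionary, plus a
// hand-curated override table for components that aren't standalone words.
const cedict = { "羊": "yang2" };
const irregularPhonetics = { "⺶": "yang2" };

// Look up a component's pinyin: main dictionary first, then the curated file.
function getComponentPinyin(component) {
  if (cedict[component]) return cedict[component];
  if (irregularPhonetics[component]) return irregularPhonetics[component];
  return null; // still missing — the curated file is incomplete
}
```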
Your hunch is correct. It does not load the phonetic regularity datasets. Those were merely generated data anyone can use/apply where they need to. The determinePhoneticRegularity
function computes the regularity on the fly using the dictionary files & the character decomposition. Here is where the files get loaded: https://github.com/nieldlr/hanzi/blob/8b80e6d85130c9412117dbc41802cf6add3a74a3/lib/data/index.js
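A minimal self-contained sketch of that on-the-fly computation: look up the character's reading, decompose it, and rate each component. The tiny tables below are stand-ins for the real dictionary and decomposition data, and the result shape is an assumption, not hanzi's actual return format:

```javascript
// Stand-in data for illustration only.
const pinyinOf = { "样": "yang4", "羊": "yang2", "木": "mu4" };
const decomposition = { "样": ["木", "羊"] };

// Rate each component of `character` against the character's own reading.
function determineRegularity(character) {
  const reading = pinyinOf[character];
  const stripTone = (p) => p.replace(/[1-5]$/, "");
  return (decomposition[character] || []).map((component) => {
    const compReading = pinyinOf[component];
    let regularity = 0;
    if (compReading === reading) regularity = 1; // exact match, tone included
    else if (compReading && stripTone(compReading) === stripTone(reading)) {
      regularity = 2; // same syllable, different tone
    }
    return { component, regularity };
  });
}

determineRegularity("样");
// [ { component: "木", regularity: 0 }, { component: "羊", regularity: 2 } ]
```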
I haven't seen anyone use the phonetic regularity functions in the wild, as it's a very specific use case. Let me know if you have any more questions!
Thanks! This clears it all up pretty well! I wasn't expecting the data to be manually curated.
I don't want to disappoint you, but I ended up not using hanzi
for my application 😅 . I instead group characters by their original phonetic component (using data from ytenx) and then classify the groups by regularity, following Heisig's Remembering the Kanji, Volume II (the classification code is over here but will probably move at some point). In the end this seemed the most consistent (I tried some very messy things before settling on this!).
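The grouping approach described above could be sketched roughly like this. The entry format is a made-up stand-in (the actual ytenx data and the author's classification code differ), and the "pure"/"mixed" labels are a simplification of Heisig's categories:

```javascript
// Hypothetical input: characters annotated with their phonetic component.
const entries = [
  { char: "洋", phonetic: "羊", pinyin: "yang2" },
  { char: "样", phonetic: "羊", pinyin: "yang4" },
  { char: "氧", phonetic: "羊", pinyin: "yang3" },
];

// Bucket characters by their phonetic component.
function groupByPhonetic(list) {
  const groups = new Map();
  for (const e of list) {
    if (!groups.has(e.phonetic)) groups.set(e.phonetic, []);
    groups.get(e.phonetic).push(e);
  }
  return groups;
}

// Label a group by how uniform its readings are, ignoring tone.
function classifyGroup(group) {
  const stripTone = (p) => p.replace(/[1-5]$/, "");
  const syllables = new Set(group.map((e) => stripTone(e.pinyin)));
  return syllables.size === 1 ? "pure" : "mixed";
}
```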
@garfieldnate heh, no worries. All good. Glad you ended up finding a solution!
Hi,
Thanks for the wonderful project! I'm looking to replicate the phonetic regularity calculations for sinoxenic languages, but there isn't any documentation on how the data was created for this project.
1. What is the difference between `phonetic_regularity_one` and `phonetic_regularity_two`, and how were these datasets created? `getPhoneticSet` in `dictionary.js` differentiates between them, but I don't see this method being used anywhere.
2. How was `irregularphonetics.txt.js` created?
3. How does `determinePhoneticRegularity()` work? I don't see references to the phonetic regularity datasets, and `getPinyin()` uses `dictionarytraditional` and `dictionarySimplified`, neither of which I can find the loading code for. I'm guessing that a general dictionary is used, and the returned data indicates how well the character's pronunciations match with those of each of its component characters?