kakaobrain / g2pm

A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset
Apache License 2.0

What is the special pinyin "xx5" used for? #4

Open JohnHerry opened 4 years ago

JohnHerry commented 4 years ago

Hi all, thanks for the good work. I found a special pinyin "xx5" in class2idx, but no sample in the corpus is labelled with it. What is this pinyin class used for? Is there anything special about it?

seanie12 commented 4 years ago

Hi, class2idx is a dictionary that maps each pinyin to its own id, so the id corresponds to an index in the softmax layer. There are two reasons why there is no label "xx5": 1) There is no polyphonic character whose pinyin is xx5. 2) Our dataset does not cover all possible Chinese polyphonic characters. We collected Chinese sentences from Wikipedia and labelled them, so some polyphonic characters are missing from our data.
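To make the mapping concrete, here is a minimal sketch of how a class with no training labels can still occupy a softmax index. The toy `class2idx` contents, the `train_labels` list, and the helper `unused_classes` are all hypothetical and only for illustration; the real dictionary in the package is much larger.

```python
from collections import Counter

# Hypothetical toy mapping (assumption): pinyin class -> softmax index.
class2idx = {"de5": 0, "di4": 1, "xx5": 2, "zhong1": 3, "zhong4": 4}

# Hypothetical labels (assumption) standing in for the labelled corpus.
train_labels = ["de5", "zhong1", "zhong4", "de5", "di4"]


def unused_classes(class2idx, labels):
    """Return the classes that have an index but never occur as a label."""
    seen = Counter(labels)
    return sorted(c for c in class2idx if seen[c] == 0)


# Inverse mapping: softmax index -> pinyin class.
idx2class = {i: c for c, i in class2idx.items()}

unused = unused_classes(class2idx, train_labels)
print(unused)
```

A class like this still takes up one output unit, but it never receives a positive training signal, so a trained model should rarely predict it.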

JohnHerry commented 4 years ago

Then that may be an error from human labelling. There is no Chinese character mapped to the pinyin xx5.

Another question: why does the paper balance the training data by polyphone instead of by polychar? I think the latter is also important. We have seen many bad cases of mispredicted pinyin, and we found that the samples for a given polychar in CPP are not balanced across its pinyins. That may be natural for Wikipedia text if the paper did not filter out some samples: those longer than 50, those shorter than 5, those labelled differently by the two annotators, etc.
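The imbalance described above can be checked with a short script. This is only a sketch under stated assumptions: the `(character, pinyin)` sample format, the function names, and the reading of the length limits as character counts are all hypothetical and not taken from the CPP release or the paper's code.

```python
from collections import Counter, defaultdict


def keep_sentence(sentence, label_a, label_b, min_len=5, max_len=50):
    """Filter mentioned in the thread: drop sentences longer than 50,
    shorter than 5, or labelled differently by the two annotators.
    Measuring length in characters is an assumption."""
    return min_len <= len(sentence) <= max_len and label_a == label_b


def pinyin_distribution(samples):
    """Count how often each pinyin is observed for each polyphonic character."""
    counts = defaultdict(Counter)
    for char, pinyin in samples:
        counts[char][pinyin] += 1
    return counts


def imbalance_ratio(counter):
    """Ratio of the most frequent to the least frequent observed pinyin."""
    return max(counter.values()) / min(counter.values())


# Hypothetical samples (assumption): one character, two readings, skewed 9:1.
samples = [("中", "zhong1")] * 9 + [("中", "zhong4")]
dist = pinyin_distribution(samples)
ratio = imbalance_ratio(dist["中"])
print(dict(dist["中"]), ratio)
```

A high ratio for a character means the model sees one reading far more often than the others, which is one plausible source of the mispredictions mentioned above.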