Feature Request : Support pinyin->hanzi

abhi18av commented 7 years ago

Hi @pepebecker , I came across this node module after a lot of searching. Could you please tell me if you know of a tool that converts pinyin to hanzi?

I have a ton of text that requires a bit of processing and then conversion to hanzi.


Ng, wˇo yˇe sh`ı!  Wˇo j ̄ınni ́an zu`ı zh`ongy`ao de j`ıhu`a ji`ush`ı
y ̄ıd`ıng y`ao bˇa y ̄ıngyˇu xu ́e hˇao. Wˇo ju ́ed`ıng c ̄anji ̄a Y ̄ıngyˇu
fˇudˇaob ̄an, mˇeiti ̄an b`ei z`ıdiˇan, d ́u y ̄ıngyˇu b`aozhˇı. Wˇo ji`u b`u
x`ın xu ́e b`u hˇao! Y ̄ınw`ei yˇuy ́an de w`ent ́ı, wˇo yˇıj ̄ıng cu`ogu`o
le hˇao jˇıge sh ̄engzh ́ı j ̄ıhu`ı.

That is the current state of the text.

If there isn't such a tool, could you add this functionality to the module itself?

Great job with this one though 👍

ZelphirKaltstahl commented 7 years ago

@abhi18av You could use a pinyin input method to type the text in, if there is not too much of it.

The problem here is that the snippet of text you showed is not in any standard format and is even irregular at times. For example, you usually have the diacritic marks in front of the vowels, as follows:

wˇo yˇe

... but in other places you have them on top of the vowel, as they should be, at least visually:

j ̄ınni ́an

My guess is that this is text from OCR (optical character recognition). Is this correct? I also guess that such a format is unlikely to be supported, as it is so irregular and uncommon.

derhuerst commented 7 years ago

... but in other places you have them on top of the vowel, as they should be, at least visually:

j ̄ınni ́an

Note that I don't see them on top, so it might be related to fonts.
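
One way to check whether this is a font effect is to look at the code points instead of the rendered glyphs. The snippet below is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is hypothetical, and the escaped string is only an assumption about what the syllable displayed as "j ̄ın" in the sample consists of (a space, a combining macron, and a dotless i).

```python
import unicodedata


def describe(text):
    """Return a (code point, Unicode name) pair for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# Escapes are used so the exact characters are unambiguous.
# Assumption: this mirrors the syllable shown as "j ̄ın" in the sample.
for code, name in describe("j \u0304\u0131n"):
    print(code, name)
```

If a mark turns out to be a combining character attached to a space, different fonts may draw it over the space or over a neighbouring letter, which would explain why readers see different things.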

abhi18av commented 7 years ago

@ZelphirKaltstahl and @derhuerst , thanks for the suggestions and pointing out the font issue.

The pinyin input won't be a viable alternative, since I don't really wish to type in everything again. It seems some text processing is needed here.
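
As a rough illustration of what such text processing might look like, here is a minimal Python sketch. It assumes the sample encodes tones as a mark placed before the vowel: a space plus a combining macron or acute for tones 1 and 2, a spacing caron (U+02C7) for tone 3, and an ASCII backtick for tone 4, with a dotless "ı" standing in for "i". Those assumptions come from reading the pasted sample, not from any documented format, and the name `normalize_pinyin` is hypothetical.

```python
import re
import unicodedata

# Assumed mapping from the marks in the sample to Unicode combining marks.
MARKS = {
    "\u0304": "\u0304",  # combining macron (tone 1), preceded by a stray space
    "\u0301": "\u0301",  # combining acute (tone 2), preceded by a stray space
    "\u02c7": "\u030c",  # spacing caron -> combining caron (tone 3)
    "`": "\u0300",       # ASCII grave -> combining grave (tone 4)
}

# Either "space + combining mark" or a spacing mark, followed by a vowel.
PATTERN = re.compile(r"(?: ([\u0304\u0301])|([\u02c7`]))([aeiouüıAEIOUÜ])")


def normalize_pinyin(text):
    """Move each tone mark onto the vowel that follows it and compose to NFC."""
    def fix(match):
        mark = match.group(1) or match.group(2)
        vowel = "i" if match.group(3) == "ı" else match.group(3)
        return unicodedata.normalize("NFC", vowel + MARKS[mark])

    return PATTERN.sub(fix, text)


# Escaped input: the first three words of the sample as assumed above.
print(normalize_pinyin("w\u02c7o y\u02c7e sh`\u0131"))
```

The output would still need a manual check, since a real word space followed by a syllable that starts with a tone 1 or tone 2 vowel looks the same as the stray space in this encoding.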

But my question is whether it's possible to convert well-formed pinyin to hanzi at all.

It seems possible, but I've skimmed GitHub and the net and couldn't really find anything useful.

ZelphirKaltstahl commented 7 years ago

@derhuerst Ha, good catch! Interesting that a font would do that. I guess it is a similar / same effect as taking t and h and combining that into one special character.

@abhi18av Unfortunately, there is yet another problem with such text. Take the example j ̄ınni ́an: it is ambiguous. It could be jīn and nián, or it could be jīn, ni and án.

Pinyin to hanzi is difficult to do fully automatically, because one pinyin syllable may correspond to many hanzi characters. That is why, when using a pinyin input method, you usually have to select the hanzi you want by pressing a number key. Good input methods are clever about this and estimate probabilities for character combinations, so that they put the most probable hanzi first, but even that is sometimes wrong for what you want to write. In the example of 今年 (jīn nián) it would probably work well, since it is a very common word. Maybe a combined algorithm that learns from real texts, clusters them into topics, and then makes topic-specific predictions about what meaning you want to express with some pinyin would do a good job.

For the specific case of parsing your text, you could try to find all the rules for rewriting it into proper pīnyīn, write a specialized program, and then check whether the produced pinyin makes sense. Then you could put it into some input method and let it guess the characters from the pīnyīn. Finally, you would check whether the 汉字 make sense and choose alternative 汉字 where they do not.
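
The two kinds of ambiguity described above (where syllables split, and which hanzi a syllable maps to) can be sketched in Python as follows. This is only a toy illustration: the syllable list and `LEXICON` are tiny hand-picked samples, not a real dictionary, and the function names `segmentations` and `candidates` are hypothetical. A real input method would also rank candidates by probability, which this sketch does not attempt.

```python
import itertools

# Toy syllable inventory: just enough to reproduce the ambiguity above.
SYLLABLES = ["jīn", "nián", "ni", "án"]

# Toy lexicon: pinyin (with tones) -> possible hanzi. Illustrative only.
LEXICON = {
    "jīnnián": ["今年"],
    "jīn": ["今", "金", "斤"],
    "nián": ["年", "黏"],
}


def segmentations(text, syllables=SYLLABLES):
    """Return every way to split text into known syllables."""
    if not text:
        return [[]]
    results = []
    for syl in syllables:
        if text.startswith(syl):
            for rest in segmentations(text[len(syl):], syllables):
                results.append([syl] + rest)
    return results


def candidates(syllables, lexicon=LEXICON):
    """Prefer a whole-word entry; otherwise combine per-syllable candidates."""
    word = "".join(syllables)
    if word in lexicon:
        return lexicon[word]
    per_syllable = [lexicon.get(s, ["?"]) for s in syllables]
    return ["".join(combo) for combo in itertools.product(*per_syllable)]


for split in segmentations("jīnnián"):
    print(split, candidates(split))
```

Looking up whole words before single syllables is what lets a common word such as 今年 win over the many single-character readings of jīn and nián; for rarer words the list of combinations grows quickly and a human still has to choose.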

pepebecker / pinyin-convert

Feature Request : Support pinyin->hanzi #1