open-dict-data / ipa-dict

Monolingual wordlists with pronunciation information in IPA
https://open-dict-data.github.io/ipa-lookup/
MIT License

IPA for Vietnamese #11

Open TasseDeCafe opened 6 years ago

TasseDeCafe commented 6 years ago

Hi!

There is an excellent IPA converter that can convert a text using the Vietnamese script into IPA (for 3 different accents): https://github.com/kirbyj/vPhon

I tested it myself when I was learning Northern Vietnamese. I didn't notice any inaccuracies when I compared its output to audio recordings of a native speaker. The only thing missing is a good Vietnamese dictionary. I'm going to try to find one, but you might already have one.

Edit: Okay, this should do the trick: https://www.informatik.uni-leipzig.de/~duc/software/misc/wordlist.html

dohliam commented 6 years ago

@TasseDeCafe Thanks very much for sharing this! :+1: It would be great to add data for Vietnamese, and it looks like the links you found have everything we would need to get started.

I tested out vPhon and it seems to produce excellent results. We should probably convert the tone numbers to IPA tone letters for consistency, though. This seems like it could be pretty straightforward using, for example, the chart here.
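For illustration, here is a minimal sketch of that conversion. It assumes the tones appear in the transcription as plain Chao digits 1-5 (for example a syllable followed by `33`) and maps each digit to the corresponding Chao tone letter. The function name is hypothetical, and any other tone markup vPhon may emit (such as glottalization marks) is left untouched rather than handled.

```python
# Assumption: tones are written as Chao digits, 1 = lowest pitch, 5 = highest.
# Each digit maps to one IPA (Chao) tone letter.
TONE_LETTERS = str.maketrans({
    "1": "\u02e9",  # extra-low tone letter
    "2": "\u02e8",  # low tone letter
    "3": "\u02e7",  # mid tone letter
    "4": "\u02e6",  # high tone letter
    "5": "\u02e5",  # extra-high tone letter
})


def tone_numbers_to_letters(ipa: str) -> str:
    """Replace every digit 1-5 in the string with its Chao tone letter.

    Hypothetical helper: all other characters pass through unchanged.
    """
    return ipa.translate(TONE_LETTERS)


if __name__ == "__main__":
    # "ma33" is a made-up input used only to show the digit replacement.
    print(tone_numbers_to_letters("ma33"))
```

A plain character translation keeps the sketch short; a real conversion would need to be checked against the actual vPhon output for each dialect.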

Would you be interested in generating the pronunciations using vPhon and submitting a PR? If so I would be happy to merge it. Otherwise I can do this myself using the links you provided above.

TasseDeCafe commented 6 years ago

Okay, great! I will try to generate the dictionary myself; it should be fun. Which format do you want it in?

I might be able to generate dictionaries for other languages as well, but let's see how it goes with this one first.

dohliam commented 6 years ago

@TasseDeCafe Awesome!

The raw data format is pretty simple -- you can find a description here. Basically it's just a plain text file with the word and corresponding IPA separated by a tab.

The other formats (JSON, XML, etc) are automatically generated from the raw data when I update the releases.
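As a rough sketch of what that pipeline could look like, the snippet below parses the tab-separated raw format (one `word<TAB>ipa` pair per line) and serializes it to JSON. The function names and the flat word-to-IPA JSON layout are assumptions made for illustration; the project's actual generated files may be structured differently.

```python
import json


def parse_raw(text: str) -> dict[str, str]:
    """Parse raw data: one entry per line, word and IPA separated by a tab."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, ipa = line.split("\t", 1)
        entries[word] = ipa
    return entries


def to_json(entries: dict[str, str]) -> str:
    """Serialize entries as JSON, keeping IPA characters unescaped."""
    return json.dumps(entries, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Placeholder entries, not real dictionary data.
    raw = "word1\tipa1\nword2\tipa2\n"
    print(to_json(parse_raw(raw)))
```

Passing `ensure_ascii=False` keeps the IPA symbols readable in the output instead of turning them into `\u` escapes.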

Maybe we could generate three different dictionaries -- one each for North, Central, and South. Would that make sense?

dohliam commented 5 years ago

@TasseDeCafe By the way, just wanted to let you know about an early application of this Vietnamese IPA data. It's still in beta and very experimental, but if you go into any of the stories in the link and click on the "ipa" button on the right hand side of the page you should see the corresponding IPA! (Words that didn't match anything in the dictionary -- mostly proper names -- are marked with @.) No audio yet, but we're working on it... :smile:
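The lookup behaviour described above could be sketched roughly as follows. This is not the application's actual code: the function name, the whitespace tokenization, the lowercasing, and the choice to prefix unmatched words with `@` are all assumptions made for illustration.

```python
def annotate(text: str, ipa_dict: dict[str, str], marker: str = "@") -> str:
    """Replace each word with its IPA; flag words missing from the dictionary.

    Hypothetical sketch: splits on whitespace and looks up the lowercased
    token, prefixing unmatched words with `marker`.
    """
    out = []
    for token in text.split():
        ipa = ipa_dict.get(token.lower())
        out.append(ipa if ipa is not None else marker + token)
    return " ".join(out)


if __name__ == "__main__":
    # Placeholder dictionary, not real pronunciation data.
    demo = {"word1": "ipa1", "word2": "ipa2"}
    print(annotate("word1 Name word2", demo))
```

Proper names tend to fall through such a lookup because they are absent from the wordlist, which matches the behaviour noted in the comment.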

Anyway, thanks again for finding these sources and generating all the data!