open-dict-data / ipa-dict

Monolingual wordlists with pronunciation information in IPA
https://open-dict-data.github.io/ipa-lookup/
MIT License

Separation of phonemes #33

Open nam-ak opened 2 years ago

nam-ak commented 2 years ago

The phonemes are separated by dots in the pull request "Create pt-BR.txt" by @carmo-evan, but not in any other language in this project. I have no idea how he did it, or whether it's possible to do it automatically for the other languages.

dohliam commented 2 years ago

@nam-ak That's a good question, and I also don't know how @carmo-evan achieved this -- perhaps with a script? I know that there are existing syllabification parsers for various languages, but it is not a simple or error-free process, and the algorithm for each language would need to be quite different.

Overall, I think it would be good to have syllabification added for all languages as an eventual goal, so if you have any ideas for how to automate this in a reasonably accurate way, or if you want to submit a pull request to add syllable parsing for a particular language, that would be very welcome. :smile:
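
To make the difficulty concrete, here is a toy sketch of one common heuristic, the maximal onset principle: each consonant cluster between two vowels is split so that the longest cluster the language allows as an onset starts the next syllable. Everything language-specific here is an assumption for illustration: the `VOWELS` and `ONSETS` sets are hypothetical and far from complete, stress marks are not handled, and adjacent vowels are always kept in one syllable. It is not how any of the data in this project was produced.

```python
# Hypothetical, incomplete inventories; a real language needs its own sets.
VOWELS = set("aeiouɐɛɔəɪʊ")
ONSETS = {"", "p", "t", "k", "b", "d", "g", "f", "s", "z", "m", "n", "l", "r",
          "pr", "tr", "kr", "pl", "kl", "st"}


def syllabify(ipa):
    """Insert dots into an unsyllabified IPA string using maximal onsets."""
    # Group the string into alternating runs of vowels and consonants.
    runs = []
    for ch in ipa:
        is_vowel = ch in VOWELS
        if runs and runs[-1][0] == is_vowel:
            runs[-1][1] += ch
        else:
            runs.append([is_vowel, ch])

    syllables, current, seen_vowel = [], "", False
    for i, (is_vowel, text) in enumerate(runs):
        if is_vowel:
            current += text
            seen_vowel = True
        elif not seen_vowel or i == len(runs) - 1:
            # Word-initial onset or word-final coda: no boundary to place.
            current += text
        else:
            # Medial cluster: the longest legal onset suffix opens the
            # next syllable; whatever is left closes the current one.
            for cut in range(len(text) + 1):
                if text[cut:] in ONSETS:
                    break
            syllables.append(current + text[:cut])
            current = text[cut:]
    syllables.append(current)
    return ".".join(syllables)


print(syllabify("kaza"))
print(syllabify("astra"))
```

Even this small version depends entirely on the onset table, which is the part that differs from language to language and that linguists do not always agree on.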

nam-ak commented 1 year ago

Are those syllabification parsers just for written words, and not for their IPA phonetic transcriptions? I don't think it's possible to automate this in several languages, even in English, because unfortunately there is no straightforward syllabification method that is accepted by a majority of linguists. I think the syllabification @carmo-evan included in pt-BR.txt was already in the database of whatever dictionary he used to create the data for this open source project.

dohliam commented 1 year ago

Yes, I fully agree. Which of course brings us back to the reason that most of the languages in the project don't currently include syllabification...

Interestingly, the database for Dutch (which has not been merged yet) seems to include dot-separated phonemes as well. I assume that this is also a case where the source dictionary already incorporates these.
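
A quick way to check which wordlists already carry dot separators is to count how many transcriptions contain a dot. The sketch below assumes each line has the form `word<TAB>/ipa/, /ipa/`; the sample entries are invented for illustration and are not copied from the project data, and the function names are hypothetical.

```python
def parse_line(line):
    """Split one 'word<TAB>/ipa/, /ipa/' line into (word, [transcriptions])."""
    word, _, rest = line.rstrip("\n").partition("\t")
    prons = [p.strip().strip("/") for p in rest.split(",") if p.strip()]
    return word, prons


def split_syllables(pron):
    """Split a dot-separated transcription into its parts."""
    return [part for part in pron.split(".") if part]


def dotted_share(lines):
    """Fraction of transcriptions that contain at least one dot."""
    total = dotted = 0
    for line in lines:
        _, prons = parse_line(line)
        for pron in prons:
            total += 1
            dotted += "." in pron
    return dotted / total if total else 0.0


# Invented sample lines, only to show the assumed format.
sample = ["casa\t/ˈka.zɐ/\n", "water\t/ˈwɔtɚ/, /ˈwɑtɚ/\n"]
print(parse_line(sample[0]))
print(split_syllables("ˈka.zɐ"))
print(dotted_share(sample))
```

A share close to 1.0 would suggest the source dictionary shipped with syllable boundaries, while a share near 0.0 would suggest it did not. Monosyllabic words never contain a dot, so even a fully syllabified list will score below 1.0.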

dohliam commented 1 year ago

Perhaps @VincentCCL might be able to shed some light on how syllabification was carried out for the Dutch data. For example, was it added manually, or through an automated process?

VincentCCL commented 1 year ago

For the orthographic transcriptions, we have manually defined syllable borders -- for the phonetic transcription this is not 100% clear, but our documentation does not say that they are not manual -- so we assume they have been made manually, or transferred from the orthographic transcript. The source is part of the CELEX data, cf https://kdutch.ivdnt.org/wiki/Lexica#CELEX_and_WebCelex

nam-ak commented 1 year ago

"WebCelex is a webbased interface to the CELEX lexical databases of English, Dutch and German."

https://catalog.ldc.upenn.edu/LDC96L14 On this page for CELEX2 I found the following information:

"For each language, this data set contains detailed information on:

・orthography (variations in spelling, hyphenation)
・phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress)
・morphology (derivational and compositional structure, inflectional paradigms)
・syntax (word class, word class-specific subcategorizations, argument structures)
・word frequency (summed word and lemma counts, based on recent and representative text corpora)"

So I assume that in this lexical database the phonetic transcriptions for English and German are also separated by dots, i.e. that the IPA transcriptions are syllabified. Can you confirm? I don't know how to access this database. Perhaps, if the license allows it, we could replace the existing English and German data with it and merge that?

VincentCCL commented 1 year ago

I'll check whether I have access to the non-Dutch data -- I am only familiar with the Dutch CELEX.

oamamao commented 1 year ago

Any updates on this? I'm interested in it.