lingpy / calc-workflow

Workflows for computer-assisted language comparison: State of the Art
GNU General Public License v3.0

Chen orthography #5

Open Wu-Urbanek opened 5 years ago

Wu-Urbanek commented 5 years ago

A summary of the phoneme inventories of Hmong-Mien languages is provided on pages 50 to 68. I have typed up pages 50 to 54 so far. See the progress in code/Chen_book_orthography.csv.

Wu-Urbanek commented 5 years ago

Updated an orthography profile which summarizes the phonemes from Chen's book (data provided by Mattis).

LinguList commented 5 years ago

Can you point me to the file, please?

Wu-Urbanek commented 5 years ago

The code is https://github.com/lingpy/calc-workflow/blob/master/code/P_summarise_orthography.py
The output is https://github.com/lingpy/calc-workflow/blob/master/code/summarised_orthography.tsv

LinguList commented 5 years ago

Nice. What we need to do now is: group the identical cells on the left, and separate the languages where we find them by a comma. So if you have a "k" in two languages, put the "k" first, and then the two languages in the next cell, separated by a comma. In this way, we can reduce the number of items drastically.
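A minimal sketch of the grouping described above, assuming the input is a list of (symbol, language) pairs. The function name `group_symbols` and the language names `LangA` and `LangB` are made up for illustration; this is not the code in P_summarise_orthography.py.

```python
from collections import defaultdict


def group_symbols(rows):
    """Collapse (symbol, language) pairs into one row per symbol.

    Languages are joined with a comma, in order of first appearance.
    """
    grouped = defaultdict(list)
    for symbol, language in rows:
        # Skip repeated mentions of the same language for a symbol.
        if language not in grouped[symbol]:
            grouped[symbol].append(language)
    return [(symbol, ", ".join(langs)) for symbol, langs in grouped.items()]


# Hypothetical input: "k" occurs in two languages, "ei" in one.
rows = [("k", "LangA"), ("ei", "LangA"), ("k", "LangB")]
grouped = group_symbols(rows)
for symbol, languages in grouped:
    print(symbol + "\t" + languages)
```

Each symbol ends up on a single line, so the list shrinks to the number of distinct symbols.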

BTW: we will also need to add these new language names from Chinese to the languages.tsv, which has our English names, so we can identify them. Making the IPA will be simple later on. Both Nathanael and I can help, and @Schweikhard can also learn typing IPA (he probably knows already).

LinguList commented 5 years ago

Ah, wait: did you already do that, separating them by a space? If so, we can start adding IPA right away. I'd just like to know what the * on some entries means. Did you check this with the book?

Wu-Urbanek commented 5 years ago

In the profile I made, columns are separated by tabs. The first column is the phonetic symbol, the second column lists the languages in which the symbol appears (separated by whitespace). I also wonder what the * means. The book gives a summary of the symbols (but I didn't see the * there), and then tables of phoneme inventories. I didn't check the tables. But I know that "th" and "dh" are actually tʰ and dʰ.

LinguList commented 5 years ago

Okay, then we'll only need to add the specific Chinese language names to the other names we use. We should move this into lexibank, where we can more properly handle the segmentation.

LinguList commented 5 years ago

Hi again, @MacyL, in fact, we need to do this in another way, as there was a misunderstanding. I want the following profile:

Graphemes    IPA    Structure
k            k      i
ei           ei     n
ek           e k    n c

You see? Your listing does not provide this information. My idea was in fact not to write code that converts the list by Doug into a two-column file, but to semi-manually turn it directly into the kind of profile we need. This would also require that you compare each language with the source while doing this. It is easier than typing everything off, but we need more than parsing everything into one file here. I would even say: keep all languages distinct for now, so start with one file per language. It's in fact the same as what I did for the data in Liu2008, which I showed you.
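To illustrate why the profile needs the IPA and Structure columns, here is a rough sketch of how such a profile could be applied to a word by greedy longest-match segmentation. The three entries are the ones from the example profile in this comment; the `segment` function, the "?" marker for unknown characters, and the test words are assumptions for illustration, and the structure labels are treated as opaque strings.

```python
# Grapheme -> (IPA, Structure), taken from the three example rows.
PROFILE = {
    "k": ("k", "i"),
    "ei": ("ei", "n"),
    "ek": ("e k", "n c"),
}


def segment(word, profile):
    """Segment a word by always taking the longest matching grapheme."""
    max_len = max(len(grapheme) for grapheme in profile)
    ipa, structure = [], []
    pos = 0
    while pos < len(word):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(word) - pos), 0, -1):
            chunk = word[pos:pos + size]
            if chunk in profile:
                ipa_part, structure_part = profile[chunk]
                ipa.extend(ipa_part.split())
                structure.extend(structure_part.split())
                pos += size
                break
        else:
            # No grapheme matched: flag the character (hypothetical marker).
            ipa.append("?")
            structure.append("?")
            pos += 1
    return ipa, structure


print(segment("kei", PROFILE))
print(segment("kek", PROFILE))
```

A two-column list of symbols and languages cannot drive this step, because it says neither how a grapheme maps to IPA segments nor which structural slot each segment fills.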

LinguList commented 5 years ago

Check this file as an example for the output I wanted. @Schweikhard may also help here, since the rules are similar for the preparation. We can meet on Thursday and I can explain more, why this is so important.

Wu-Urbanek commented 5 years ago

The Python script is updated. New orthography profiles are here: https://github.com/lingpy/calc-workflow/tree/master/code/Phoneme_Inventories
Each file is one language's phoneme inventory, with 3 columns: Grapheme, IPA, Template. The columns we need to fill in are IPA and Template.

LinguList commented 5 years ago

Sorry, when I saw this, I realized it is still better to go with one big file. I still think it is good to have one phoneme inventory per doculect, but filling them out will be easier when you have one big file.

This means, I suggest, you proceed as follows:

  1. get the list of all segments, with four columns: Grapheme, IPA, Structure, Note
  2. fill them out (please follow closely the structure I used, pay attention to differences for the first file I made there)
  3. use your script to make an individual profile for each doculect, as you did in Phoneme_Inventories; this is for documentation

So we still do per doculect, but in order to annotate, we do it for all at once, and THEN you split, and THEN you check with the book.
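The annotate-once, split-afterwards procedure could be sketched roughly as follows. This assumes the Note column holds the doculect names separated by whitespace; the sample rows, the names `LangA` and `LangB`, and the function `split_by_doculect` are invented for illustration and are not the existing script.

```python
import csv
import io
from collections import defaultdict

# Hypothetical master profile, already annotated (step 2).
MASTER = (
    "Grapheme\tIPA\tStructure\tNote\n"
    "k\tk\ti\tLangA LangB\n"
    "ei\tei\tn\tLangA\n"
    "ek\te k\tn c\tLangB\n"
)


def split_by_doculect(master_text):
    """Return {doculect: [(Grapheme, IPA, Structure), ...]} (step 3)."""
    reader = csv.DictReader(
        io.StringIO(master_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    profiles = defaultdict(list)
    for row in reader:
        # One annotated row is copied into every doculect listed in Note.
        for doculect in row["Note"].split():
            profiles[doculect].append(
                (row["Grapheme"], row["IPA"], row["Structure"])
            )
    return dict(profiles)


profiles = split_by_doculect(MASTER)
for doculect, rows in profiles.items():
    print(doculect, rows)
```

Because every grapheme is annotated only once in the big file, the same IPA and Structure values end up in all doculects that share it, and the per-doculect files can then be checked against the book.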

Does that sound okay to you, @MacyL?

Wu-Urbanek commented 5 years ago

Updated the code. I kept the template from yesterday but added the IPA and Template columns. So it looks like this: Grapheme, IPA, Template, Note. The columns are all tab-separated, and the languages in the Note column are separated by whitespace. https://github.com/lingpy/calc-workflow/blob/master/code/summarised_orthography.tsv

LinguList commented 5 years ago

Okay, then you can start to add the CLTS/IPA items. Remember to look at my example, as it contains many hints on how this should be done. If @Schweikhard has time, you could also look at this quickly, and maybe do a check for consistency, once @MacyL is finished? Then I'd check it last.
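A consistency check of the kind mentioned here might look like the sketch below. The three rules it applies (no empty cells, as many structure labels as IPA segments, no grapheme with two different annotations) are assumptions read off the small example profile earlier in this thread, not rules stated for this project, and `check_profile` and the sample rows are hypothetical.

```python
def check_profile(rows):
    """Return a list of (line number, grapheme, message) for suspicious rows."""
    problems = []
    seen = {}  # grapheme -> (line number, IPA, Structure) of first occurrence
    for line_no, (grapheme, ipa, structure) in enumerate(rows, start=1):
        if not ipa.strip() or not structure.strip():
            problems.append((line_no, grapheme, "empty IPA or Structure"))
            continue
        if len(ipa.split()) != len(structure.split()):
            problems.append(
                (line_no, grapheme, "IPA and Structure differ in length")
            )
        if grapheme in seen:
            first_line, first_ipa, first_structure = seen[grapheme]
            if (first_ipa, first_structure) != (ipa, structure):
                problems.append(
                    (line_no, grapheme, "conflicts with line %d" % first_line)
                )
        else:
            seen[grapheme] = (line_no, ipa, structure)
    return problems


# Hypothetical rows with one problem of each kind.
rows = [
    ("k", "k", "i"),
    ("ek", "e k", "n"),
    ("ei", "", "n"),
    ("k", "kʰ", "i"),
]
problems = check_profile(rows)
for problem in problems:
    print(problem)
```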

Wu-Urbanek commented 5 years ago

One thing I found in Chen's book:

Or shows in loanwords For example, page 261 高坡