Help with Interpretation branch

PromiseDodzi commented 5 months ago

I have carefully reproduced the workflow using the CLDF transformed data. I have logged the reproduction code in branch 4, "Interpretation" of the repo for inspection. The only thing I modify is the coverage number from 750 to 288 (we want 20 languages, and 288 allows us to have this).

When we inspect the alignments in EDICTOR, three main issues come up. We will be grateful if you can help us resolve them.

We would like to replace colons in the data based on this rule:

Where a colon follows a grapheme to signal duration, we treat the colon as a repetition of the preceding vowel. The colon is therefore replaced with the preceding sound. If the segment preceding the colon carries a level tone, the repeated segment carries the same tone; if it carries a falling tone (x̂), the first segment carries a high tone and the repeated segment carries a low tone; if it carries a rising tone (x̌), the first segment carries a low tone and the second segment carries a high tone. A dot is placed between segment one and segment two (the colon replacement) to show that they belong to one unit (e.g. ǔ → ù.ú).
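For reference, the rule can be sketched in Python as follows. This is only an illustration of the rule as stated, not the orthography profile itself: the function name `expand_length` is hypothetical, both ":" and "ː" are assumed to count as length marks, and the preceding segment is copied even if it is not a vowel. The output is in decomposed (NFD) Unicode form.

```python
import unicodedata

GRAVE, ACUTE, CIRCUMFLEX, CARON = "\u0300", "\u0301", "\u0302", "\u030c"
LENGTH_MARKS = (":", "\u02d0")  # assumption: ASCII colon and IPA length mark


def expand_length(form):
    """Replace a length mark by a copy of the preceding segment (hypothetical helper)."""
    out = []  # each item is a base character plus its combining marks
    for ch in unicodedata.normalize("NFD", form):
        if unicodedata.combining(ch) and out:
            out[-1] += ch  # attach the tone mark to the preceding base character
        elif ch in LENGTH_MARKS and out:
            seg = out[-1]
            if CIRCUMFLEX in seg:  # falling tone: high, then low
                first = seg.replace(CIRCUMFLEX, ACUTE)
                second = seg.replace(CIRCUMFLEX, GRAVE)
            elif CARON in seg:  # rising tone: low, then high
                first = seg.replace(CARON, GRAVE)
                second = seg.replace(CARON, ACUTE)
            else:  # level tone: same tone on both segments
                first = second = seg
            out[-1] = first + "." + second
        else:
            out.append(ch)
    return "".join(out)


print(expand_length("ǔ:"))
```

Note that the dot only ever appears between the two segments, never at the end of the result.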

We would like to know whether forms such as the DogulDomBendiely form below can be aligned so that the second "à" gets its own slot, instead of the two being stuck together as they currently are.

[Screenshot: Capture]

We would like to do this: create lex with the LexStat algorithm, then run lex.get_scorer(runs=10000) and lex.cluster(method="lexstat", threshold=0.55, ref="cogid"). When I replace the Partial code that you use with these lines, an error is raised.

Thank you

LinguList commented 5 months ago

So the "colon" is the length marker, right? If you want to replace them, I kindly ask you to modify the orthography profile and make a pull request. You can then also run the cldfbench lexibank.makecldf command, but maybe it is better if you run it locally and only submit the revised orthography profile so we can check it. You should go through the orthoprofile in etc/orthography.tsv and just modify as you see fit. We'll then check from there.

LinguList commented 5 months ago

I mentioned in the seminar that there are TWO representations for all cases like á.à, namely, what you find in the column Grouped_Segments, which is á.à, and what you find in the column Segments, which is already á à. So you do not need me to replace anything nor do you need to touch the orthography profile, you just need to adjust the command that starts with pyedictor, that I mentioned in the course, and replace tokens:grouped_segments by tokens:segments. Let me know if you do not find this part, so I can point you to the full command.
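To make the relation between the two columns concrete, here is a minimal sketch of how a grouped string maps to the ungrouped one. The function name `ungroup` is hypothetical, and the dataset itself already provides both columns, so this is only for illustration; switching the namespace entry as described above remains the intended route.

```python
def ungroup(grouped_segments):
    """Split dot-grouped units such as 'á.à' into separate segments (hypothetical helper)."""
    return [
        part
        for unit in grouped_segments.split()  # units are separated by spaces
        for part in unit.split(".")           # a dot joins segments within one unit
        if part                               # skip empty parts left by stray dots
    ]


print(" ".join(ungroup("m ɔ̀.ɔ̀")))
```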

LinguList commented 5 months ago

I advise against running LexStat, since the data are specifically created with partial cognates, so we want partial cognates to be displayed and aligned. Using LexStat is in my opinion scientifically wrong here. So before we change my example to account for this (which is possible, but will be ugly), I'd like to understand the motivation.

PromiseDodzi commented 5 months ago

So the "colon" is the length marker, right? If you want to replace them, I kindly ask you to modify the orthography profile and make a pull request. You can then also run the cldfbench lexibank.makecldf command, but maybe it is better if you run it locally and only submit the revised orthography profile so we can check it. You should go through the orthoprofile in etc/orthography.tsv and just modify as you see fit. We'll then check from there.

Yes, the colon is the length marker. Alright, I'll do just that.

PromiseDodzi commented 5 months ago

I mentioned in the seminar that there are TWO representations for all cases like á.à, namely, what you find in the column Grouped_Segments, which is á.à, and what you find in the column Segments, which is already á à. So you do not need me to replace anything nor do you need to touch the orthography profile, you just need to adjust the command that starts with pyedictor, that I mentioned in the course, and replace tokens:grouped_segments by tokens:segments. Let me know if you do not find this part, so I can point you to the full command.

Ok. I will make sure to inform you if there are any issues

PromiseDodzi commented 5 months ago

I advise against running LexStat, since the data are specifically created with partial cognates, so we want partial cognates to be displayed and aligned. Using LexStat is in my opinion scientifically wrong here. So before we change my example to account for this (which is possible, but will be ugly), I'd like to understand the motivation.

Kindly ignore this request. This is out of curiosity. We will just maintain what you advise. Thank you

PromiseDodzi commented 3 months ago

So the "colon" is the length marker, right? If you want to replace them, I kindly ask you to modify the orthography profile and make a pull request. You can then also run the cldfbench lexibank.makecldf command, but maybe it is better if you run it locally and only submit the revised orthography profile so we can check it. You should go through the orthoprofile in etc/orthography.tsv and just modify as you see fit. We'll then check from there.

@LinguList I initially thought my attempt at this was successful. Upon careful inspection, however, I realize that the modified orthography profile does not seem to affect the results visible in the "heathdogon-grouped.tsv" data or in the alignment output, i.e. "heathdogon-ungrouped-shortened-aligned.tsv" and "heathdogon-grouped-shortened-aligned.tsv": there are still some colons. I have logged all the code in the "interpretation" branch for inspection. I reproduce the workflow you advise and only substitute the orthography profile in etc/orthography.tsv with my modified version, but the modified profile does not seem to affect the results. I have also tried running the cldfbench lexibank.makecldf command with the modified orthography profile in place and then reproducing the workflow you advise, but I still get the same results. I would be grateful if you could help me resolve this.

LinguList commented 3 months ago

Did you make a simple check on the orthography profile? You just delete all lines and see if this has an effect? This should result in many errors. You can also paste the output here or check TRANSCRIPTION.md; all these files provide some more information.

LinguList commented 3 months ago

Check here first: https://github.com/languageorphans/heathdogon/blob/interpretation/cldf/forms.csv

This should show if your orthoprofile changes took action.

LinguList commented 3 months ago

The column "Segments" and "Grouped_Segments" should be the ones where you find changes.
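A quick way to see whether length marks survive in those two columns is a small script like the one below. It is a sketch under assumptions: the function name is hypothetical, the `ID` column name follows the usual CLDF layout, and both ":" and "ː" are treated as length marks. The in-memory sample only stands in for cldf/forms.csv.

```python
import csv
import io


def rows_with_length_marks(csv_file, columns=("Segments", "Grouped_Segments"),
                           marks=(":", "\u02d0")):
    """Yield (ID, column, value) for every cell that still contains a length mark."""
    for row in csv.DictReader(csv_file):
        for col in columns:
            value = row.get(col) or ""
            if any(mark in value for mark in marks):
                yield row["ID"], col, value


# Tiny stand-in for cldf/forms.csv; for the real file, pass
# open("cldf/forms.csv", encoding="utf-8", newline="") instead.
sample = io.StringIO(
    "ID,Segments,Grouped_Segments\n"
    "A-1,m ɔ̌ː,m ɔ̌ː\n"
    "A-2,m ɔ̀ ɔ̀,m ɔ̀.ɔ̀\n"
)
hits = list(rows_with_length_marks(sample))
for hit in hits:
    print(*hit)
```

If the profile changes took effect, the list of hits for the real file should shrink or become empty.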

PromiseDodzi commented 3 months ago

Check here first: https://github.com/languageorphans/heathdogon/blob/interpretation/cldf/forms.csv

This should show if your orthoprofile changes took action.

The changes are now visible here

PromiseDodzi commented 3 months ago

The column "Segments" and "Grouped_Segments" should be the ones where you find changes.

Thank you so much

PromiseDodzi commented 3 months ago

Did you make a simple check on the orthography profile? You just delete all lines and see if this has an effect? This should result in many errors. You can also paste the output here or check TRANSCRIPTION.md; all these files provide some more information.

I think it was more of a procedural issue. Once I followed your suggestion of doing a delete-and-see check, I was able to resolve it. Thank you very much.

PromiseDodzi commented 3 months ago

I mentioned in the seminar that there are TWO representations for all cases like á.à, namely, what you find in the column Grouped_Segments, which is á.à, and what you find in the column Segments, which is already á à. So you do not need me to replace anything nor do you need to touch the orthography profile, you just need to adjust the command that starts with pyedictor, that I mentioned in the course, and replace tokens:grouped_segments by tokens:segments. Let me know if you do not find this part, so I can point you to the full command.

Ok. I will make sure to inform you if there are any issues

@LinguList when I run the pyedictor command that maps "segments" to "tokens", i.e. the ungrouped command in the Makefile, on data converted with my modified orthography profile, I get an error message, which I paste below:

edictor wordlist --dataset="cldf/cldf-metadata.json" \
  --namespace='{"id": "local_id", "language_id": "doculect", "variety": "variety", "concept_name": "concept","value": "value", "form": "form", "segments": "tokens","plural_segments": "plural_tokens", "comment": "note", "concept_swadesh": "swadesh"}' \
  --name="heathdogon-ungrouped"
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Promise Dodzi Kpoglu\AppData\Local\Programs\Python\Python312\Scripts\edictor.exe\__main__.py", line 7, in <module>
  File "C:\Users\Promise Dodzi Kpoglu\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyedictor\cli.py", line 167, in main
    return _cmd_by_name(args.subcommand)(args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Promise Dodzi Kpoglu\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyedictor\cli.py", line 135, in __call__
    get_lexibase(
  File "C:\Users\Promise Dodzi Kpoglu\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyedictor\sqlite.py", line 12, in get_lexibase
    wordlist = Wordlist.from_cldf(
               ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Promise Dodzi Kpoglu\AppData\Local\Programs\Python\Python312\Lib\site-packages\lingpy\basic\wordlist.py", line 1204, in from_cldf
    D[idx] = [datatypes.get(
              ^^^^^^^^^^^^^^
  File "C:\Users\Promise Dodzi Kpoglu\AppData\Local\Programs\Python\Python312\Lib\site-packages\lingpy\basictypes.py", line 58, in __init__
    self.n = [strings(x) for x in (' '.join(iterable).split(sep) if not
                                   ^^^^^^^^^^^^^^^^^^
TypeError: sequence item 3: expected str instance, NoneType found
make: *** [Makefile:14: ungrouped] Error 1

I have again logged the updated code in the interpretation branch. Could you please help me resolve this issue too?

LinguList commented 3 months ago

Do you have all columns listed in NameSpace? The error occurs because some form that is to be split (tokens) seems to be empty. This is probably the issue. Please check forms.csv in cldf again.

LinguList commented 3 months ago

And make sure you ALWAYS have a value in Segments.
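One way to verify this mechanically, rather than by eye, is sketched below. A cell can look filled and still yield an empty item once it is split on single spaces, so the check flags empty cells as well as leading, trailing, or doubled spaces. The function name is hypothetical and the `ID` column name is assumed from the usual CLDF layout.

```python
import csv
import io


def bad_segment_cells(csv_file, column="Segments"):
    """Return IDs of rows whose segment string is empty or contains stray spaces."""
    bad = []
    for row in csv.DictReader(csv_file):
        value = row.get(column) or ""
        # split(" ") keeps empty items, which is what exposes stray spaces
        if not value or "" in value.split(" "):
            bad.append(row["ID"])
    return bad


# Tiny stand-in for cldf/forms.csv; for the real file, pass
# open("cldf/forms.csv", encoding="utf-8", newline="") instead.
sample = io.StringIO(
    "ID,Segments\n"
    "A-1,m a\n"
    "A-2,m a \n"
    "A-3,\n"
    "A-4,m  a\n"
)
bad = bad_segment_cells(sample)
print(bad)
```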

PromiseDodzi commented 3 months ago

Do you have all columns listed in NameSpace? The error occurs because some form that is to be split (tokens) seems to be empty. This is probably the issue. Please check forms.csv in cldf again.

Apart from variety, concept_name and concept_swadesh, all the columns listed in NameSpace are in forms.csv. I have checked the Segments column of forms.csv again and there is no empty cell in it.

PromiseDodzi commented 3 months ago

And make sure you ALWAYS have a value in Segments.

I have checked again, and the Segments column in forms.csv always has a value

LinguList commented 3 months ago

I cannot replicate your error, sorry.

git clone https://github.com/languageorphans/heathdogon
git branch interpretation
git checkout interpretation
git config pull.rebase false # must do this if you haven't defined it globally
git pull origin interpretation
# must merge file `raw/Dogon.comp.vocab.UNICODE-2017.lexicon.csv` manually, after problem here
cldfbench lexibank.makecldf lexibank_heathdogon.py # make sure to add paths, etc. not shown here

Then:

pip install edictor[lingpy] # supersedes pyedictor

Then:

edictor wordlist --dataset=cldf/cldf-metadata.json --namespace='{"id": "local_id", "language_id": "doculect", "variety": "variety", "concept_name": "concept","value": "value", "form": "form", "segments": "tokens","plural_segments": "plural_tokens", "comment": "note", "concept_swadesh": "swadesh"}' --name=heathdogon-ungrouped

LinguList commented 3 months ago

heathdogon-grouped.tsv.zip

PromiseDodzi commented 3 months ago

I cannot replicate your error, sorry.

git clone https://github.com/languageorphans/heathdogon
git branch interpretation
git checkout interpretation
git config pull.rebase false # must do this if you haven't defined it globally
git pull origin interpretation
# must merge file `raw/Dogon.comp.vocab.UNICODE-2017.lexicon.csv` manually, after problem here
cldfbench lexibank.makecldf lexibank_heathdogon.py # make sure to add paths, etc. not shown here

Then:

pip install edictor[lingpy] # supersedes pyedictor

Then:

edictor wordlist --dataset=cldf/cldf-metadata.json --namespace='{"id": "local_id", "language_id": "doculect", "variety": "variety", "concept_name": "concept","value": "value", "form": "form", "segments": "tokens","plural_segments": "plural_tokens", "comment": "note", "concept_swadesh": "swadesh"}' --name=heathdogon-ungrouped

I have done this and still get an error. Could you please explain how I can do the manual merge once I am on the interpretation branch? These are the steps I ran:

git clone https://github.com/languageorphans/heathdogon.git
cd heathdogon
git checkout interpretation
#manually merge here - I am a little stuck
cldfbench lexibank.makecldf #paths were added to glottolog, clts and concepticon

Then I:

pip install edictor[lingpy]

Finally, I run:

edictor wordlist --dataset=cldf/cldf-metadata.json --namespace='{"id": "local_id", "language_id": "doculect", "variety": "variety", "concept_name": "concept","value": "value", "form": "form", "segments": "tokens","plural_segments": "plural_tokens", "comment": "note", "concept_swadesh": "swadesh"}' --name=heathdogon-ungrouped

I get this error message:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Promise Dodzi Kpoglu\AppData\Local\Programs\Python\Python312\Scripts\edictor.exe\__main__.py", line 7, in <module>
  File "C:\Users\Promise Dodzi Kpoglu\AppData\Local\Programs\Python\Python312\Lib\site-packages\edictor\cli.py", line 268, in main
    return _cmd_by_name(args.subcommand)(args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Promise Dodzi Kpoglu\AppData\Local\Programs\Python\Python312\Lib\site-packages\edictor\cli.py", line 228, in __call__
    get_wordlist(
  File "C:\Users\Promise Dodzi Kpoglu\AppData\Local\Programs\Python\Python312\Lib\site-packages\edictor\wordlist.py", line 70, in get_wordlist
    wordlist = lingpy.Wordlist.from_cldf(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Promise Dodzi Kpoglu\AppData\Local\Programs\Python\Python312\Lib\site-packages\lingpy\basic\wordlist.py", line 1204, in from_cldf
    D[idx] = [datatypes.get(
              ^^^^^^^^^^^^^^
  File "C:\Users\Promise Dodzi Kpoglu\AppData\Local\Programs\Python\Python312\Lib\site-packages\lingpy\basictypes.py", line 58, in __init__
    self.n = [strings(x) for x in (' '.join(iterable).split(sep) if not
                                   ^^^^^^^^^^^^^^^^^^
TypeError: sequence item 3: expected str instance, NoneType found

LinguList commented 3 months ago

Please do the following now: zip the cldf-folder, or zip the entire folder that you use and send it to me via email or via shared cloud. Maybe, start with the CLDF folder, okay? The error points to a problem in the CLDF, there is no other way.

Do you have local changes not submitted? What does git status tell you in your interpretation branch?

PromiseDodzi commented 3 months ago

Please do the following now: zip the cldf-folder, or zip the entire folder that you use and send it to me via email or via shared cloud. Maybe, start with the CLDF folder, okay? The error points to a problem in the CLDF, there is no other way.

Do you have local changes not submitted? What does git status tell you in your interpretation branch?

Alright. I am sending it to you right away. git status tells me I am on the interpretation branch, the branch is up to date, and I have nothing to commit. This is the message on my terminal after I run git status:

PS C:\Users\Promise Dodzi Kpoglu\temp\heathdogon> git status
On branch interpretation
Your branch is up to date with 'origin/interpretation'.

nothing to commit, working tree clean

LinguList commented 3 months ago

Okay, it turns out your data HAS empty segments, as I can confirm here:

In [15]: from pycldf import Dataset

In [16]: ds = Dataset.from_metadata("cldf/cldf-metadata.json")

In [17]: for form in ds.objects("FormTable"):
    ...:     try: " ".join(form.cldf.segments)
    ...:     except: print(form.id)
    ...: 
Najamba-6292_neck-1
BenTey-4227_footprint-1
BankanTey-7713_short-1
BenTey-7713_short-1
BankanTey-3426_deepholewell-1
Najamba-9293_voiceofsbcharacteristiccallofanimal-1
BenTey-5820_makeaholeinwoodenhandle-1
Nanga-5820_makeaholeinwoodenhandle-1
Nanga-2055_awlforpenetratingleather-1

LinguList commented 3 months ago

Let us modify the last statement to narrow this down:

In [18]: for form in ds.objects("FormTable"):
    ...:     try: " ".join(form.cldf.segments)
    ...:     except: print(form.id, form.cldf.segments)
    ...: 
Najamba-6292_neck-1 ['m', 'ɔ̀', 'ɔ̀', None]
BenTey-4227_footprint-1 ['l', 'ɔ̀', 's', 'ɔ̀', '-', 't', 'ɔ̀', 'ɔ̀', None]
BankanTey-7713_short-1 ['ɡ', 'ɔ̀', 'ɔ̀', None]
BenTey-7713_short-1 ['ɡ', 'ɔ̀', 'ɔ̀', None, 'w']
BankanTey-3426_deepholewell-1 ['n', 'ɔ̀', 'ɔ̀', None, 'w', '∼']
Najamba-9293_voiceofsbcharacteristiccallofanimal-1 ['j', 'ɔ̀', 'ɔ̀', None]
BenTey-5820_makeaholeinwoodenhandle-1 ['ɡ', 'ɔ̀', 'ɔ̀', None, '-', 'ɡ', 'ɔ̌']
Nanga-5820_makeaholeinwoodenhandle-1 ['ɡ', 'ɔ̀', 'ɔ̀', None, '-', 'ɡ', 'ɔ̀', 'ɔ́']
Nanga-2055_awlforpenetratingleather-1 ['k', 'ɛ̀', 'm', 'ɛ̀', '-', 'ɡ', 'ù', 's', 'ù', '-', 'ɡ', 'ɔ̀', 'ɔ̀', None]

LinguList commented 3 months ago

Let us now check the file forms.csv.

LinguList commented 3 months ago

For the first form-id, I find:

Najamba-6292_neck-1,,Najamba,6292_neck,mɔ᷈:\\mɔ̌ɛ̀,mɔ᷈:,m ɔ̀ ɔ̀ ,,heathdogon,,,,,m ɔ̀.ɔ̀.,m ɔ̌ ɛ̀,m ɔ̌ ɛ̀,mɔ̌ɛ̀,

So you have a trailing space in m ɔ̀ ɔ̀, this is the error, and I am sure this comes from your profile.
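The effect can be reproduced in plain Python. The sketch below assumes that the reader splits the cell on single spaces and turns the resulting empty item into None, which is consistent with the segment lists printed above; the join then fails with the same TypeError as in the traceback.

```python
cell = "m ɔ̀ ɔ̀ "           # Segments cell with a trailing space
items = cell.split(" ")  # splitting on single spaces leaves an empty last item
# Assumption: the CSV reader represents empty items as None
segments = [item or None for item in items]

try:
    " ".join(segments)
    message = ""
except TypeError as err:
    message = str(err)

print(segments)
print(message)
```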

LinguList commented 3 months ago

In the data which I checked out, I find, on the contrary:

Najamba-6292_neck-1,,Najamba,6292_neck,mɔ᷈:\\mɔ̌ɛ̀,mɔ᷈:,m ɔ̌ː,,heathdogon,,,,,m ɔ̌ː,m ɔ̌ ɛ̀,m ɔ̌ ɛ̀,mɔ̌ɛ̀,

So the error lies most likely in the profile, and since I just created the data from the branch, it means that your profile shows some problems that mine does not show, or my merging corrected the error.

LinguList commented 3 months ago

The error in the profile is here in line 173:

ɔ᷈: ɔ̀.ɔ̀.

The dot in the end is wrong, and yields an interpretation as None.

LinguList commented 3 months ago

@PromiseDodzi, if you make sure that no IPA value ends in a dot, your code should be fine.
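A small check along these lines could scan the whole profile for such entries. It is a sketch only: the function name is hypothetical, and the profile is assumed to be a tab-separated file with a header row containing `Grapheme` and `IPA` columns. The in-memory sample stands in for etc/orthography.tsv.

```python
import csv
import io


def ipa_ending_in_dot(profile_file, ipa_column="IPA"):
    """Return (line_number, row) pairs whose IPA value ends in a dot."""
    reader = csv.DictReader(profile_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    problems = []
    for row in reader:
        ipa = (row.get(ipa_column) or "").strip()
        if ipa.endswith("."):
            # line_num counts physical lines, with the header as line 1
            problems.append((reader.line_num, row))
    return problems


# Tiny stand-in for etc/orthography.tsv; for the real file, pass
# open("etc/orthography.tsv", encoding="utf-8", newline="") instead.
sample = io.StringIO(
    "Grapheme\tIPA\n"
    "ɔ᷈:\tɔ̀.ɔ̀.\n"
    "a\ta\n"
)
problems = ipa_ending_in_dot(sample)
for line_number, row in problems:
    print(line_number, row["Grapheme"], row["IPA"])
```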

PromiseDodzi commented 3 months ago

@PromiseDodzi, if you make sure that no IPA value ends in a dot, your code should be fine.

Oh okay. Let me go through the orthography profile, and try everything again to see then. Thank you @LinguList

PromiseDodzi commented 3 months ago

The error in the profile is here in line 173:
ɔ᷈:   ɔ̀.ɔ̀.
The dot in the end is wrong, and yields an interpretation as None.

It works now. Thank you very much

languageorphans / heathdogon

Help with Interpretation branch #11