Open PromiseDodzi opened 5 months ago
- So the "colon" is the length marker, right? If you want to replace them, I kindly ask you to modify the orthography profile and make a pull request. You can then also run the `cldfbench lexibank.makecldf` command, but maybe it is better if you run it locally and only submit the revised orthography profile so we can check it. You should go through the orthoprofile in `etc/orthography.tsv` and just modify as you see fit. We'll then check from there.
Yes, the colon is the length marker. Alright, I'll do just that.
- I mentioned in the seminar that there are TWO representations for all cases like `á.à`, namely, what you find in the column `Grouped_Segments`, which is `á.à`, and what you find in the column `Segments`, which is already `á à`. So you do not need me to replace anything nor do you need to touch the orthography profile, you just need to adjust the command that starts with `pyedictor`, that I mentioned in the course, and replace `tokens:grouped_segments` by `tokens:segments`. Let me know if you do not find this part, so I can point you to the full command.
Ok. I will make sure to inform you if there are any issues
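As a rough illustration of how the two representations relate (a sketch only, not what pyedictor or EDICTOR does internally), the `Segments` form can be derived from the `Grouped_Segments` form by splitting every grouped token on the dot. The function name `ungroup` is hypothetical.

```python
def ungroup(grouped_segments):
    """Split dot-joined groups such as 'á.à' into separate segments."""
    segments = []
    for token in grouped_segments.split():
        # Drop empty pieces so a stray leading or trailing dot adds no empty segment.
        segments.extend(part for part in token.split(".") if part)
    return segments


print(ungroup("m á.à"))
```

The result is a list of single segments, which corresponds to the space-separated value found in the `Segments` column.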
- I advise against running LexStat, since the data are specifically created with partial cognates, so we want partial cognates to be displayed and aligned. Using LexStat is in my opinion scientifically wrong here. So before we change my example to account for this (which is possible, but will be ugly), I'd like to understand the motivation.
Kindly ignore this request. This is out of curiosity. We will just maintain what you advise. Thank you
- So the "colon" is the length marker, right? If you want to replace them, I kindly ask you to modify the orthography profile and make a pull request. You can then also run the `cldfbench lexibank.makecldf` command, but maybe it is better if you run it locally and only submit the revised orthography profile so we can check it. You should go through the orthoprofile in `etc/orthography.tsv` and just modify as you see fit. We'll then check from there.
@LinguList I initially thought my attempt at this was successful. Upon careful inspection, however, I realize that the modified orthography profile does not seem to affect the results visible in the "heathdogon-ungrouped.tsv"/"heathdogon-grouped.tsv" data or in the alignment output, i.e. "heathdogon-ungrouped-shortened-aligned.tsv" and "heathdogon-grouped-shortened-aligned.tsv": there are still some colons. I have logged all the code in the "interpretation" branch for inspection. I reproduce the workflow you advise and only substitute the orthography profile in `etc/orthography.tsv` with my modified version. The modified orthography profile does not seem to affect the results.
I have also tried to run the `cldfbench lexibank.makecldf` command with the modified orthography profile in place and then reproduce the workflow you advise, but I still get the same results. I would be grateful if you could help me resolve this.
Did you make a simple check on the orthography profile? You could just delete all lines and see if this has an effect. This should result in many errors. You can also paste the output here or check TRANSCRIPTION.md; all these files provide some more information.
Check here first: https://github.com/languageorphans/heathdogon/blob/interpretation/cldf/forms.csv
This should show if your orthoprofile changes took effect.
The columns "Segments" and "Grouped_Segments" should be the ones where you find changes.
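One way to make this check less manual is a small script that lists the forms whose segment columns still contain a colon. This is only a sketch: it assumes the column names `ID`, `Segments` and `Grouped_Segments`, runs on an invented inline sample instead of the real `cldf/forms.csv`, and the helper name `rows_with_colons` is hypothetical.

```python
import csv
import io

# Invented two-row sample; the real file would be cldf/forms.csv.
SAMPLE = """ID,Form,Segments,Grouped_Segments
a-1,mɔ:,m ɔ :,m ɔ.:
a-2,mɔɔ,m ɔ ɔ,m ɔ.ɔ
"""


def rows_with_colons(handle, columns=("Segments", "Grouped_Segments")):
    """Return the IDs of rows where any of the given columns contains ':'."""
    reader = csv.DictReader(handle)
    return [
        row["ID"]
        for row in reader
        if any(":" in (row.get(col) or "") for col in columns)
    ]


print(rows_with_colons(io.StringIO(SAMPLE)))
# For the real data: open("cldf/forms.csv", encoding="utf-8", newline="")
```

If the list is empty after rerunning `cldfbench lexibank.makecldf`, the profile changes have reached the CLDF data.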
The changes are now visible here
Thank you so much
I think it was more of a procedural issue. As I followed your suggestion of doing a delete-and-see check, I have been able to resolve it. Thank you very much.
@LinguList when I run the pyedictor command that has `Segments: tokens`, i.e. the `ungrouped` command in the Makefile, on data converted with my modified orthography profile, I get an error message, which I paste below:
edictor wordlist --dataset="cldf/cldf-metadata.json" \
--namespace='{"id": "local_id", "language_id": "doculect", "variety": "variety", "concept_name": "concept","value": "value", "form": "form", "segments": "tokens","plural_segments": "plural_tokens", "comment": "note", "concept_swadesh": "swadesh"}' \
--name="heathdogon-ungrouped"
Traceback (most recent call last):
File "
I have again logged the updated code in the `interpretation` branch. Could you please help me resolve this issue too?
Do you have all columns listed in NameSpace? The error occurs because you want to split some form (tokens) that seems to be empty. This is probably the issue. Please check forms.csv in cldf again.
And make sure you ALWAYS have a value in `Segments`.
Apart from variety, concept_name and concept_swadesh, all the columns listed in NameSpace are in forms.csv. I have checked forms.csv again and there is no empty cell in the `Segments` column.
I have checked again, and the `Segments` column in forms.csv always has a value.
I cannot replicate your error, sorry.
git clone https://github.com/languageorphans/heathdogon
cd heathdogon
git branch interpretation
git checkout interpretation
git config pull.rebase false # must do this if you haven't defined it globally
git pull origin interpretation
# must merge file `raw/Dogon.comp.vocab.UNICODE-2017.lexicon.csv` manually, after problem here
cldfbench lexibank.makecldf lexibank_heathdogon.py # make sure to add paths, etc. not shown here
Then:
pip install edictor[lingpy] # supersedes pyedictor
Then:
edictor wordlist --dataset=cldf/cldf-metadata.json --namespace='{"id": "local_id", "language_id": "doculect", "variety": "variety", "concept_name": "concept","value": "value", "form": "form", "segments": "tokens","plural_segments": "plural_tokens", "comment": "note", "concept_swadesh": "swadesh"}' --name=heathdogon-ungrouped
I have done this and still get an error. Could you please show me how to merge manually once I am on the interpretation branch? This is what I ran:
git clone https://github.com/languageorphans/heathdogon.git
cd heathdogon
git checkout interpretation
#manually merge here - I am a little stuck
cldfbench lexibank.makecldf #paths were added to glottolog, clts and concepticon
Then I run:
pip install edictor[lingpy]
Finally, I run:
edictor wordlist --dataset=cldf/cldf-metadata.json --namespace='{"id": "local_id", "language_id": "doculect", "variety": "variety", "concept_name": "concept","value": "value", "form": "form", "segments": "tokens","plural_segments": "plural_tokens", "comment": "note", "concept_swadesh": "swadesh"}' --name=heathdogon-ungrouped
I get this error message:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\Users\Promise Dodzi Kpoglu\AppData\Local\Programs\Python\Python312\Scripts\edictor.exe\__main__.py", line 7, in <module>
File "C:\Users\Promise Dodzi Kpoglu\AppData\Local\Programs\Python\Python312\Lib\site-packages\edictor\cli.py", line 268, in main
return _cmd_by_name(args.subcommand)(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Promise Dodzi Kpoglu\AppData\Local\Programs\Python\Python312\Lib\site-packages\edictor\cli.py", line 228, in __call__
get_wordlist(
File "C:\Users\Promise Dodzi Kpoglu\AppData\Local\Programs\Python\Python312\Lib\site-packages\edictor\wordlist.py", line 70, in get_wordlist
wordlist = lingpy.Wordlist.from_cldf(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Promise Dodzi Kpoglu\AppData\Local\Programs\Python\Python312\Lib\site-packages\lingpy\basic\wordlist.py", line 1204, in from_cldf
D[idx] = [datatypes.get(
^^^^^^^^^^^^^^
File "C:\Users\Promise Dodzi Kpoglu\AppData\Local\Programs\Python\Python312\Lib\site-packages\lingpy\basictypes.py", line 58, in __init__
self.n = [strings(x) for x in (' '.join(iterable).split(sep) if not
^^^^^^^^^^^^^^^^^^
TypeError: sequence item 3: expected str instance, NoneType found
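The final line of this traceback can be reproduced in isolation: Python's `str.join` raises this kind of `TypeError` whenever the sequence it joins contains `None`. The snippet below is only an illustration with a hand-written list, not the lingpy code itself.

```python
segments = ["m", "ɔ̀", "ɔ̀", None]  # one segment is None instead of a string

try:
    " ".join(segments)
except TypeError as error:
    message = str(error)
    print(message)

# Position of every None value, useful for locating the bad segment.
none_positions = [i for i, seg in enumerate(segments) if seg is None]
print(none_positions)
```

The "sequence item" number in the message is the zero-based position of the first non-string value, so it hints at where in the segment list the problem sits.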
Please do the following now: zip the cldf-folder, or zip the entire folder that you use and send it to me via email or via shared cloud. Maybe, start with the CLDF folder, okay? The error points to a problem in the CLDF, there is no other way.
Do you have local changes not submitted? What does `git status` tell you in your `interpretation` branch?
Alright. I am sending it to you right away.
`git status` tells me I am on the interpretation branch, the branch is up to date and I have nothing to commit.
This is the message on my terminal after I run `git status`:
PS C:\Users\Promise Dodzi Kpoglu\temp\heathdogon> git status
On branch interpretation
Your branch is up to date with 'origin/interpretation'.
nothing to commit, working tree clean
Okay, it turns out your data HAS empty segments, as I can confirm here:
In [15]: from pycldf import Dataset
In [16]: ds = Dataset.from_metadata("cldf/cldf-metadata.json")
In [17]: for form in ds.objects("FormTable"):
    ...:     try: " ".join(form.cldf.segments)
    ...:     except: print(form.id)
    ...:
Najamba-6292_neck-1
BenTey-4227_footprint-1
BankanTey-7713_short-1
BenTey-7713_short-1
BankanTey-3426_deepholewell-1
Najamba-9293_voiceofsbcharacteristiccallofanimal-1
BenTey-5820_makeaholeinwoodenhandle-1
Nanga-5820_makeaholeinwoodenhandle-1
Nanga-2055_awlforpenetratingleather-1
Let us modify the last statement to narrow this down:
In [18]: for form in ds.objects("FormTable"):
    ...:     try: " ".join(form.cldf.segments)
    ...:     except: print(form.id, form.cldf.segments)
    ...:
Najamba-6292_neck-1 ['m', 'ɔ̀', 'ɔ̀', None]
BenTey-4227_footprint-1 ['l', 'ɔ̀', 's', 'ɔ̀', '-', 't', 'ɔ̀', 'ɔ̀', None]
BankanTey-7713_short-1 ['ɡ', 'ɔ̀', 'ɔ̀', None]
BenTey-7713_short-1 ['ɡ', 'ɔ̀', 'ɔ̀', None, 'w']
BankanTey-3426_deepholewell-1 ['n', 'ɔ̀', 'ɔ̀', None, 'w', '∼']
Najamba-9293_voiceofsbcharacteristiccallofanimal-1 ['j', 'ɔ̀', 'ɔ̀', None]
BenTey-5820_makeaholeinwoodenhandle-1 ['ɡ', 'ɔ̀', 'ɔ̀', None, '-', 'ɡ', 'ɔ̌']
Nanga-5820_makeaholeinwoodenhandle-1 ['ɡ', 'ɔ̀', 'ɔ̀', None, '-', 'ɡ', 'ɔ̀', 'ɔ́']
Nanga-2055_awlforpenetratingleather-1 ['k', 'ɛ̀', 'm', 'ɛ̀', '-', 'ɡ', 'ù', 's', 'ù', '-', 'ɡ', 'ɔ̀', 'ɔ̀', None]
Let us now check the file `forms.csv`.
For the first form-id, I find:
Najamba-6292_neck-1,,Najamba,6292_neck,mɔ᷈:\\mɔ̌ɛ̀,mɔ᷈:,m ɔ̀ ɔ̀ ,,heathdogon,,,,,m ɔ̀.ɔ̀.,m ɔ̌ ɛ̀,m ɔ̌ ɛ̀,mɔ̌ɛ̀,
So you have a trailing space in `m ɔ̀ ɔ̀ `; this is the error, and I am sure it comes from your profile.
In the data which I checked out, I find, on the contrary:
Najamba-6292_neck-1,,Najamba,6292_neck,mɔ᷈:\\mɔ̌ɛ̀,mɔ᷈:,m ɔ̌ː,,heathdogon,,,,,m ɔ̌ː,m ɔ̌ ɛ̀,m ɔ̌ ɛ̀,mɔ̌ɛ̀,
So the error lies most likely in the profile, and since I just created the data from the branch, it means that your profile shows some problems that mine does not show, or my merging corrected the error.
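The effect of the trailing space can be illustrated without any CLDF tooling. This sketch assumes, as the output above suggests, that the `Segments` cell is split on single spaces and that an empty piece is read as a missing value (`None`). The helper names `split_segments` and `has_stray_space` are hypothetical and only mimic that behaviour.

```python
def split_segments(cell, sep=" "):
    """Split a cell on `sep`, mapping empty pieces to None (assumed behaviour)."""
    return [piece if piece else None for piece in cell.split(sep)]


def has_stray_space(cell):
    """True if the cell has leading, trailing or doubled spaces."""
    return cell != " ".join(cell.split())


bad = "m ɔ̀ ɔ̀ "   # trailing space, as in the faulty forms.csv row
good = "m ɔ̌ː"     # value from the freshly generated data

print(split_segments(bad), has_stray_space(bad))
print(split_segments(good), has_stray_space(good))
```

A cell can therefore be non-empty and still produce an empty segment, which is why checking for empty cells alone did not reveal the problem.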
The error in the profile is here in line 173:
ɔ᷈: ɔ̀.ɔ̀.
The dot at the end is wrong and yields an interpretation as None.
@PromiseDodzi, if you make sure not to have any dots as the final symbol in IPA, your code should be fine.
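A check along these lines could be scripted. The sketch below assumes a tab-separated profile with a header row containing `Grapheme` and `IPA` columns (an assumption about the file layout), runs on a small inline sample rather than the real `etc/orthography.tsv`, and uses the hypothetical helper name `find_bad_ipa`.

```python
import csv
import io

# Invented three-row sample; the first data row mirrors the faulty entry.
SAMPLE_PROFILE = "Grapheme\tIPA\nɔ᷈:\tɔ̀.ɔ̀.\nɔ̌\tɔ̌\na \ta \n"


def find_bad_ipa(handle):
    """Yield (line_number, grapheme, ipa) for suspicious IPA values."""
    reader = csv.DictReader(handle, delimiter="\t")
    for line_number, row in enumerate(reader, start=2):  # header is line 1
        ipa = row.get("IPA") or ""
        # A leading/trailing dot or space can produce an empty segment downstream.
        if ipa != ipa.strip() or ipa.startswith(".") or ipa.endswith("."):
            yield line_number, row.get("Grapheme"), ipa


for problem in find_bad_ipa(io.StringIO(SAMPLE_PROFILE)):
    print(problem)
```

Running such a check before `cldfbench lexibank.makecldf` would flag entries like the one in line 173 before they reach `forms.csv`.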
Oh okay. Let me go through the orthography profile, and try everything again to see then. Thank you @LinguList
It works now. Thank you very much
I have carefully reproduced the workflow using the CLDF transformed data. I have logged the reproduction code in branch 4, "Interpretation" of the repo for inspection. The only thing I modify is the coverage number from 750 to 288 (we want 20 languages, and 288 allows us to have this).
When we inspect the alignments in EDICTOR, three main issues come up. We will be grateful if you can help us resolve them.
Thank you