lexibank / lexibank-analysed

Study on lexibank data (presenting the lexibank dataset).
Creative Commons Attribution 4.0 International

Bliss2 #48

Closed LinguList closed 1 year ago

LinguList commented 1 year ago

Here are the latest add-ons with forms etc., created with Glottolog 4.7 and Concepticon 3.1.

johenglisch commented 1 year ago

I just fixed a minor nitpick because that if statement in _add_languages behaves weirdly if language.name has a None value:

>>> language_name = None
>>> str(language_name).strip()
'None'
>>> not str(language_name).strip() or language_name == 'None'
False

Again, it's a nitpick -- we don't seem to have any None values in our data right now. But it seems a bit more robust.

Other than that this looks good to go.
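A minimal sketch of a stricter check, assuming the intent is to skip languages whose name is missing, empty, or the literal string 'None'. The helper name `has_usable_name` is hypothetical and not taken from the repository; the actual change committed to the PR is not shown in this thread.

```python
def has_usable_name(name):
    """Return True only for a non-empty name that is not the string 'None'."""
    if name is None:
        return False
    text = str(name).strip()
    return bool(text) and text != "None"


# The expression from the session above does not flag a real None value:
language_name = None
original_check = not str(language_name).strip() or language_name == "None"
print(original_check)                  # False: the None value slips through
print(has_usable_name(language_name))  # False: the name is reported as unusable
```

Testing `name is None` before converting to a string avoids relying on the side effect that `str(None)` happens to produce the text 'None'.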

LinguList commented 1 year ago

I'm of course fine with refining that part. Running the code takes some 30+ minutes on my office computer, so if we agree it is safe to replace the line without running again, that seems easiest to me.

LinguList commented 1 year ago

And if it is easy for you to add that right away, @johenglisch, please do so :)

johenglisch commented 1 year ago

please do so

I already did (I hope that wasn't too hasty).

if we agree it is safe to replace the line without running again

I had it run in the background this morning just to make sure and the line change didn't affect anything.

LinguList commented 1 year ago

Super, if you are all fine, we can merge then and also release this as version 1.0 of lexibank :)

johenglisch commented 1 year ago

I was confused for a second because I thought this was already released. Turns out that was v0.2. (^^)

I re-ran pytest just in case and the validator seems to have a few complaints: (side note: it might not be a bad idea to get CI set up)

cldf/frequencies.csv:6704 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:6757 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:6829 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:7254 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:7309 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:25303 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:25478 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:25772 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:25875 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:25993 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:26076 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:26137 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:26359 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:29243 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:32382 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:38129 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:41875 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:48948 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:51933 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:54666 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:55621 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:56496 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:58494 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:61189 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:65049 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:67499 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:69964 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:74773 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:85456 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:87510 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:90267 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:93957 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:96601 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:99419 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:102376 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
cldf/frequencies.csv:106603 Key `servamalagasy-MerinaMaevatanana` not found in table languages.csv
================================ warnings summary =================================
test.py::test_valid
  ENV/lib/python3.8/site-packages/csvw/metadata.py:1372: UserWarning: Unspecified column "Dataset" in table languages.csv
    warnings.warn(

LinguList commented 1 year ago

The servamalagasy key is not found because the glottocode was modified and needs to be updated. It would work with Glottolog 4.4, but I thought we had better use 4.7. So the tests must be adjusted here.

We have problems with a few duplicates, but only in individual datasets. I suggest we leave all cases related to new data releases out for now, explicitly using the 0.2 data but with the new code that more transparently shares forms for some major candidates (which have transcriptions and the like).

johenglisch commented 1 year ago

The servamalagasy key is not found because the glottocode was modified and needs to be updated. It would work with Glottolog 4.4, but I thought we had better use 4.7.

Hm, somehow that feels off to me. The language is loaded successfully and added to frequencies.csv (and I assume counting sounds also requires access to the actual forms?):

$ csvgrep -c Language_ID -m servamalagasy-MerinaMaevatanana cldf/frequencies.csv \
    | csvcut -c Parameter_ID,Value \
    | head
Parameter_ID,Value
from_unrounded_close_front_to_rounded_close_back_diphthong,1
from_unrounded_close_front_to_unrounded_close-mid_front_diphthong,1
from_unrounded_close_front_to_unrounded_open_front_diphthong,9
from_unrounded_open_front_to_non-syllabic_rounded_close_back_diphthong,7
from_unrounded_open_front_to_non-syllabic_unrounded_close_front_diphthong,8
pre-nasalized_voiced_alveolar_sibilant_affricate_consonant,3
pre-nasalized_voiced_alveolar_stop_consonant,5
pre-nasalized_voiced_bilabial_stop_consonant,4
pre-nasalized_voiced_retroflex_sibilant_affricate_consonant,9

But at the same time it's missing from the form table and the language table:

$ unzip -q -c cldf/forms.csv.zip \
    | csvgrep -c Language_ID -m servamalagasy-MerinaMaevatanana \
    | csvcut -c Parameter_ID,Value
Parameter_ID,Value

Intuitively that feels more like a bug in the program rather than a Glottolog version thing. No matter what version of glottolog we're using, I would expect the language ID to appear either in all relevant tables or in none of them (whichever is desired for a specific case).
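The consistency expectation above can be checked directly. The following is a minimal sketch, assuming the referring table has a `Language_ID` column (as the csvgrep call above shows for frequencies.csv) and that the language table identifies rows by an `ID` column, which is an assumption here. The function name `dangling_keys` and the sample rows are hypothetical stand-ins, not data from the repository.

```python
import csv
import io


def dangling_keys(referring_csv, key_column, target_csv, id_column="ID"):
    """Return the sorted keys used in referring_csv that target_csv lacks."""
    valid_ids = {row[id_column] for row in csv.DictReader(target_csv)}
    used_ids = {row[key_column] for row in csv.DictReader(referring_csv)}
    return sorted(used_ids - valid_ids)


# Tiny stand-ins for cldf/languages.csv and cldf/frequencies.csv.
languages = io.StringIO("ID,Name\nlang-a,Language A\n")
frequencies = io.StringIO(
    "ID,Language_ID,Parameter_ID,Value\n"
    "1,lang-a,some_sound,3\n"
    "2,lang-b,some_sound,5\n"
)

missing = dangling_keys(frequencies, "Language_ID", languages)
print(missing)  # ['lang-b']
```

An empty result means every language referenced by the frequencies also appears in the language table; this does not replace cldf validate, it only lists the offending IDs once instead of once per row.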

So the tests must be adjusted here.

I don't quite know what you mean by ‘adjusting the tests’. The tests are just cldf validate.

I suggest we leave all cases related to new data releases out for now, using explicitly the 0.2 data but with the new code that more transparently shares forms for some major candidates.

Do you mean load the data specified in the old lexibank.tsv file rather than the data from the newer lexibank-bliss.tsv? Shouldn't be a problem but I wonder if we're tossing out the baby with the bath water there.

LinguList commented 1 year ago

Yes, it's weird and not very pleasant that we have this problem now. So there is a bug that must be fixed before releasing. But my intuition that this is related to the glottocode was right, because the glottocode is meri1243, which is outdated. The code I wrote tries to exclude these languages from being written to the language table (and to the form table). It is possible, however, that the check was not applied when writing frequencies (!). So it must be applied in all cases.

LinguList commented 1 year ago

Regarding adjusting the tests: that was an error on my side, I did not know we just run cldf validate here.

LinguList commented 1 year ago

@johenglisch, I found the bug!

In the *py script, we have these lines (385-387) in the function _add_languages:

                if language.latitude and language.glottocode and condition(language):
                    _add_language(writer, language, features, attr_features,
                            collection=collection, visited=visited)
                    yield language

The problem is that I introduced an early exit in _add_language for the case where the glottocode is NOT in Glottolog (the languoids dictionary). In this case, it catches a KeyError and returns None. However, as we don't check the return value of _add_language, the caller goes on to yield the language object, so frequencies are calculated for the language with the invalid glottocode anyway.

I think a fix would consist of just rewriting this as:

                if language.latitude and language.glottocode and condition(language):
                    if _add_language(writer, language, features, attr_features,
                            collection=collection, visited=visited):
                        yield language

LinguList commented 1 year ago

I can try and rerun this on Monday, if nobody else has time for it. On Tuesday / Wednesday I am not in Leipzig, where I have my fast computer, but would then be back on Thursday.

LinguList commented 1 year ago

I am trying to run it now with the proposed modification.

LinguList commented 1 year ago

Update: I just pushed the new version in which the tests run, but there's a warning that the column Dataset is not specified in the file languages.csv. If any of you has a quick fix for that (as I don't have time to look further into it today), that would be super, as it would mean we can merge already.

LinguList commented 1 year ago

The modification, by the way, required an explicit check with True and False: the function previously always returned None, and now it states explicitly whether it failed or worked.
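The bug and the fix can be reproduced in a small, self-contained sketch. Everything here is a simplified stand-in: `LANGUOIDS` is a toy dictionary rather than the Glottolog catalog, and `add_language`, `add_languages_buggy` and `add_languages_fixed` are hypothetical names that only mimic the control flow described above, not the repository's actual signatures.

```python
# Toy stand-in for the languoids dictionary; it deliberately lacks meri1243.
LANGUOIDS = {"stan1293": "English"}


def add_language(written, glottocode):
    """Record the language and return True, or return False if the code is unknown."""
    try:
        name = LANGUOIDS[glottocode]
    except KeyError:
        return False
    written.append((glottocode, name))
    return True


def add_languages_buggy(written, glottocodes):
    """Yield every glottocode, even when add_language refused to write it."""
    for glottocode in glottocodes:
        add_language(written, glottocode)  # return value ignored
        yield glottocode


def add_languages_fixed(written, glottocodes):
    """Yield only the glottocodes that add_language actually wrote."""
    for glottocode in glottocodes:
        if add_language(written, glottocode):
            yield glottocode


codes = ["stan1293", "meri1243"]
buggy = list(add_languages_buggy([], codes))
fixed = list(add_languages_fixed([], codes))
print(buggy)  # ['stan1293', 'meri1243']
print(fixed)  # ['stan1293']
```

Returning an explicit True or False keeps the caller's `if` meaningful; with an implicit None on every path, the fixed generator would never yield anything.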

johenglisch commented 1 year ago

column Dataset is not specified

I'll take a look at that -- theoretically this should only involve adding one line to the schema.
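One way to see which columns trigger such a warning is to compare the CSV header with the columns declared in the metadata. This is a minimal sketch using only the standard library; the metadata fragment follows the general CSVW layout (`tables`, `tableSchema`, `columns`, `name`) but is a hypothetical, heavily shortened example, and `unspecified_columns` is not a function from csvw or pycldf.

```python
import csv
import io
import json

# Hypothetical metadata fragment; the real file declares many more columns.
METADATA = json.loads("""
{
  "tables": [
    {
      "url": "languages.csv",
      "tableSchema": {"columns": [{"name": "ID"}, {"name": "Name"}]}
    }
  ]
}
""")


def unspecified_columns(metadata, url, csv_file):
    """Return CSV header names that the table schema for url does not declare."""
    table = next(t for t in metadata["tables"] if t["url"] == url)
    declared = {col["name"] for col in table["tableSchema"]["columns"]}
    header = next(csv.reader(csv_file))
    return [name for name in header if name not in declared]


languages = io.StringIO("ID,Name,Dataset\nlang-a,Language A,some-dataset\n")
before = unspecified_columns(METADATA, "languages.csv", languages)
print(before)  # ['Dataset']

# Declaring the column corresponds to the one-line schema addition.
METADATA["tables"][0]["tableSchema"]["columns"].append({"name": "Dataset"})
languages.seek(0)
after = unspecified_columns(METADATA, "languages.csv", languages)
print(after)  # []
```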

LinguList commented 1 year ago

I just found the problem: our custom language in Lexibank must also define a dataset, which it does not do so far. So I am trying to rerun one more time now, to make sure this fixes the problem.

As we are reaching a rather complex terrain with these intertwined datasets now, we may want to discuss at some point if this can be done in a more straightforward way. But on the other hand: the matter is also not completely trivial and simple, involving quite a few different decisions, so maybe, we need to embrace the fact that these datasets are not done overnight (but that with sqlite they can then be used in many applications).

LinguList commented 1 year ago

Update 2: now, the tests run without an error on my computer. So I think this could then be merged, to publish version 1.0 of lexibank data.

johenglisch commented 1 year ago

I think this could then be merged

Sounds good to me.

publish version 1.0 of lexibank data

Who should do the release? Do you want me to do it or do you want to have the honour of pressing the button yourself? (^^)