lessersunda / lexirumah-data

Data for the lexirumah clld

Could we add a cldfbench-cldf dataset for the data? #123

Closed LinguList closed 2 years ago

LinguList commented 3 years ago

We will add the concept list to Concepticon 2.5 (the PR was just merged), and since this may change the data (our links to Concepticon now include more recently added concepts), it would be useful to do the CLDF conversion with cldfbench/pylexibank, to make sure that the data also runs through our quality checks.

If you allow, @Anaphory, I could either add an extra repository, or I could make a PR where the data from raw is processed with the help of cldfbench. An extra repository may be easier, as it preserves the unified access to CLDF data, but it is in fact not really needed: one would just add the cldfbench_lexiruhma.py script in the main directory.
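
For readers unfamiliar with such entry points, a minimal pylexibank script might look roughly like the following sketch. The dataset id, the etc/ inputs, and the forms.csv column names (Lect_ID, Concept_ID, Form) are assumptions for illustration, not the actual code discussed in this thread.

```python
# Minimal sketch of a pylexibank entry point; all names are illustrative.
from pathlib import Path

import pylexibank


class Dataset(pylexibank.Dataset):
    dir = Path(__file__).parent
    id = "lexirumah"  # hypothetical dataset id

    def cmd_makecldf(self, args):
        # Register sources, languages, and concepts from raw/ and etc/
        # (assumed layout), then re-emit the forms so they run through
        # pylexibank's Concepticon and CLTS quality checks.
        args.writer.add_sources()
        args.writer.add_languages()
        args.writer.add_concepts()
        for row in self.raw_dir.read_csv("forms.csv", dicts=True):
            # Column names are assumptions about the raw dump.
            args.writer.add_forms_from_value(
                Language_ID=row["Lect_ID"],
                Parameter_ID=row["Concept_ID"],
                Value=row["Form"],
            )
```

With such a script in place, a command like `cldfbench lexibank.makecldf` rebuilds the CLDF data and applies the standard checks along the way.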

Anaphory commented 3 years ago

I think it would be marginally confusing to have https://github.com/lessersunda/lexirumah-data/blob/with_lexi_data/pylexirumah/lexibank.py and also a separate repository for lexibank. If you submit a PR to this repository here, I'll happily merge it. That also increases the chance that the script gets updated if editing and expanding LexiRumah is taken up again in the future, which is a possibility.

Please check that the h is the last letter wherever you refer to this dataset: LexiRumah, not LexiRuhma.

LinguList commented 3 years ago

Alright! This would overwrite the data in cldf/. Do you have any specific requirements, or are you fine with the most recent CLDF release?

LinguList commented 3 years ago

And one more question: the current dump of the data is in raw/forms.csv, right?

Anaphory commented 3 years ago

No, we have been working with CLDF throughout instead of weird other formats, so the current version of the data (9 months old) is in cldf/forms.csv, with all the other CLDF files alongside it. I must admit I don't remember why there is still data in raw/. The files seem to be largely identical, apart from two commits in 2019 that made them diverge.

Did you have a look at the existing lexibank entry point? It very much formalizes the policy of “this is a CLDF dataset already, just make it available”.

LinguList commented 3 years ago

Where do I find the lexibank entry point? All that I think would need to be done is 1) include Concepticon mappings for 2.5, and 2) add mappings for those sounds which are not strict CLTS, i.e., an additional orthography profile. Maybe it is then enough to just modify the entry point, but I did not find it when browsing the code in GitHub's web interface.
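
For illustration of what such a profile amounts to: an orthography profile is just a table mapping graphemes in the data to target segments, and the segments package applies it. A tiny sketch, using the tç → c replacement that comes up later in this thread (the identity rules are padding for the example):

```python
from segments import Profile, Tokenizer

# Sketch of an orthography profile: map the non-CLTS grapheme "tç" to "c";
# the remaining rules just pass sounds through unchanged.
profile = Profile(
    {"Grapheme": "tç", "IPA": "c"},
    {"Grapheme": "p", "IPA": "p"},
    {"Grapheme": "u", "IPA": "u"},
    {"Grapheme": "a", "IPA": "a"},
)
tokenize = Tokenizer(profile=profile)

print(tokenize("putça", column="IPA"))  # -> "p u c a"
```

In a lexibank dataset, such a table typically lives in etc/orthography.tsv and is applied automatically during the CLDF build.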

Anaphory commented 3 years ago

It's actually the file I linked to above: https://github.com/lessersunda/lexirumah-data/blob/with_lexi_data/pylexirumah/lexibank.py

LinguList commented 3 years ago

I just had a look, and it is not straightforward to use our data check routines on top of the entry point, as the entry point assumes a different structure (an etc file for languages and orthography profiles, etc.), so it is not trivial to run the checks. All I could do is provide two scripts: one that checks the concept mappings against the actual Concepticon, and one that checks against CLTS, so that the sound conversions in the orthography profile are in line with other datasets and can be directly compared with PHOIBLE and the like.
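
A rough sketch of what such check scripts could look like, assuming local clones of the concepticon-data and clts repositories (the paths, column names, and file locations are assumptions about this dataset):

```python
from csvw.dsv import reader
from pyclts import CLTS
from pyconcepticon import Concepticon

concepticon = Concepticon("path/to/concepticon-data")  # local clone
bipa = CLTS("path/to/clts").bipa  # local clone

# Check 1: every Concepticon ID in the data must be a known concept set.
for row in reader("cldf/concepts.csv", dicts=True):
    cid = row["Concepticon_ID"]
    if cid and cid not in concepticon.conceptsets:
        print("unknown concept set:", row["ID"], cid)

# Check 2: every segment must be a valid BIPA (CLTS) sound.
for row in reader("cldf/forms.csv", dicts=True):
    for segment in row["Segments"].split():
        if segment in ("+", "_"):  # boundary markers, not sounds
            continue
        if bipa[segment].type == "unknownsound":
            print("invalid segment:", row["ID"], segment)
```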

Anaphory commented 3 years ago

> I just had a look, and it is not straightforward to use our data check routines on top of the entry point, as the entry point assumes a different structure (an etc file for languages and orthography profiles, etc.), so it is not trivial to run the checks.

I'm sure we could add an etc file (or folder) to the lexirumah-data repository, and I still think that would be the more robust solution with regard to potential future extensions, compared to keeping these things in a completely separate repository.

But I have some difficulties understanding the context. Presumably, we already have Concepticon mappings in concepts.csv, segments that conform to an older version of CLTS in forms.csv, and languages in lects.csv. The point of CLDF is to have a standard structure for datasets, after all. I would consider any lack of conformity a bug of the LexiRumah dataset which should be fixed right here. Is there a conceptual problem with that?

> All I could do is provide two scripts: one that checks the concept mappings against the actual Concepticon, and one that checks against CLTS, so that the sound conversions in the orthography profile are in line with other datasets and can be directly compared with PHOIBLE and the like.

It makes sense to add checks to this repository that forms and concepts really conform to CLTS and Concepticon (instead of just promising that they do, without checks), but I don't understand why this is a problem. Whether a future pull request contains changes to one file or to three doesn't make much of a difference.

Can you point me to another CLDF dataset example which might help me understand the problem and what needs to be done?

LinguList commented 3 years ago

The concept lists are integrated via cldfbench and specified in the metadata file of a wordlist repository (see, as an example from the lexibank organization, https://github.com/lexibank/wanghmongmien/).

Versions are then also written to the cldf/cldf-metadata.json, so we know which version of Concepticon was used.

With CLTS, this is similar (version specified), and we have ways to check for prosodic problems (segmentations that lead to empty morphemes, for example).
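
For illustration, the recorded catalog versions can be read back from the metadata with pycldf; a sketch, assuming the property names of a typical lexibank-built cldf-metadata.json:

```python
from pycldf import Dataset

ds = Dataset.from_metadata("cldf/cldf-metadata.json")
# In lexibank-built datasets, provenance entries record the reference
# catalogs (Concepticon, CLTS, Glottolog) and the version tags used.
for entry in ds.properties.get("prov:wasDerivedFrom", []):
    print(entry.get("rdf:about"), entry.get("dc:created"))
```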

But if you have ways to check without using cldfbench, this whole issue can just be closed. In that case, consider this issue a note that an updated version of the concept list is already in Concepticon's master: it can be used in the future (if the concepts don't change) and has the advantage of being curated by many people and regularly updated.

LinguList commented 3 years ago

The identifier of the list is Klamer-2018-607. If you have any requests for modifications, or suggestions, these are also welcome. We will probably release the next version of Concepticon next week.

LinguList commented 3 years ago

I just checked the segments with CLTS; it turns out there are only two problematic segments:

| Form ID | Segments | Offending segment |
| --- | --- | --- |
| lexirumah-urua1244-eleven-1 | p u tç a _ r e s i n _ s a | tç |
| lexirumah-urua1244-twelve-1 | p u tç a _ r e s i n _ n u a | tç |
| lexirumah-urua1244-eighteen-1 | p u tç a _ r e s i n _ t e r i n u a | tç |
| lexirumah-urua1244-fifteen-1 | p u tç a _ r e s i _ n i m a | tç |
| lexirumah-urua1244-fourteen-1 | p u tç a _ r e s i n _ f a t | tç |
| lexirumah-urua1244-nineteen-1 | p u tç a _ r e s i n _ s a p u t i | tç |
| lexirumah-urua1244-seventeen-1 | p u tç a _ r e s i n _ t a r a n s a | tç |
| lexirumah-urua1244-sixteen-1 | p u tç a _ r e s i _ n e m | tç |
| lexirumah-urua1244-thirteen-1 | p u tç a _ r e s i n _ t e n i | tç |
| lexirumah-maib1239-salt-2 | p o - k a s | - |

The tç should rather be simply c (or another valid CLTS segment).

LinguList commented 3 years ago

There are furthermore about 750 forms with phonotactic properties which I consider problematic (e.g., for lingpy alignments): they have repeated _ _ word breaks, repeated + + morpheme breaks, or initial or final breaks. Since such breaks are typically used to indicate that something is a clitic, which is word-form-external information, this would be better handled in an extra column (see the sketch below).

segments.txt
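
The kind of check behind this list can be sketched in a few lines; the file path and column names are the usual CLDF ones and assumed to match this dataset:

```python
from csvw.dsv import reader

BOUNDARIES = {"_", "+"}  # word break and morpheme break

for row in reader("cldf/forms.csv", dicts=True):
    segments = row["Segments"].split()
    if not segments:
        continue
    # Flag repeated boundary markers as well as markers at either edge.
    repeated = any(
        a in BOUNDARIES and b in BOUNDARIES
        for a, b in zip(segments, segments[1:])
    )
    if repeated or segments[0] in BOUNDARIES or segments[-1] in BOUNDARIES:
        print("problematic boundaries:", row["ID"], " ".join(segments))
```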

LinguList commented 3 years ago

Last but not least, here are the modified concept sets (comments welcome; I noticed I had some problems in my mappings, but have now gone over them carefully again).

| English | Concepticon ID (old) | Concepticon ID (new) | Concepticon Gloss (new) |
| --- | --- | --- | --- |
| tokay gecko | 2355 | | |
| return | 142 | 581 | COME BACK |
| know | 1410 | 3626 | KNOW |
| cuscus | 470 | | |
| one hundred thousand | 2078 | 3532 | ONE HUNDRED THOUSAND |
| traditional house | 1252 | | |
| hide | 602 | 2486 | HIDE |
| louse in hair; head louse; mother louse | 1392 | 310 | HEAD LOUSE |
| day | 1260 | 1225 | DAY (NOT NIGHT) |
| smell | 1586 | 2124 | SMELL |
| 2sg; 2pl | ? | ? | ? |
| to chase away, to expel | 30 | | |
| think | 1415 | 2271 | THINK |
| penalty | 1196 | | |
| thin (non-human) | 2307 | 2308 | THIN |
| command; order | 1128 | 1998 | COMMAND |
| nephew; niece | 173 | 3890 | NEPHEW OR NIECE |
| below | 2094 | 1485 | BELOW OR UNDER |
| hit (drum) | 11 | | |
| search for; to hunt for | 1468 | | |
| skewer | 398 | | |
| father's sister | 170 | 2691 | PATERNAL AUNT (FATHER'S SISTER) |
| ridge; ridgepole; peak; tip | 280 | 1748 | RIDGE |
| rice grain head | 2749 | | |
| scared | 1419 | 3033 | SCARED |
| burn (clear land) | 141 | 3539 | BURN LAND |
| mother's brother | 1984 | 2692 | MATERNAL UNCLE (MOTHER'S BROTHER) |
| above | 2379 | 1741 | ABOVE |
| rule; govern | 382 | 1846 | RULE |
| fall over | 1280 | 2894 | TUMBLE (FALL DOWN) |
| blow | 176 | 175 | BLOW (OF WIND) |
| (coral) reef | 660 | | |
| dry (in sun) | 2015 | 3364 | DRY IN SUN |
| finished | 1766 | | |
| husked rice; uncooked rice | 926 | 3289 | UNCOOKED RICE |
| chase; run after | 1085 | | |
| betel vine | 117 | 177 | BETEL PEPPER VINE |
| sweat | 125 | 2458 | PERSPIRE OR SWEAT |
| leech | 949 | 2273 | LEECH |
| sleepy | 1757 | 3620 | SLEEPY |
| carry | 413 | 700 | CARRY |
| coconut shell | 2649 | | |

LinguList commented 3 years ago

And a last remark: since the distinction between word boundary and morpheme boundary is a characteristic of the morpheme relations rather than of the morpheme form itself, I'd also suggest using only one character for separation. I was of a different opinion some time ago, but given the problems this has at times created in lingpy's alignments, I now support using one separator only. This is, however, not in any form required by CLDF.
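
A sketch of the suggested normalization, with the boundary types moved into a separate (hypothetical) column:

```python
# Collapse word breaks ("_") into plain morpheme breaks ("+") and keep
# the original boundary types aside, e.g. for a separate CLDF column.
def normalize(segments):
    boundary_types = [s for s in segments if s in ("_", "+")]
    normalized = ["+" if s == "_" else s for s in segments]
    return normalized, boundary_types

segments = "p u tç a _ r e s i n _ s a".split()
normalized, kinds = normalize(segments)
# normalized uses "+" only; kinds records that both breaks were word breaks.
```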

Anaphory commented 3 years ago

So I think the fixes above (replacing the invalid tç segments and updating the Concepticon mappings) are easy to do. They are, reasonably, bugfixes or added features, so they would go through with a minor version or a patch number bump.

As for the comments about the morpheme boundaries (unify them, and denote their nature in a separate column: affix, clitic, separate word; circumfix or not), I'm hesitant to make that change for now. It would be a major change to the data, warranting a new major version, and such a column is not part of CLDF yet, so the change would lose information that is currently, in principle, accessible to CLDF-compatible software.

I'm still not sure in which form the Concepticon and CLTS validators would be needed. The wanghmongmien dataset has no such tests in its lexibank entry point, and its test.py only validates the dataset itself, independently of Concepticon and CLTS.

LinguList commented 3 years ago

I completely understand regarding the morpheme boundaries; it is easy to handle this in a few lines of code when doing alignments and the like, and it is, as you point out, not part of any CLDF requirement.