lessersunda / lexirumah-data

Data for the lexirumah clld

Could we add a cldfbench-cldf dataset for the data? #123

Closed LinguList closed 2 years ago

LinguList commented 3 years ago

We will add the concept list to Concepticon 2.5 (the PR was just merged), and since this may change the data (our links to Concepticon now include more recently added concepts), it would be useful to do the CLDF conversion with cldfbench/pylexibank, to make sure that the data also runs through our quality checks.

If you allow, @Anaphory, I could either add an extra repository, or I could make a PR where the data from raw is processed with the help of cldfbench. An extra repository may be easier, as it preserves the unified access to CLDF data, but it is in fact not really needed: one would just add the cldfbench_lexiruhma.py script in the main directory.
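
For readers unfamiliar with such entry points, a minimal pylexibank script might look roughly like the following sketch. The dataset id, the etc/ inputs, and the forms.csv column names (Lect_ID, Concept_ID, Form) are assumptions for illustration, not the actual code discussed in this thread.

```python
# Minimal sketch of a pylexibank entry point; all names are illustrative.
from pathlib import Path

import pylexibank


class Dataset(pylexibank.Dataset):
    dir = Path(__file__).parent
    id = "lexirumah"  # hypothetical dataset id

    def cmd_makecldf(self, args):
        # Register sources, languages, and concepts from raw/ and etc/
        # (assumed layout), then re-emit the forms so they run through
        # pylexibank's Concepticon and CLTS quality checks.
        args.writer.add_sources()
        args.writer.add_languages()
        args.writer.add_concepts()
        for row in self.raw_dir.read_csv("forms.csv", dicts=True):
            # Column names are assumptions about the raw dump.
            args.writer.add_forms_from_value(
                Language_ID=row["Lect_ID"],
                Parameter_ID=row["Concept_ID"],
                Value=row["Form"],
            )
```

With such a script in place, a command like `cldfbench lexibank.makecldf` rebuilds the CLDF data and applies the standard checks along the way.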

Anaphory commented 3 years ago

I think it would be marginally confusing to have https://github.com/lessersunda/lexirumah-data/blob/with_lexi_data/pylexirumah/lexibank.py and also a separate repository for lexibank. If you submit a PR to this repository here, I'll happily merge it. That also increases the chance that the script gets updated if editing and expanding LexiRumah is taken up again in the future, which is a possibility.

Please check that the h is the last letter wherever you refer to this dataset: LexiRumah, not LexiRuhma.

LinguList commented 3 years ago

Alright! This would overwrite the data in cldf/. Do you have any specific requirements, or are you fine with the most recent CLDF release?

LinguList commented 3 years ago

And one more question: the current dump of the data is in raw/forms.csv, right?

Anaphory commented 3 years ago

No, we have been working with CLDF throughout instead of weird other formats, so the current version of the data (9 months old) is in cldf/forms.csv, with all the other CLDF files alongside it. I must admit I don't remember why there is still data in raw/. The files seem to be largely identical, apart from two commits in 2019 that made them diverge.

Did you have a look at the existing lexibank entry point? It very much formalizes the policy of “this is a CLDF dataset already, just make it available”.

LinguList commented 3 years ago

Where do I find the lexibank entry point? All that I think would need to be done is 1) include Concepticon mappings for 2.5, and 2) add mappings for those sounds which are not strict CLTS, i.e., an additional orthography profile. Maybe it is then enough to just modify the entry point, but I did not find it when browsing the code in GitHub's web interface.
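
For illustration of what such a profile amounts to: an orthography profile is just a table mapping graphemes in the data to target segments, and the segments package applies it. A tiny sketch, using the tç → c replacement that comes up later in this thread (the identity rules are padding for the example):

```python
from segments import Profile, Tokenizer

# Sketch of an orthography profile: map the non-CLTS grapheme "tç" to "c";
# the remaining rules just pass sounds through unchanged.
profile = Profile(
    {"Grapheme": "tç", "IPA": "c"},
    {"Grapheme": "p", "IPA": "p"},
    {"Grapheme": "u", "IPA": "u"},
    {"Grapheme": "a", "IPA": "a"},
)
tokenize = Tokenizer(profile=profile)

print(tokenize("putça", column="IPA"))  # -> "p u c a"
```

In a lexibank dataset, such a table typically lives in etc/orthography.tsv and is applied automatically during the CLDF build.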

Anaphory commented 3 years ago

It's actually the file I linked to above: https://github.com/lessersunda/lexirumah-data/blob/with_lexi_data/pylexirumah/lexibank.py

LinguList commented 3 years ago

I just had a look, and it is not straightforward to use our data check routines on top of the entry point, as the entry point assumes a different structure (an etc file for languages and orthography profiles, etc.), so it is not trivial to run the checks. All I could do is provide two scripts: one that checks the concept mappings against the actual Concepticon, and one that checks against CLTS, so that the sound conversions in the orthography profile are in line with other datasets and can be directly compared with PHOIBLE and the like.
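
A rough sketch of what such check scripts could look like, assuming local clones of the concepticon-data and clts repositories (the paths, column names, and file locations are assumptions about this dataset):

```python
from csvw.dsv import reader
from pyclts import CLTS
from pyconcepticon import Concepticon

concepticon = Concepticon("path/to/concepticon-data")  # local clone
bipa = CLTS("path/to/clts").bipa  # local clone

# Check 1: every Concepticon ID in the data must be a known concept set.
for row in reader("cldf/concepts.csv", dicts=True):
    cid = row["Concepticon_ID"]
    if cid and cid not in concepticon.conceptsets:
        print("unknown concept set:", row["ID"], cid)

# Check 2: every segment must be a valid BIPA (CLTS) sound.
for row in reader("cldf/forms.csv", dicts=True):
    for segment in row["Segments"].split():
        if segment in ("+", "_"):  # boundary markers, not sounds
            continue
        if bipa[segment].type == "unknownsound":
            print("invalid segment:", row["ID"], segment)
```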

Anaphory commented 3 years ago

> I just had a look, and it is not straightforward to use our data check routines on top of the entry point, as the entry point assumes a different structure (an etc file for languages and orthography profiles, etc.), so it is not trivial to run the checks.

I'm sure we could add an etc file (or folder) to the lexirumah-data repository, and I still think that would be the more robust solution with regard to potential future extensions, compared to keeping these things in a completely separate repository.

But I have some difficulties understanding the context. Presumably, we already have Concepticon mappings in concepts.csv, segments that conform to an older version of CLTS in forms.csv, and languages in lects.csv. The point of CLDF is to have a standard structure for datasets, after all. I would consider any lack of conformity a bug of the LexiRumah dataset which should be fixed right here. Is there a conceptual problem with that?

> All I could do is provide two scripts: one that checks the concept mappings against the actual Concepticon, and one that checks against CLTS, so that the sound conversions in the orthography profile are in line with other datasets and can be directly compared with PHOIBLE and the like.

It makes sense to add checks to this repository that forms and concepts really conform to CLTS and Concepticon (instead of just promising that they do, without checks), but I don't understand why this is a problem. Whether a future pull request contains changes to one file or to three doesn't make much of a difference.

Can you point me to another CLDF dataset example which might help me understand the problem and what needs to be done?

LinguList commented 3 years ago

The concept lists are integrated via cldfbench and specified in the metadata file of a wordlist repository (see, as an example from the lexibank organization, https://github.com/lexibank/wanghmongmien/).

Versions are then also written to the cldf/cldf-metadata.json, so we know which version of Concepticon was used.

With CLTS, this is similar (version specified), and we have ways to check for prosodic problems (segmentations that lead to empty morphemes, for example).
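
For illustration, the recorded catalog versions can be read back from the metadata with pycldf; a sketch, assuming the property names of a typical lexibank-built cldf-metadata.json:

```python
from pycldf import Dataset

ds = Dataset.from_metadata("cldf/cldf-metadata.json")
# In lexibank-built datasets, provenance entries record the reference
# catalogs (Concepticon, CLTS, Glottolog) and the version tags used.
for entry in ds.properties.get("prov:wasDerivedFrom", []):
    print(entry.get("rdf:about"), entry.get("dc:created"))
```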

But if you have ways to check without using cldfbench, this whole issue can just be closed. In that case, consider this issue a note that an updated version of the concept list is already in Concepticon's master: it can be used in the future (if the concepts don't change) and has the advantage of being curated by many people and regularly updated.

LinguList commented 3 years ago

The identifier of the list is Klamer-2018-607. If you have any requests for modifications, or suggestions, these are also welcome. We will probably release the next version of Concepticon next week.

LinguList commented 3 years ago

I just checked the segments with CLTS; it turns out there are only two problematic segments:

| Form ID | Segments | Offending segment |
| --- | --- | --- |
| lexirumah-urua1244-eleven-1 | p u tç a _ r e s i n _ s a | tç |
| lexirumah-urua1244-twelve-1 | p u tç a _ r e s i n _ n u a | tç |
| lexirumah-urua1244-eighteen-1 | p u tç a _ r e s i n _ t e r i n u a | tç |
| lexirumah-urua1244-fifteen-1 | p u tç a _ r e s i _ n i m a | tç |
| lexirumah-urua1244-fourteen-1 | p u tç a _ r e s i n _ f a t | tç |
| lexirumah-urua1244-nineteen-1 | p u tç a _ r e s i n _ s a p u t i | tç |
| lexirumah-urua1244-seventeen-1 | p u tç a _ r e s i n _ t a r a n s a | tç |
| lexirumah-urua1244-sixteen-1 | p u tç a _ r e s i _ n e m | tç |
| lexirumah-urua1244-thirteen-1 | p u tç a _ r e s i n _ t e n i | tç |
| lexirumah-maib1239-salt-2 | p o - k a s | - |

The tç should rather be simply c (or another valid CLTS segment).

LinguList commented 3 years ago

There are furthermore about 750 forms with phonotactic properties which I consider problematic (e.g., for lingpy alignments): they have repeated _ _ word breaks, repeated + + morpheme breaks, or initial or final breaks. Since such breaks are typically used to indicate that something is a clitic, which is word-form-external information, this would be better handled in an extra column (see the sketch below).

segments.txt
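
The kind of check behind this list can be sketched in a few lines; the file path and column names are the usual CLDF ones and assumed to match this dataset:

```python
from csvw.dsv import reader

BOUNDARIES = {"_", "+"}  # word break and morpheme break

for row in reader("cldf/forms.csv", dicts=True):
    segments = row["Segments"].split()
    if not segments:
        continue
    # Flag repeated boundary markers as well as markers at either edge.
    repeated = any(
        a in BOUNDARIES and b in BOUNDARIES
        for a, b in zip(segments, segments[1:])
    )
    if repeated or segments[0] in BOUNDARIES or segments[-1] in BOUNDARIES:
        print("problematic boundaries:", row["ID"], " ".join(segments))
```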

LinguList commented 3 years ago

Last but not least, here are the modified concept sets (comments welcome; I noticed I had some problems in my mappings, but have now gone over them carefully again).

| English | Concepticon ID (old) | Concepticon ID (new) | Concepticon Gloss (new) |
| --- | --- | --- | --- |
| tokay gecko | 2355 | | |
| return | 142 | 581 | COME BACK |
| know | 1410 | 3626 | KNOW |
| cuscus | 470 | | |
| one hundred thousand | 2078 | 3532 | ONE HUNDRED THOUSAND |
| traditional house | 1252 | | |
| hide | 602 | 2486 | HIDE |
| louse in hair; head louse; mother louse | 1392 | 310 | HEAD LOUSE |
| day | 1260 | 1225 | DAY (NOT NIGHT) |
| smell | 1586 | 2124 | SMELL |
| 2sg; 2pl | ? | ? | ? |
| to chase away, to expel | 30 | | |
| think | 1415 | 2271 | THINK |
| penalty | 1196 | | |
| thin (non-human) | 2307 | 2308 | THIN |
| command; order | 1128 | 1998 | COMMAND |
| nephew; niece | 173 | 3890 | NEPHEW OR NIECE |
| below | 2094 | 1485 | BELOW OR UNDER |
| hit (drum) | 11 | | |
| search for; to hunt for | 1468 | | |
| skewer | 398 | | |
| father's sister | 170 | 2691 | PATERNAL AUNT (FATHER'S SISTER) |
| ridge; ridgepole; peak; tip | 280 | 1748 | RIDGE |
| rice grain head | 2749 | | |
| scared | 1419 | 3033 | SCARED |
| burn (clear land) | 141 | 3539 | BURN LAND |
| mother's brother | 1984 | 2692 | MATERNAL UNCLE (MOTHER'S BROTHER) |
| above | 2379 | 1741 | ABOVE |
| rule; govern | 382 | 1846 | RULE |
| fall over | 1280 | 2894 | TUMBLE (FALL DOWN) |
| blow | 176 | 175 | BLOW (OF WIND) |
| (coral) reef | 660 | | |
| dry (in sun) | 2015 | 3364 | DRY IN SUN |
| finished | 1766 | | |
| husked rice; uncooked rice | 926 | 3289 | UNCOOKED RICE |
| chase; run after | 1085 | | |
| betel vine | 117 | 177 | BETEL PEPPER VINE |
| sweat | 125 | 2458 | PERSPIRE OR SWEAT |
| leech | 949 | 2273 | LEECH |
| sleepy | 1757 | 3620 | SLEEPY |
| carry | 413 | 700 | CARRY |
| coconut shell | 2649 | | |

LinguList commented 3 years ago

And a last remark: since the distinction between word boundary and morpheme boundary is a characteristic of the morpheme relations rather than of the morpheme form itself, I'd also suggest using only one character for separation. I was of a different opinion some time ago, but given the problems this has at times created in lingpy's alignments, I now support using one separator only. This is, however, not in any form required by CLDF.
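
A sketch of the suggested normalization, with the boundary types moved into a separate (hypothetical) column:

```python
# Collapse word breaks ("_") into plain morpheme breaks ("+") and keep
# the original boundary types aside, e.g. for a separate CLDF column.
def normalize(segments):
    boundary_types = [s for s in segments if s in ("_", "+")]
    normalized = ["+" if s == "_" else s for s in segments]
    return normalized, boundary_types

segments = "p u tç a _ r e s i n _ s a".split()
normalized, kinds = normalize(segments)
# normalized uses "+" only; kinds records that both breaks were word breaks.
```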

Anaphory commented 3 years ago

So I think the fixes above (replacing the invalid tç segments and updating the Concepticon mappings) are easy to do. They are, reasonably, bugfixes or added features, so they would go through with a minor version or a patch number bump.

As for the comments about the morpheme boundaries (unify them, and denote their nature in a separate column: affix, clitic, separate word; circumfix or not), I'm hesitant to make that change for now. It would be a major change to the data, warranting a new major version, and such a column is not part of CLDF yet, so the change would lose information that is currently, in principle, accessible to CLDF-compatible software.

I'm still not sure in which form the Concepticon and CLTS validators would be needed. The wanghmongmien dataset has no such tests in its lexibank entry point, and its test.py only validates the dataset itself, independently of Concepticon and CLTS.

LinguList commented 3 years ago

I completely understand regarding the morpheme boundaries; it is easy to handle this in a few lines of code when doing alignments and the like, and it is, as you point out, not part of any CLDF requirement.