Closed: LinguList closed this issue 2 years ago.
I think it would be marginally confusing to have https://github.com/lessersunda/lexirumah-data/blob/with_lexi_data/pylexirumah/lexibank.py and also a separate repository for lexibank. If you put a PR through for this repository here, I'll happily merge it. It also increases the chance that it gets updated if editing and expanding LexiRumah is taken up again in the future; there is a possibility for that.
Please check that the `h` is the last letter wherever you refer to this dataset: LexiRumah, not LexiRuhma.
Alright! This would overwrite the code in `cldf/`. Do you have any specific requirements, or are you fine with the most recent CLDF release?
And one more question: the current dump of the data is in `raw/forms.csv`, right?
No, we have been working with CLDF throughout instead of other, ad-hoc formats, so the current version of the data (9 months old) is in `cldf/forms.csv`, with everything else CLDF being in there. I must admit I don't remember why there is still data in `raw/`. The files seem to be largely identical, apart from two commits in 2019 that made them diverge.
Did you have a look at the existing lexibank entry point? It very much formalizes the policy of “this is a CLDF dataset already, just make it available”.
Where do I find the lexibank entry point? All that I think would need to be done is (1) include Concepticon mappings for version 2.5, and (2) add mappings for those sounds which are not strict CLTS, i.e. an additional orthography profile. Maybe it is then enough to just modify the entry point, but I did not find it when checking the code from GitHub's web interface.
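For illustration, an orthography profile is essentially a grapheme-to-segment mapping applied longest-match-first. Here is a minimal stand-in sketch (the function and the profile entries are mine, just to show the mechanics; real profiles are tab-separated files consumed by the `segments` package):

```python
def apply_profile(form, profile):
    """Tokenize a form with an orthography profile, longest grapheme first.

    `profile` maps graphemes to CLTS-conformant segments; characters not
    covered by the profile are passed through with a leading '?' marker.
    """
    segments = []
    i = 0
    graphemes = sorted(profile, key=len, reverse=True)  # longest match wins
    while i < len(form):
        for g in graphemes:
            if form.startswith(g, i):
                segments.append(profile[g])
                i += len(g)
                break
        else:
            segments.append("?" + form[i])
            i += 1
    return segments

# Hypothetical profile entries covering the sounds discussed in this thread:
profile = {"tç": "tɕ", "p": "p", "u": "u", "a": "a"}
print(apply_profile("putça", profile))  # ['p', 'u', 'tɕ', 'a']
```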
It's actually the file I linked to above: https://github.com/lessersunda/lexirumah-data/blob/with_lexi_data/pylexirumah/lexibank.py
I just had a look, and it is not straightforward to use our data check routines on top of the entry point, as the entry point assumes a different structure (an `etc/` directory for languages, orthography profiles, etc.), so it is not trivial to run the checks. All I could do is provide two scripts: one checks the data mappings against the actual Concepticon, and one checks against CLTS, so that the sound conversions in the orthography profile are in line with other datasets and can be directly compared with Phoible and the like.
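A sketch of what such a CLTS segment check could look like. A plain set of valid segments stands in here for a real CLTS transcription system (with pyclts one would instead test whether `bipa[segment]` resolves to a known sound); the column names follow CLDF conventions, and the inventory is a made-up excerpt:

```python
from collections import defaultdict

def find_invalid_segments(rows, valid_sounds, breaks=("_", "+")):
    """Return {form_id: [offending segments]} for segments not in valid_sounds.

    Word breaks ('_') and morpheme breaks ('+') are skipped, as they are
    markers rather than sounds.
    """
    problems = defaultdict(list)
    for row in rows:
        for seg in row["Segments"].split():
            if seg in breaks or seg in valid_sounds:
                continue
            problems[row["ID"]].append(seg)
    return dict(problems)

# Tiny inventory standing in for CLTS/BIPA; 'tç' is not a valid CLTS sound:
inventory = {"p", "u", "a", "r", "e", "s", "i", "n", "tɕ"}
rows = [{"ID": "lexirumah-urua1244-eleven-1",
         "Segments": "p u tç a _ r e s i n _ s a tç"}]
print(find_invalid_segments(rows, inventory))
# {'lexirumah-urua1244-eleven-1': ['tç', 'tç']}
```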
> I just had a look and it is not straightforward to use our data check routines on top of the entry point, as the entry point assumes a different structure (an `etc/` directory for languages and orthography profiles, etc.), so it is not trivial to run the checks.
I'm sure we could add an `etc` file (or folder) to the lexirumah-data repository, and I still think that would be the more robust solution, considering potential future extensions, compared to keeping them in a completely separate repository.
But I have some difficulties understanding the context. Presumably, we already have Concepticon mappings and segments that conform to an older version of CLTS in `concepts.csv` and `forms.csv` respectively, and languages in `lects.csv`. The point of CLDF is to have a standard structure for datasets, after all. I would consider any lack of conformity a bug of the LexiRumah dataset, which should be fixed right here. Is there a conceptual problem with that?
> All I could do is provide two scripts: one checks the data mappings against the actual Concepticon, and one checks against CLTS, so that the sound conversions in the orthography profile are in line with other datasets and can be directly compared with Phoible and the like.
It makes sense to add the checks that forms and concepts really conform to CLTS and Concepticon (instead of just promising that they are, without checks) to this repository, but I don't understand why this is a problem. Whether a future pull request contains changes to one file or three doesn't make much of a difference.
Can you point me to another CLDF dataset example which might help me understand the problem and what needs to be done?
The concept lists are integrated via cldfbench and specified in the metadata file of a wordlist repository (as in the lexibank organization): https://github.com/lexibank/wanghmongmien/
Versions are then also written to the cldf/cldf-metadata.json, so we know which version of Concepticon was used.
With CLTS, this is similar (version specified), and we have ways to check for prosodic problems (segmentations that lead to empty morphemes, for example).
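For instance, the recorded versions can be read back from `cldf/cldf-metadata.json`. The provenance layout sketched below (`prov:wasDerivedFrom` entries carrying `dc:title` and `dc:created`) follows what pylexibank-built datasets typically contain, so treat the exact keys as an assumption:

```python
def repository_versions(metadata):
    """Map repository titles (e.g. 'Concepticon', 'CLTS') to recorded versions,
    assuming pylexibank-style 'prov:wasDerivedFrom' provenance entries in the
    parsed cldf-metadata.json dict."""
    return {e.get("dc:title"): e.get("dc:created")
            for e in metadata.get("prov:wasDerivedFrom", [])}

# Hypothetical metadata fragment, as json.load() would return it:
md = {"prov:wasDerivedFrom": [
    {"dc:title": "Concepticon", "dc:created": "v2.4.0"},
    {"dc:title": "CLTS", "dc:created": "v2.1.0"}]}
print(repository_versions(md))  # {'Concepticon': 'v2.4.0', 'CLTS': 'v2.1.0'}
```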
But if you have ways to check without using cldfbench, this whole issue can just be closed. Then I have simply informed you with this issue that an updated version is already in Concepticon's master and can be used in the future, provided the concepts don't change; it has the advantage of being curated by many people and regularly updated.
The identifier of the list is Klamer-2018-607. If you have any requests for modifications, or other suggestions, these are also welcome. We will probably release the next version of Concepticon next week.
I just checked the segments with CLTS; it turns out there are only two cases:

```
------------------------------ ------------------------------------ --
lexirumah-urua1244-eleven-1    p u tç a _ r e s i n _ s a tç
lexirumah-urua1244-twelve-1    p u tç a _ r e s i n _ n u a tç
lexirumah-urua1244-eighteen-1  p u tç a _ r e s i n _ t e r i n u a tç
lexirumah-urua1244-fifteen-1   p u tç a _ r e s i _ n i m a tç
lexirumah-urua1244-fourteen-1  p u tç a _ r e s i n _ f a t tç
lexirumah-urua1244-nineteen-1  p u tç a _ r e s i n _ s a p u t i tç
lexirumah-urua1244-seventeen-1 p u tç a _ r e s i n _ t a r a n s a tç
lexirumah-urua1244-sixteen-1   p u tç a _ r e s i _ n e m tç
lexirumah-urua1244-thirteen-1  p u tç a _ r e s i n _ t e n i tç
lexirumah-maib1239-salt-2      p o - k a s -
------------------------------ ------------------------------------ --
```
The `tç` should rather simply be `c` or `tɕ`.
There are furthermore about 750 forms with phonotactic properties that I consider problematic (e.g., for LingPy alignments): they have repeated `_ _` word breaks, repeated `+ +` morpheme breaks, or initial or final breaks. Such patterns are typically used to indicate that something is a clitic, but that is word-form-external information better handled in an extra column.
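The phonotactic screening described here can be sketched as a small check over tokenized segment strings (the marker conventions follow the thread; the function name and example form are mine):

```python
def break_problems(segments, breaks=("_", "+")):
    """List phonotactic problems with word ('_') and morpheme ('+') breaks:
    breaks at the start or end of a form, or two breaks in a row."""
    toks = segments.split()
    problems = []
    if toks and toks[0] in breaks:
        problems.append("initial break")
    if toks and toks[-1] in breaks:
        problems.append("final break")
    problems += [f"repeated break: {a} {b}"
                 for a, b in zip(toks, toks[1:])
                 if a in breaks and b in breaks]
    return problems

# Hypothetical problematic form, illustrating two of the patterns at once:
print(break_problems("_ p u + + a"))
# ['initial break', 'repeated break: + +']
```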
Last but not least, here are the modified concept sets (comments welcome; I noticed some problems in my mappings, but have now gone over them carefully again).
English | Concepticon ID (OLD) | Concepticon ID (New) | Concepticon Gloss (New) |
---|---|---|---|
tokay gecko | 2355 | ||
return | 142 | 581 | COME BACK |
know | 1410 | 3626 | KNOW |
cuscus | 470 | ||
one hundred thousand | 2078 | 3532 | ONE HUNDRED THOUSAND |
traditional house | 1252 | ||
hide | 602 | 2486 | HIDE |
louse in hair; head louse; mother louse | 1392 | 310 | HEAD LOUSE |
day | 1260 | 1225 | DAY (NOT NIGHT) |
smell | 1586 | 2124 | SMELL |
2sg; 2pl | ? | ? | ? |
to chase away, to expel | 30 | ||
think | 1415 | 2271 | THINK |
penalty | 1196 | ||
thin (non-human) | 2307 | 2308 | THIN |
command; order | 1128 | 1998 | COMMAND |
nephew; niece | 173 | 3890 | NEPHEW OR NIECE |
below | 2094 | 1485 | BELOW OR UNDER |
hit (drum) | 11 | ||
search for; to hunt for | 1468 | ||
skewer | 398 | ||
father's sister | 170 | 2691 | PATERNAL AUNT (FATHER'S SISTER) |
ridge; ridgepole; peak; tip | 280 | 1748 | RIDGE |
rice grain head | 2749 | ||
scared | 1419 | 3033 | SCARED |
burn (clear land) | 141 | 3539 | BURN LAND |
mother's brother | 1984 | 2692 | MATERNAL UNCLE (MOTHER'S BROTHER) |
above | 2379 | 1741 | ABOVE |
rule; govern | 382 | 1846 | RULE |
fall over | 1280 | 2894 | TUMBLE (FALL DOWN) |
blow | 176 | 175 | BLOW (OF WIND) |
(coral) reef | 660 | ||
dry (in sun) | 2015 | 3364 | DRY IN SUN |
finished | 1766 | ||
husked rice; uncooked rice | 926 | 3289 | UNCOOKED RICE |
chase; run after | 1085 | ||
betel vine | 117 | 177 | BETEL PEPPER VINE |
sweat | 125 | 2458 | PERSPIRE OR SWEAT |
leech | 949 | 2273 | LEECH |
sleepy | 1757 | 3620 | SLEEPY |
carry | 413 | 700 | CARRY |
coconut shell | 2649 |
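Applying the remapping in the table above to `concepts.csv` could look like the sketch below. Only a few rows from the table are excerpted, column names follow CLDF conventions, and I read the rows without a new ID as mappings to be removed, which is my assumption:

```python
# Old -> new Concepticon IDs, excerpted from the table above:
REMAP = {"142": "581", "1410": "3626", "1392": "310", "1260": "1225"}
# IDs whose mapping has no replacement in the table (assumed: drop it):
DROP = {"2355", "470", "1252"}

def remap_concepts(rows):
    """Rewrite Concepticon_ID in concept rows according to REMAP/DROP."""
    out = []
    for row in rows:
        row = dict(row)  # do not mutate the caller's rows
        cid = row["Concepticon_ID"]
        row["Concepticon_ID"] = "" if cid in DROP else REMAP.get(cid, cid)
        out.append(row)
    return out

rows = [{"ID": "return", "Concepticon_ID": "142"},
        {"ID": "cuscus", "Concepticon_ID": "470"}]
print(remap_concepts(rows))
# [{'ID': 'return', 'Concepticon_ID': '581'}, {'ID': 'cuscus', 'Concepticon_ID': ''}]
```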
And a last remark: since the distinction between word boundary and morpheme boundary is also a characteristic of the morpheme relations rather than of the morpheme form itself, I'd suggest using only one separator character. I was of a different opinion some time ago, but given the problems this created in LingPy's alignments at times, I now support a single separator. Still, this is in no way required by CLDF.
So I think the following are easy to do:

- In `maib1239-salt-2`, replace the `-` with a `+`.
- In `concepts.csv`, replace old Concepticon_IDs with better ones. (I think "blow" should be kept at 176, not 175; I believe it comes in the context of bodily actions.)

They are, reasonably, bugfixes or added features, so they would go through with a minor version or a patch number.
For the comments about the morpheme boundaries (unify them, and denote in a separate column what their nature is – affix, clitic, separate word; circumfix or not) I'm hesitant to change that for now. It would be a major change of the data, warranting a new major version, and also such a column is not part of CLDF yet, so it would lose information that is currently accessible to CLDF-compatible software in principle.
I'm still not sure in which API format the validators for Concepticon and CLTS would be needed. The wanghmongmien dataset has no such tests in its lexibank entry point, and its `test.py` only validates the dataset, independently of Concepticon and CLTS.
I completely understand regarding the morpheme boundaries; it is easy to handle this in a few lines of code when doing alignments and the like, and it is, as you point out, not part of any CLDF requirement.
We will add the concept list to Concepticon 2.5 (the PR was just merged), and since this may change the data (our links to Concepticon include more recently added concepts), it would be useful to do the CLDF conversion with cldfbench/pylexibank, to make sure that the data also runs through our quality checks.
If you allow, @Anaphory, I could either add an extra repository, or make a PR where the data from `raw/` is processed with the help of cldfbench. An extra repository may be easier, as it preserves the unified access to the CLDF data, but it is in fact not really needed; one would just add the `cldfbench_lexirumah.py` script in the main directory.