Open LEK85 opened 1 year ago
Can you provide the datasets here? These are the sources of duplications, not lexibank. Yangyi, for example, has duplicates in the original, which we'd need to filter out in our lexibank code.
In `halenepal`, there was a systematic issue with lexemes/forms being added twice (fixed here: https://github.com/lexibank/halenepal/commit/8b9689232ece058be986a4c304babdaf0318cd51). Should I check the other datasets (except for Yangyi), @LinguList?
It was mostly `halenepal` (Hale1973), but there are 38 other problematic sources. Attaching the full dataset of duplications, including `Source`. Here are the frequencies of identical rows per source:
n | Dataset | Freq |
---|---|---|
1 | halenepal | 11338 |
2 | transnewguineaorg | 3227 |
3 | polyglottaafricana | 2409 |
4 | yangyi | 973 |
5 | abvdoceanic | 669 |
6 | zgraggenmadang | 296 |
7 | marrisonnaga | 280 |
8 | bodtkhobwa | 157 |
9 | naganorgyalrongic | 155 |
10 | savelyevturkic | 102 |
11 | hubercolumbian | 66 |
12 | kleinewillinghoeferbikwinjen | 66 |
13 | bowernpny | 48 |
14 | blustaustronesian | 44 |
15 | kraftchadic | 38 |
16 | wold | 38 |
17 | northeuralex | 30 |
18 | peirosaustroasiatic | 28 |
19 | sagartst | 14 |
20 | wangbai | 14 |
21 | gerarditupi | 12 |
22 | utoaztecan | 12 |
23 | bantubvd | 10 |
24 | yuchinese | 10 |
25 | clarkkimmun | 4 |
26 | crossandean | 4 |
27 | sidwellbahnaric | 4 |
28 | starostinpie | 4 |
29 | walworthpolynesian | 4 |
30 | castrozhuang | 2 |
31 | chenhmongmien | 2 |
32 | dunnielex | 2 |
33 | gaotb | 2 |
34 | leejaponic | 2 |
35 | lindseyende | 2 |
36 | liusinitic | 2 |
37 | lundgrenomagoa | 2 |
38 | syrjaenenuralic | 2 |
39 | visserkalamang | 2 |
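A per-source tally like the one above could be produced along these lines. This is only a sketch: it assumes the duplicated rows have already been collected into a list of dictionaries with a `Source` column, and the function name `duplicates_per_source` and the sample values are hypothetical.

```python
from collections import Counter


def duplicates_per_source(rows, column="Source"):
    """Count duplicated rows per source, most frequent first."""
    counts = Counter(row[column] for row in rows)
    # most_common() sorts by count in descending order.
    return counts.most_common()


# Hypothetical sample, not real Lexibank data.
sample_rows = [
    {"Source": "dataset_a"},
    {"Source": "dataset_b"},
    {"Source": "dataset_a"},
]
for rank, (source, freq) in enumerate(duplicates_per_source(sample_rows), start=1):
    print(rank, source, freq)
```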
Revisiting this issue. Spot-checking so far doesn't seem to reveal any big systematic issues (as in halenepal), but I'll still have a closer look. Here are updated counts for the latest release (excluding halenepal):
Dataset | Count |
---|---|
transnewguineaorg | 3359 |
polyglottaafricana | 2409 |
yanglalo | 1228 |
abvdoceanic | 1065 |
bodtkhobwa | 732 |
yangyi | 702 |
marrisonnaga | 348 |
naganorgyalrongic | 279 |
kleinewillinghoeferbikwinjen | 102 |
savelyevturkic | 88 |
zgraggenmadang | 81 |
hubercolumbian | 76 |
bowernpny | 48 |
wold | 44 |
blustaustronesian | 44 |
northeuralex | 42 |
kraftchadic | 36 |
peirosaustroasiatic | 26 |
utoaztecan | 12 |
gerarditupi | 12 |
sagartst | 12 |
bantubvd | 10 |
wangbai | 8 |
sidwellbahnaric | 4 |
castrozhuang | 4 |
suntb | 4 |
crossandean | 4 |
dunnielex | 2 |
visserkalamang | 2 |
lindseyende | 2 |
chenhmongmien | 2 |
syrjaenenuralic | 2 |
lundgrenomagoa | 2 |
leejaponic | 2 |
robinsonap | 2 |
chindialectsurvey | 2 |
gaotb | 2 |
And here are all the cases (including an additional column `Dataset`):
I suggest checking the datasets with more than 1000 cases in the data.
Yanglalo is essentially a lot of these (or very similar) cases (which sum up to the > 1200 total cases):
Forms that exist once (e.g. willow tree) in the source exist multiple times in the `raw/` data:
Each entry has its own COGID (with only the respective concept as its sole member). These three identical lines in the `raw/` data become 7 languages × 3 repeated concepts = 21 total entries in this case:
I'm wondering: are the `raw/` data correct (in terms of the number of (repeated) entries)?
The data are corrupted. I am just checking against the PDF, where it is clear that this is not intended.
Then, there's the online supplement: https://opal.latrobe.edu.au/articles/thesis/Lalo_regional_varieties_phylogeny_dialectometry_and_sociolinguistics/21844209?file=38767287
There we find the solution: duplicate rows are due to an explicit representation of Proto-Lalo forms.
There are as many duplicate rows as there are morphemes in one Proto-Lalo form. So we can probably skip the duplicates explicitly here by checking whether the first item, the Proto-Lalo form, was visited already.
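The "visited already" check could be sketched roughly as follows. This is an illustration only: the column name `proto_form`, the function name `drop_repeated_rows`, and the placeholder forms are all hypothetical, and the actual layout of the yanglalo raw data may differ.

```python
def drop_repeated_rows(rows, key="proto_form"):
    """Keep only the first row seen for each Proto-Lalo form."""
    visited = set()
    kept = []
    for row in rows:
        if row[key] in visited:
            continue  # this Proto-Lalo form was seen before: treat as a repeat
        visited.add(row[key])
        kept.append(row)
    return kept


# Hypothetical rows with placeholder forms: a two-morpheme proto-form
# that shows up twice, followed by an unrelated single-morpheme form.
rows = [
    {"proto_form": "*A-B", "gloss": "willow tree"},
    {"proto_form": "*A-B", "gloss": "willow tree"},
    {"proto_form": "*C", "gloss": "hand"},
]
print(len(drop_repeated_rows(rows)))
```

If two different concepts could share the same Proto-Lalo form, keying on the form together with the gloss would be a stricter variant of the same check.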
We should probably switch to the original data with the link shared above.
Thanks for confirming!
polyglottaafricana is good to go, see: https://github.com/lexibank/polyglottaafricana/pull/10 Will release in a bit.
With the update to yanglalo a couple of datasets with major 'repetition' should be fixed and I propose to double check this list again once we've got an RC for Lexibank 2.0.
Here comes the updated list:
count | dataset |
---|---|
1268 | transnewguineaorg |
800 | idssegmented |
776 | tls |
305 | abvdoceanic |
288 | bodtkhobwa |
288 | huntergatherer |
271 | gravinachadic |
161 | marrisonnaga |
150 | heathdogon |
135 | naganorgyalrongic |
105 | yanglalo |
80 | halenepal |
61 | abvdphilippines |
39 | savelyevturkic |
38 | chenhmongmien |
38 | chindialectsurvey |
34 | lairgyalrong |
34 | zgraggenmadang |
31 | hubercolumbian |
29 | kochtukanoan |
27 | northeuralex |
23 | bowernpny |
22 | baf2 |
22 | wold |
21 | keypano |
18 | kraftchadic |
17 | blustaustronesian |
12 | lundgrenomagoa |
7 | peirosaustroasiatic |
6 | polyglottaafricana |
4 | sagartst |
4 | tryonsolomon |
4 | yangyi |
3 | bantubvd |
3 | oskolskayatungusic |
3 | othanieljen |
3 | utoaztecan |
3 | wangbai |
2 | castrozhuang |
2 | gerarditupi |
2 | sidwellbahnaric |
1 | carvalhopurus |
1 | chacolanguages |
1 | dunnielex |
1 | leejaponic |
1 | lindseyende |
1 | mixtecansubgrouping |
1 | syrjaenenuralic |
1 | visserkalamang |
We should perhaps still look at the datasets with more than 100 duplicate rows.
Hi, I found several duplicated rows in `cldf/forms.csv`. Ignoring `ID` and `Local_ID`, there are 20,076 rows in which the remaining 15 columns coincide exactly with at least one other row. I attach the IDs for these cases: duplicate.ids.csv
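For anyone wanting to reproduce this kind of check, here is a minimal sketch using only the Python standard library. It assumes a CSV with `ID` and `Local_ID` columns as described above; the function name `find_duplicate_ids` and the tiny sample are hypothetical and not the script used to produce the attached file.

```python
import csv
import io
from collections import defaultdict


def find_duplicate_ids(csv_text, ignore=("ID", "Local_ID")):
    """Return the IDs of rows that match at least one other row
    on every column except those listed in `ignore`."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Build the comparison key from all remaining columns, in a fixed order.
        key = tuple(
            (col, val) for col, val in sorted(row.items()) if col not in ignore
        )
        groups[key].append(row["ID"])
    # Keep only groups that contain more than one row.
    return [id_ for ids in groups.values() if len(ids) > 1 for id_ in ids]


# Hypothetical sample, not real Lexibank data.
sample = """ID,Local_ID,Language_ID,Parameter_ID,Form
a-1,1,lang1,hand,lak
a-2,2,lang1,hand,lak
a-3,3,lang1,foot,kut
"""
print(find_duplicate_ids(sample))
```

For a real dataset, the file contents could be read with `open("cldf/forms.csv", encoding="utf-8").read()` and passed in as `csv_text`.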