lexibank / lexibank-analysed

Study on lexibank data (presenting the lexibank dataset).
Creative Commons Attribution 4.0 International

Duplicate rows #49

Open LEK85 opened 1 year ago

LEK85 commented 1 year ago

Hi, I found several duplicated rows in cldf/forms.csv. Ignoring ID and Local_ID, there are 20,076 rows in which the remaining 15 columns coincide exactly with at least one other row. I attach the IDs for these cases: duplicate.ids.csv
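A check of this kind could be sketched as follows. This is only an illustration using the Python standard library, not necessarily how the attached list was produced; the function name `find_duplicate_ids` and the toy rows are hypothetical, and only the column names ID and Local_ID come from the report above.

```python
import csv
import io
from collections import defaultdict

IGNORED = ("ID", "Local_ID")  # columns excluded from the comparison


def find_duplicate_ids(rows, ignored=IGNORED):
    """Return IDs of rows that match at least one other row on all non-ignored columns."""
    groups = defaultdict(list)
    for row in rows:
        # Build a hashable key from every column except the ignored ones.
        key = tuple((col, val) for col, val in sorted(row.items()) if col not in ignored)
        groups[key].append(row["ID"])
    return [id_ for ids in groups.values() if len(ids) > 1 for id_ in ids]


# Toy stand-in for cldf/forms.csv (hypothetical rows, not real data).
sample = io.StringIO(
    "ID,Local_ID,Language_ID,Parameter_ID,Form\n"
    "a-1,1,lang1,tree,abc\n"
    "a-2,2,lang1,tree,abc\n"
    "a-3,3,lang1,water,xyz\n"
)
duplicate_ids = find_duplicate_ids(csv.DictReader(sample))
```

For the real file, the same function could be called on `csv.DictReader(open("cldf/forms.csv", encoding="utf-8", newline=""))`.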

LinguList commented 1 year ago

Can you provide the datasets here? These are the sources of duplications, not lexibank. Yangyi, for example, has duplicates in the original, which we'd need to filter out in our lexibank code.

chrzyki commented 1 year ago

halenepal had a systematic issue with lexemes/forms being added twice (fixed here: https://github.com/lexibank/halenepal/commit/8b9689232ece058be986a4c304babdaf0318cd51). Should I check the other datasets (except for Yangyi), @LinguList?

LEK85 commented 1 year ago

It was mostly halenepal (Hale1973), but there are 38 other problematic sources. I am attaching the full dataset of duplications, including Source. Here are the frequencies of identical rows per source:

n Dataset Freq
1 halenepal 11338
2 transnewguineaorg 3227
3 polyglottaafricana 2409
4 yangyi 973
5 abvdoceanic 669
6 zgraggenmadang 296
7 marrisonnaga 280
8 bodtkhobwa 157
9 naganorgyalrongic 155
10 savelyevturkic 102
11 hubercolumbian 66
12 kleinewillinghoeferbikwinjen 66
13 bowernpny 48
14 blustaustronesian 44
15 kraftchadic 38
16 wold 38
17 northeuralex 30
18 peirosaustroasiatic 28
19 sagartst 14
20 wangbai 14
21 gerarditupi 12
22 utoaztecan 12
23 bantubvd 10
24 yuchinese 10
25 clarkkimmun 4
26 crossandean 4
27 sidwellbahnaric 4
28 starostinpie 4
29 walworthpolynesian 4
30 castrozhuang 2
31 chenhmongmien 2
32 dunnielex 2
33 gaotb 2
34 leejaponic 2
35 lindseyende 2
36 liusinitic 2
37 lundgrenomagoa 2
38 syrjaenenuralic 2
39 visserkalamang 2

duplicates.csv
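A per-source frequency table like the one above could be computed along these lines. This is a minimal sketch; the column name `Dataset`, the function name `count_by_dataset`, and the toy rows are assumptions made for illustration, since the layout of the attached file is not shown in this thread.

```python
import csv
import io
from collections import Counter


def count_by_dataset(rows, column="Dataset"):
    """Count rows per dataset and return (dataset, count) pairs, most frequent first."""
    return Counter(row[column] for row in rows).most_common()


# Toy stand-in for a file of duplicated rows (hypothetical rows, assumed column name).
sample = io.StringIO(
    "ID,Dataset\n"
    "r1,halenepal\n"
    "r2,halenepal\n"
    "r3,yangyi\n"
    "r4,halenepal\n"
    "r5,yangyi\n"
)
frequencies = count_by_dataset(csv.DictReader(sample))
```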

chrzyki commented 6 months ago

Revisiting this issue. Spot-checking so far doesn't seem to reveal any big systematic issues (as in halenepal), but I'll still have a closer look. Here are updated counts for the latest release (excluding halenepal):

Dataset Count
transnewguineaorg 3359
polyglottaafricana 2409
yanglalo 1228
abvdoceanic 1065
bodtkhobwa 732
yangyi 702
marrisonnaga 348
naganorgyalrongic 279
kleinewillinghoeferbikwinjen 102
savelyevturkic 88
zgraggenmadang 81
hubercolumbian 76
bowernpny 48
wold 44
blustaustronesian 44
northeuralex 42
kraftchadic 36
peirosaustroasiatic 26
utoaztecan 12
gerarditupi 12
sagartst 12
bantubvd 10
wangbai 8
sidwellbahnaric 4
castrozhuang 4
suntb 4
crossandean 4
dunnielex 2
visserkalamang 2
lindseyende 2
chenhmongmien 2
syrjaenenuralic 2
lundgrenomagoa 2
leejaponic 2
robinsonap 2
chindialectsurvey 2
gaotb 2

And here are all the cases (including an additional column Dataset):

multiple_rows.csv

LinguList commented 6 months ago

I suggest that the datasets with > 1000 cases should be checked.

chrzyki commented 6 months ago

Yanglalo essentially consists of many cases like the following (or very similar ones), which add up to the > 1200 total cases:

Forms that exist once (e.g. willow tree) in the source exist multiple times in the raw/ data:

[Image: yanglalo_src]

Each entry has its own COGID (with only the respective concept as its single member). These three identical lines in the raw/ data become 7 languages × 3 repeated concepts = 21 total entries in this case:

[Image: yanglalo_cldf]

I'm wondering:

LinguList commented 6 months ago

The data are corrupted. I am just checking against the PDF, where it is clear that this is not intended.

Then, there's the online supplement: https://opal.latrobe.edu.au/articles/thesis/Lalo_regional_varieties_phylogeny_dialectometry_and_sociolinguistics/21844209?file=38767287

There we find the solution: duplicate rows are due to an explicit representation of Proto-Lalo forms.

LinguList commented 6 months ago

There are as many duplicate rows as there are morphemes in a single Proto-Lalo form. So we can probably skip the duplicates explicitly here by checking whether the first element, the Proto-Lalo form, has already been visited.
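The "already visited" check could look roughly like this. It is a sketch under stated assumptions, not the actual yanglalo lexibank code: the field names `CONCEPT` and `PROTO_LALO`, the function name `drop_repeated_proto_rows`, and the placeholder forms are all hypothetical.

```python
def drop_repeated_proto_rows(rows, key="PROTO_LALO"):
    """Keep only the first row for each Proto-Lalo form, skipping later repeats."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] in seen:
            continue  # this Proto-Lalo form was visited already
        seen.add(row[key])
        kept.append(row)
    return kept


# Hypothetical raw rows: one form repeated three times, plus one unrelated form.
raw_rows = [
    {"CONCEPT": "willow tree", "PROTO_LALO": "*form-1"},
    {"CONCEPT": "willow tree", "PROTO_LALO": "*form-1"},
    {"CONCEPT": "willow tree", "PROTO_LALO": "*form-1"},
    {"CONCEPT": "water", "PROTO_LALO": "*form-2"},
]
unique_rows = drop_repeated_proto_rows(raw_rows)
```

If distinct concepts can share the same Proto-Lalo form, the key would need to combine the concept and the form instead of using the form alone.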

LinguList commented 6 months ago

We should probably switch to the original data with the link shared above.

chrzyki commented 6 months ago

Thanks for confirming!

chrzyki commented 3 months ago

polyglottaafricana is good to go, see: https://github.com/lexibank/polyglottaafricana/pull/10. Will release in a bit.

chrzyki commented 2 months ago

With the update to yanglalo, a couple of datasets with major 'repetition' should now be fixed. I propose double-checking this list again once we've got an RC for Lexibank 2.0.

FredericBlum commented 5 days ago

Here comes the updated list:

count dataset
1268 transnewguineaorg
800 idssegmented
776 tls
305 abvdoceanic
288 bodtkhobwa
288 huntergatherer
271 gravinachadic
161 marrisonnaga
150 heathdogon
135 naganorgyalrongic
105 yanglalo
80 halenepal
61 abvdphilippines
39 savelyevturkic
38 chenhmongmien
38 chindialectsurvey
34 lairgyalrong
34 zgraggenmadang
31 hubercolumbian
29 kochtukanoan
27 northeuralex
23 bowernpny
22 baf2
22 wold
21 keypano
18 kraftchadic
17 blustaustronesian
12 lundgrenomagoa
7 peirosaustroasiatic
6 polyglottaafricana
4 sagartst
4 tryonsolomon
4 yangyi
3 bantubvd
3 oskolskayatungusic
3 othanieljen
3 utoaztecan
3 wangbai
2 castrozhuang
2 gerarditupi
2 sidwellbahnaric
1 carvalhopurus
1 chacolanguages
1 dunnielex
1 leejaponic
1 lindseyende
1 mixtecansubgrouping
1 syrjaenenuralic
1 visserkalamang

We should perhaps still look at the datasets with more than 100 duplicate rows.
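Selecting the datasets above that threshold can be sketched as below. The counts are the first twelve entries of the list in this comment (all remaining datasets have fewer than 80 duplicate rows); the function name `datasets_to_check` is hypothetical.

```python
THRESHOLD = 100  # "more than 100 duplicate rows"

# First twelve entries of the updated list; the rest are below the threshold.
duplicate_counts = {
    "transnewguineaorg": 1268,
    "idssegmented": 800,
    "tls": 776,
    "abvdoceanic": 305,
    "bodtkhobwa": 288,
    "huntergatherer": 288,
    "gravinachadic": 271,
    "marrisonnaga": 161,
    "heathdogon": 150,
    "naganorgyalrongic": 135,
    "yanglalo": 105,
    "halenepal": 80,
}


def datasets_to_check(counts, threshold=THRESHOLD):
    """Return dataset names with more than `threshold` duplicate rows, largest first."""
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return [name for name, n in ranked if n > threshold]


to_check = datasets_to_check(duplicate_counts)
```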