Duplicate rows

LEK85 commented 1 year ago

Hi, I found several duplicated rows in cldf/forms.csv. Ignoring ID and Local_ID, there are 20,076 rows in which the remaining 15 columns coincide exactly with at least one other row. I attach the IDs for these cases: duplicate.ids.csv
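For anyone wanting to reproduce this kind of check, here is a minimal sketch using only the Python standard library. It is not the script used for the counts in this thread. The helper name `find_duplicate_ids` and the three-row sample are made up for illustration, and the sample carries only a few of the columns a real cldf/forms.csv has.

```python
import csv
import io
from collections import defaultdict

# Columns excluded from the comparison, as described above.
IGNORED = {"ID", "Local_ID"}


def find_duplicate_ids(rows, ignored=IGNORED):
    """Return the IDs of rows that agree with at least one other row
    on every column outside `ignored` (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        # Key each row on all remaining columns, in a stable order.
        key = tuple((k, v) for k, v in sorted(row.items()) if k not in ignored)
        groups[key].append(row["ID"])
    return [i for ids in groups.values() if len(ids) > 1 for i in ids]


# Made-up sample standing in for cldf/forms.csv.
sample = io.StringIO(
    "ID,Local_ID,Language_ID,Parameter_ID,Form\n"
    "a-1,1,lang1,tree,ba\n"
    "a-2,2,lang1,tree,ba\n"
    "a-3,3,lang1,water,ti\n"
)
duplicate_ids = find_duplicate_ids(csv.DictReader(sample))
print(duplicate_ids)
```

On the real file one would pass `csv.DictReader(open("cldf/forms.csv", encoding="utf-8"))` instead of the sample. Every member of a duplicate group is reported, which matches counting rows that coincide "with at least one other row".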

LinguList commented 1 year ago

Can you provide the datasets here? These are the sources of duplications, not lexibank. Yangyi, for example, has duplicates in the original, which we'd need to filter out in our lexibank code.

chrzyki commented 1 year ago

halenepal was a systematic issue with lexemes/forms being added twice (fixed here https://github.com/lexibank/halenepal/commit/8b9689232ece058be986a4c304babdaf0318cd51). Should I check the other datasets (except for Yangyi), @LinguList?

LEK85 commented 1 year ago

It was mostly halenepal (Hale1973), but there are 38 other problematic sources. Attaching the full dataset of duplications, including Source. Here are the frequencies of identical rows per source:

n	Dataset	Freq
1	halenepal	11338
2	transnewguineaorg	3227
3	polyglottaafricana	2409
4	yangyi	973
5	abvdoceanic	669
6	zgraggenmadang	296
7	marrisonnaga	280
8	bodtkhobwa	157
9	naganorgyalrongic	155
10	savelyevturkic	102
11	hubercolumbian	66
12	kleinewillinghoeferbikwinjen	66
13	bowernpny	48
14	blustaustronesian	44
15	kraftchadic	38
16	wold	38
17	northeuralex	30
18	peirosaustroasiatic	28
19	sagartst	14
20	wangbai	14
21	gerarditupi	12
22	utoaztecan	12
23	bantubvd	10
24	yuchinese	10
25	clarkkimmun	4
26	crossandean	4
27	sidwellbahnaric	4
28	starostinpie	4
29	walworthpolynesian	4
30	castrozhuang	2
31	chenhmongmien	2
32	dunnielex	2
33	gaotb	2
34	leejaponic	2
35	lindseyende	2
36	liusinitic	2
37	lundgrenomagoa	2
38	syrjaenenuralic	2
39	visserkalamang	2

duplicates.csv
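A per-dataset tally like the one above can be sketched as follows. This is only an illustration: the column name `Dataset` is an assumption about the attached file (it may instead be called Source), and the sample rows are made up.

```python
import csv
import io
from collections import Counter


def count_by_dataset(rows, column="Dataset"):
    """Tally rows per dataset, most frequent first.
    `column` is an assumed column name, adjust it to the actual file."""
    return Counter(row[column] for row in rows).most_common()


# Made-up rows standing in for duplicates.csv.
sample = io.StringIO(
    "ID,Dataset\n"
    "x-1,halenepal\n"
    "x-2,halenepal\n"
    "x-3,halenepal\n"
    "y-1,yangyi\n"
    "y-2,yangyi\n"
)
counts = count_by_dataset(csv.DictReader(sample))
for dataset, freq in counts:
    print(dataset, freq, sep="\t")
```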

chrzyki commented 6 months ago

Revisiting this issue. Spot-checking so far doesn't seem to reveal any big systematic issues (as in halenepal), but I'll still have a closer look. Here are updated counts for the latest release (excluding halenepal):

Dataset	Count
transnewguineaorg	3359
polyglottaafricana	2409
yanglalo	1228
abvdoceanic	1065
bodtkhobwa	732
yangyi	702
marrisonnaga	348
naganorgyalrongic	279
kleinewillinghoeferbikwinjen	102
savelyevturkic	88
zgraggenmadang	81
hubercolumbian	76
bowernpny	48
wold	44
blustaustronesian	44
northeuralex	42
kraftchadic	36
peirosaustroasiatic	26
utoaztecan	12
gerarditupi	12
sagartst	12
bantubvd	10
wangbai	8
sidwellbahnaric	4
castrozhuang	4
suntb	4
crossandean	4
dunnielex	2
visserkalamang	2
lindseyende	2
chenhmongmien	2
syrjaenenuralic	2
lundgrenomagoa	2
leejaponic	2
robinsonap	2
chindialectsurvey	2
gaotb	2

And here are all the cases (including an additional column Dataset):

multiple_rows.csv

LinguList commented 6 months ago

I suggest checking the datasets with more than 1,000 duplicate rows.

chrzyki commented 6 months ago

Yanglalo essentially consists of many cases like the following (or very similar ones), which add up to the more than 1,200 cases in total:

Forms that exist once (e.g. willow tree) in the source exist multiple times in the raw/ data:

yanglalo_src

Each entry has its own COGID (with only the respective concept as its single member). These three identical lines in the raw/ data become 7 languages × 3 repeated concepts = 21 total entries in this case:

yanglalo_cldf

I'm wondering:

Are the raw/ data correct (in terms of number of (repeated) entries)?
Are the cognate IDs meaningful/correct? It seems as if they're just a running number/ID.

LinguList commented 6 months ago

The data are corrupted. I am just checking against the PDF, where it is clear that this is not intended.

Then, there's the online supplement: https://opal.latrobe.edu.au/articles/thesis/Lalo_regional_varieties_phylogeny_dialectometry_and_sociolinguistics/21844209?file=38767287

There we find the solution: duplicate rows are due to an explicit representation of Proto-Lalo forms.

LinguList commented 6 months ago

There are as many duplicate rows as there are morphemes in a given Proto-Lalo form. So we can probably ignore the duplicates explicitly here by checking whether the first, the Proto-Lalo form, was already visited.
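A minimal sketch of that "already visited" check, under stated assumptions: the function name `skip_visited` and the column name `PROTO_LALO` are hypothetical, the rows are made up, and this is not the actual lexibank code for yanglalo.

```python
def skip_visited(rows, key="PROTO_LALO"):
    """Yield each row only the first time its Proto-Lalo form is seen.
    `key` is a hypothetical column name for the Proto-Lalo form."""
    seen = set()
    for row in rows:
        if row[key] in seen:
            continue  # repeated line for an already visited form
        seen.add(row[key])
        yield row


# Made-up raw rows: the first form is repeated three times.
raw_rows = [
    {"PROTO_LALO": "form-a", "GLOSS": "willow tree"},
    {"PROTO_LALO": "form-a", "GLOSS": "willow tree"},
    {"PROTO_LALO": "form-a", "GLOSS": "willow tree"},
    {"PROTO_LALO": "form-b", "GLOSS": "water"},
]
kept = list(skip_visited(raw_rows))
print(len(kept))
```

Keying on the form alone would also drop a distinct concept that happens to share the same Proto-Lalo form, so in practice the key may need to combine the form with the concept.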

LinguList commented 6 months ago

We should probably switch to the original data with the link shared above.

chrzyki commented 6 months ago

Thanks for confirming!

chrzyki commented 3 months ago

polyglottaafricana is good to go (see https://github.com/lexibank/polyglottaafricana/pull/10). I will release it in a bit.

chrzyki commented 2 months ago

With the update to yanglalo, a couple of datasets with major 'repetition' should be fixed, and I propose to double-check this list again once we've got an RC for Lexibank 2.0.

FredericBlum commented 5 days ago

Here comes the updated list:

count	dataset
1268	transnewguineaorg
800	idssegmented
776	tls
305	abvdoceanic
288	bodtkhobwa
288	huntergatherer
271	gravinachadic
161	marrisonnaga
150	heathdogon
135	naganorgyalrongic
105	yanglalo
80	halenepal
61	abvdphilippines
39	savelyevturkic
38	chenhmongmien
38	chindialectsurvey
34	lairgyalrong
34	zgraggenmadang
31	hubercolumbian
29	kochtukanoan
27	northeuralex
23	bowernpny
22	baf2
22	wold
21	keypano
18	kraftchadic
17	blustaustronesian
12	lundgrenomagoa
7	peirosaustroasiatic
6	polyglottaafricana
4	sagartst
4	tryonsolomon
4	yangyi
3	bantubvd
3	oskolskayatungusic
3	othanieljen
3	utoaztecan
3	wangbai
2	castrozhuang
2	gerarditupi
2	sidwellbahnaric
1	carvalhopurus
1	chacolanguages
1	dunnielex
1	leejaponic
1	lindseyende
1	mixtecansubgrouping
1	syrjaenenuralic
1	visserkalamang

We should perhaps still look at the datasets with more than 100 duplicate rows.

lexibank / lexibank-analysed

Duplicate rows #49