reality check cognacy - possible extra cognates filled in relatively easily!

HedvigS commented 2 months ago

I checked the FormTable for two kinds of issues that could be found easily and presented to a human reviewer for improvement of ABVD cognacy. Some forms may be identical but shouldn't belong to the same cognacy class, and vice versa, so human reviewing is necessary. I'm presenting these instances for the ABVD-team to consider.

1. Forms without cognacy that are identical to other forms for the same concept which have assigned cognacy

Example:

Concept	Form_1	Cognacy_1	Form_2
hand	tangan	18	tangan
left	karuk	83	karuk
legfoot	au	80	au
legfoot	kuku	46	kuku

The forms that can be filled in like this amount to 8% of the entire dataset (24903 forms).

EDIT: I didn't cut down duplicate matches appropriately, so the real number is smaller (somewhere between 2,000 and 9,000), and @xrotwang and I get different numbers. If the ABVD-team wants to investigate this, I can spend more time fine-tuning the pattern-finding.

abvd_possible_matches.csv

2. Same form, same concept, but different cognacy classes

Example:

Language_ID	Form	Cognacy
134	sunu	1?
163	sunu	106
1353	sunu	4
1368	sunu	1

There are 661 concept-form matchings (e.g. '144_toburn' - 'sunu') where identical forms are assigned to different cognacy classes. If you also include cases of multiple cognacy (e.g. '1, 50'), there are 4910 of this kind.

abvd_different_cognate_same_form_excl_multiple.csv
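
As a rough illustration of the second check, here is a minimal Python sketch on toy data. The first four rows mirror the 'sunu' example; the 'tangan' rows, the concept ID '1_hand' and the function name `conflicting_forms` are made up for this sketch. Like the R script, it compares the raw Cognacy strings, so '1?' and '1' count as different classes.

```python
from collections import defaultdict

# Toy rows: (Language_ID, Parameter_ID, Form, Cognacy).
# The 'sunu' rows follow the example above; the 'tangan' rows are invented.
rows = [
    ("134", "144_toburn", "sunu", "1?"),
    ("163", "144_toburn", "sunu", "106"),
    ("1353", "144_toburn", "sunu", "4"),
    ("1368", "144_toburn", "sunu", "1"),
    ("1", "1_hand", "tangan", "18"),
    ("2", "1_hand", "tangan", None),
]


def conflicting_forms(rows, include_multiple=True):
    """Map (form, concept) to its cognacy labels, keeping pairs with more than one label."""
    labels = defaultdict(set)
    for _lang, concept, form, cognacy in rows:
        if cognacy is None:
            continue  # uncoded forms belong to the first check, not this one
        if not include_multiple and "," in cognacy:
            continue  # skip multiple cognacy such as '1, 50'
        labels[(form, concept)].add(cognacy)
    return {key: sorted(vals) for key, vals in labels.items() if len(vals) > 1}


conflicts = conflicting_forms(rows)
print(conflicts)
```

Only ('sunu', '144_toburn') is reported here; 'tangan' has a single assigned class, so it is not a conflict.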

R script for finding these instances:

library(tidyverse)
library(cluster)
library(reshape2)
library(stringdist)

forms <- read_csv("https://github.com/lexibank/abvd/raw/ccff2bc86c30b102cd5b95174fafb378ddc0d3eb/cldf/forms.csv", show_col_types = F)

unknown_cognacy <-  forms %>% 
  filter(is.na(Cognacy)) %>% 
  dplyr::select(ID_Var1= ID, Var1 = Form)  

known_cognacy <- forms %>% 
  filter(!is.na(Cognacy)) %>% 
  dplyr::select(Var2 = Form, Var2_Cognacy = Cognacy)  

percentage_unknown <- round(nrow(unknown_cognacy)/ (nrow(known_cognacy) + nrow(unknown_cognacy) ), digits = 2)

known_cognacy <- known_cognacy %>% distinct()
unknown_cognacy <- unknown_cognacy %>% distinct()

#make df to join to in loop
dist_full <- matrix(nrow = 0, ncol = 7) %>% 
  as.data.frame() %>% 
  rename("Var1" = V1, "Var2" = V2, "lv_dist"= V3, "Form_1" = V4, "Cognacy_1" = V5, "Form_2" = V6, "Cognacy_2" = V7) %>% 
  mutate_if(.predicate = is.logical, as.character) %>% 
  mutate(lv_dist = as.numeric(lv_dist))

##
#df to join info on the side of dists df
left <- forms %>%
  dplyr::select(Var1 = ID, Form_1 = Form, Cognacy_1 = Cognacy)
right <- forms %>%
  dplyr::select(Var2 = ID, Form_2 = Form, Cognacy_2 = Cognacy)

#vector of unique concepts to loop over
Parameters_ID_unique_vector <- forms$Parameter_ID %>% unique()

#index to start loop at
index <- 0

#for loop, calculating the lv dist each time for all words within each concept
for(Parameter in Parameters_ID_unique_vector){

  index <- index + 1

  cat(paste0("I'm on ", Parameters_ID_unique_vector[index], ". Which is index ", index, " out of ", length(Parameters_ID_unique_vector), ".\n"))

  forms_spec <- forms %>%
    filter(Parameter_ID == Parameters_ID_unique_vector[index])
  #filter(Parameter_ID == "122_water")

  form_vec <- as.vector(forms_spec$Form)

  names(form_vec) <- forms_spec$ID

  #pairwise Levenshtein distances between all forms of this concept
  dists <- stringdistmatrix(a = form_vec, b = form_vec, method = "lv",  useNames = "names")

  #keep only the lower triangle, so each pair appears once (Var1 = row, Var2 = column)
  dists[upper.tri(dists, diag = T)] <- NA

  dists_long <- dists %>%
    reshape2::melt() %>%
    filter(!is.na(value)) %>%
    filter(value <= 2)  %>%
    distinct() %>%
    mutate(Var1 = as.character(Var1)) %>%
    mutate(Var2 = as.character(Var2)) %>%
    rename(lv_dist = value) %>%
    left_join(left, by = "Var1") %>%
    left_join(right, by = "Var2") %>%
    distinct()

  dist_full <- full_join(dist_full, dists_long, by = c("Var1", "Var2", "lv_dist", "Form_1", "Cognacy_1", "Form_2", "Cognacy_2"))  %>%
    distinct()

}

#different cognate same form
different_cognate_same_form_incl_multiple <- forms %>% 
  filter(!is.na(Cognacy)) %>% 
  distinct(Form, Parameter_ID, Cognacy) %>% 
  group_by(Form, Parameter_ID) %>%
  summarise(n = n(), .groups = "drop") %>% 
  filter(n > 1) 

different_cognate_same_form_excl_multiple <- forms %>% 
  filter(!is.na(Cognacy)) %>% 
  filter(!str_detect(Cognacy, ",")) %>% 
  distinct(Form, Parameter_ID, Cognacy) %>% 
  group_by(Form, Parameter_ID) %>%
  summarise(n = n(), .groups = "drop") %>% 
  filter(n > 1) %>% 
  arrange(desc(n))

#make sure the output directory exists before writing
dir.create("output", showWarnings = FALSE)

different_cognate_same_form_excl_multiple %>% 
  write_csv("output/abvd_different_cognate_same_form_excl_multiple.csv", na = "")

cat("There are ", nrow(different_cognate_same_form_excl_multiple), " concept-form matchings (e.g. '144_toburn' - 'sunu') where identical forms are assigned to different cognacy classes. If you also include cases of multiple cognacy (e.g. '1, 50'), there are ", nrow(different_cognate_same_form_incl_multiple), " of this kind.\n", sep = "")

#forms with missing cognacy that could probably be filled in easily
possible_matches <- dist_full %>% 
  filter(is.na(Cognacy_2)) %>% 
  filter(!is.na(Cognacy_1)) %>% 
  filter(lv_dist <= 0)

possible_matches %>% 
  write_csv("output/abvd_possible_matches.csv", na = "")

cat("There are ", nrow(possible_matches), " words where you could easily fill in the cognacy because they are identical to other words which are already filled in for cognacy. For example, 'tangan' for the concept hand is assigned cognacy class 18 in some languages but no cognacy in others. The forms that can be filled in like this amount to ", round(nrow(possible_matches) / nrow(forms), 2) * 100, "% of the entire dataset.\n", sep = "")

xrotwang commented 2 months ago

This could be a good approach to fill in gaps in ABVD. I couldn't get the same counts, though, when checking. I only count 8,335 cases of non-cognate-coded forms which are identical to cognate-coded forms in the same concept slot (vs your 24903 from case 1.)

Here's my code (it does not use the Cognacy column, but the items from CognateTable):

import collections

from csvw.dsv import UnicodeWriter
from pycldf import Dataset

def check(ds):
    cogs = collections.defaultdict(set)
    for c in ds.objects('CognateTable'):
        cogs[c.cldf.formReference].add(c.cldf.cognatesetReference)

    # Map form, concept pairs to sets of cognateset IDs
    forms = collections.defaultdict(set)
    for form in ds.objects('FormTable'):
        if form.id in cogs:  # A cognate-coded form
            forms[(form.cldf.form, form.cldf.parameterReference)] |= cogs[form.id]

    with UnicodeWriter('res.csv') as w:
        w.writerow(['ID', 'Form', 'Language', 'Concept', 'Ncogs'])
        for form in ds.objects('FormTable'):
            if form.id not in cogs:  # A non-cognate-coded form ...
                if (form.cldf.form, form.cldf.parameterReference) in forms:
                    # ... but there are identical, cognate-coded forms.
                    csids = forms[(form.cldf.form, form.cldf.parameterReference)]
                    w.writerow([form.id, form.cldf.form, form.cldf.languageReference, form.cldf.parameterReference, len(csids)])

if __name__ == '__main__':
    check(Dataset.from_metadata('cldf/cldf-metadata.json'))

The result is a table listing all 9716 forms that are not cognate-coded but have identical forms in other languages which are, together with the number of cognatesets these identical forms are assigned to. I get:

  5. "Ncogs"

    Type of data:          Number
    Contains null values:  False
    Unique values:         5
    Smallest value:        1,
    Largest value:         5,
    Sum:                   11.409,
    Mean:                  1,174
    Median:                1,
    StDev:                 0,464
    Most common values:    1, (8335x)
                           2, (1103x)
                           3, (245x)
                           4, (32x)
                           5, (1x)

Row count: 9716
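
As a quick cross-check of these statistics, the figures can be recomputed from the "Most common values" counts. The sketch below assumes the output uses a locale where '.' is the thousands separator and ',' the decimal mark, so "11.409" reads as 11409 and "1,174" as 1.174.

```python
# Counts of uncoded forms per number of matching cognate sets,
# copied from the "Most common values" listing.
ncogs_counts = {1: 8335, 2: 1103, 3: 245, 4: 32, 5: 1}

row_count = sum(ncogs_counts.values())
total = sum(n * count for n, count in ncogs_counts.items())
mean = total / row_count
unambiguous_share = ncogs_counts[1] / row_count

print(row_count)                    # corresponds to "Row count"
print(total)                        # corresponds to "Sum"
print(round(mean, 3))               # corresponds to "Mean"
print(round(unambiguous_share, 3))  # share of forms with exactly one candidate set
```

Only the forms with Ncogs = 1 point to a single candidate cognate set; the remaining ones match identical forms that are spread over two or more sets.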

xrotwang commented 2 months ago

Just confirmed my numbers via SQL:

select
  count(distinct f1.cldf_id)
from
  formtable as f1
join
  formtable as f2
on
  f1.cldf_form = f2.cldf_form and
  f1.cldf_parameterReference = f2.cldf_parameterReference
where
  f1.cldf_id not in (select cldf_formReference from cognatetable) and
  f2.cldf_id in (select cldf_formReference from cognatetable)

gives:

$ sqlite3 abvd.sqlite < q.sql 
9716

xrotwang commented 2 months ago

Or more performant and transparent:

select
  ncogs, count(cldf_id)
from (
  select
    f1.cldf_id, count(distinct c.cldf_cognatesetReference) as ncogs
  from
    formtable as f1
  join
    formtable as f2
  on
    f1.cldf_form = f2.cldf_form and
    f1.cldf_parameterReference = f2.cldf_parameterReference
  join
    cognatetable as c
  on
    f2.cldf_id = c.cldf_formReference
  where
    f1.cldf_id not in (select cldf_formReference from cognatetable)
  group by
    f1.cldf_id
)
group by ncogs

yielding

ncogs	cldf_id
1	8335
2	1103
3	245
4	32
5	1

HedvigS commented 2 months ago

Thanks @xrotwang !

I don't fully follow those scripts, but that's okay. I think you're def right that my original approach over-estimates category 1 numbers.

I did figure out one improvement as I was reading through: the original script spits back too many rows for category 1 (possible matches) because it doesn't cut down duplicates correctly. I've made a small adjustment so that it spits out a summary per unknown form instead:

#forms with missing cognacy that could probably be filled in easily
possible_matches <- dist_full %>% 
  filter(is.na(Cognacy_2)) %>% 
  filter(!is.na(Cognacy_1)) %>% 
  filter(lv_dist <= 0) %>% 
  group_by(Var2, Form_2) %>% 
  summarise(Cogancy_suggestions = paste0(unique(Cognacy_1), collapse = "; "))

There are far fewer rows that way (2756), and the output looks like this:

Var2	Form_2	Cogancy_suggestions
1-13_back-1	tundun	55
1-81_sharp-1	mangan	64
10-189_who-1	mei	65
100-177_this-1	di	13
100-88_tosqueeze-1	kuku	14
1000-185_we-1	kinta	I
101-140_dry-1	kor	29
1013-185_we-3	kəlau	I
1014-185_we-4	kam	2,65
1018-185_we-3	kir	I
1018-185_we-4	kim	E
102-127_woodsforest-1	ao	15,14
1029-185_we-4	kemem	E42
119-182_i-1	ja	1; 1,21

(I used semi-colon to separate multiple suggestions, commas denote compound or sub-cognacy class.)

abvd_possible_matches.csv (https://github.com/user-attachments/files/16026389/abvd_possible_matches.csv)
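
To make the difference between the two counts concrete, here is a minimal Python sketch on invented toy data (the IDs, the concept ID '1_hand' and the values are assumptions for illustration, not rows from ABVD). It counts one row per pair of uncoded and coded form, and separately builds one summary entry per uncoded form, joining distinct suggestions with a semicolon.

```python
from collections import defaultdict

# Toy forms: (ID, Parameter_ID, Form, Cognacy); all values are invented.
forms = [
    ("a-1", "1_hand", "tangan", "18"),
    ("b-1", "1_hand", "tangan", "18"),
    ("c-1", "1_hand", "tangan", "18"),
    ("d-1", "1_hand", "tangan", None),
    ("e-1", "182_i", "ja", "1"),
    ("f-1", "182_i", "ja", "1,21"),
    ("g-1", "182_i", "ja", None),
]

# Index the coded forms by (concept, form).
coded = defaultdict(list)
for _id, concept, form, cognacy in forms:
    if cognacy is not None:
        coded[(concept, form)].append(cognacy)

pair_rows = 0     # one row per (uncoded form, identical coded form) pair
suggestions = {}  # one entry per uncoded form
for form_id, concept, form, cognacy in forms:
    matches = coded.get((concept, form), [])
    if cognacy is None and matches:
        pair_rows += len(matches)
        # dict.fromkeys keeps first-seen order while dropping repeated labels
        suggestions[form_id] = "; ".join(dict.fromkeys(matches))

print(pair_rows)
print(suggestions)
```

Counting pairs gives five rows for only two uncoded forms, which is the kind of inflation a per-form summary avoids.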

If the ABVD-team wants to go ahead with occurrences of type (1) and/or (2), I'd be happy to dig further into the differences between my approach and Robert's, look at other ways of finding these, etc. Right now, I just wanted to present the basic idea here and hear whether anyone wants to proceed with it at all in terms of expert reviewing and implementation.

Category 1 occurrences (probably easy to fill in) seem most important to me, since software like BEAST2 will treat all missing cognacy classes as lots of unique cognacy classes, unless I'm mistaken. Category 2 is also important, but perhaps less so (also fewer occurrences).

xrotwang commented 2 months ago

FYI: Judging from the output, your script now seems to miss some possible matches, e.g. 999-206_ten-1

$ csvgrep -c Form -m"hampuluʔ" abvd-cldf/cldf/forms.csv | csvgrep -c Parameter_ID -m"206" | csvcut -c ID,Form,Cognacy
ID,Form,Cognacy
431-206_ten-1,hampuluʔ,5
999-206_ten-1,hampuluʔ,
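
One possible explanation, based only on reading the R script and not confirmed in this thread: the script keeps just the lower triangle of the distance matrix and then requires the uncoded form to sit in the Var2 (column) position. Under that reading, an uncoded form that comes after its coded twin in the file is dropped. The Python sketch below imitates that filter on the two rows above; the function names are hypothetical.

```python
# The two forms from the csvgrep output, in file order: (ID, Form, Cognacy).
forms = [
    ("431-206_ten-1", "hampuluʔ", "5"),
    ("999-206_ten-1", "hampuluʔ", None),
]


def lower_triangle_pairs(forms):
    """Pairs (row, col) with row index > col index, as left after masking the upper triangle."""
    return [(forms[i], forms[j]) for i in range(len(forms)) for j in range(i)]


def one_sided(forms):
    """Keep pairs where the row form is coded and the column form is not (imitates the filter)."""
    return [col[0] for row, col in lower_triangle_pairs(forms)
            if row[1] == col[1] and row[2] is not None and col[2] is None]


def symmetric(forms):
    """Also accept the mirrored case, so the order of the forms no longer matters."""
    found = []
    for row, col in lower_triangle_pairs(forms):
        if row[1] != col[1]:
            continue
        if row[2] is not None and col[2] is None:
            found.append(col[0])
        elif row[2] is None and col[2] is not None:
            found.append(row[0])
    return found


print(one_sided(forms))
print(symmetric(forms))
```

In this sketch the one-sided filter returns nothing, while the symmetric variant reports 999-206_ten-1.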

SimonGreenhill commented 2 months ago

We do not want to do this. Yes, I realise that many cognates can be easily filled in, and are probably cognate. However, we decided a very long time ago that each cognate judgement would be checked and no automated guessing would be done.

> since software like BEAST2 will treat all missing cognacy classes as lots of unique cognacy classes.

No, it depends on how the nexus file is made.

HedvigS commented 2 months ago

> We do not want to do this. Yes, I realise that many cognates can be easily filled in, and are probably cognate. However, we decided a very long time ago that each cognate judgement would be checked and no automated guessing would be done.

Okay, not doing. I was just following up on discussions with Russell and Mary and just wanted to get it wrapped up. One way of wrapping up is to not do it. Understood!

> > since software like BEAST2 will treat all missing cognacy classes as lots of unique cognacy classes.
>
> No, it depends on how the nexus file is made.

Exactly, I just thought that was how it was done for the analysis of ABVD.
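
To illustrate the point that the treatment of uncoded forms depends on how the nexus file is built, here is a hedged Python sketch of two encodings that are possible in principle for a binary cognate matrix. Which one the ABVD analyses use is not stated in this thread; the language names, classes and the function `binary_rows` are invented for illustration.

```python
# Toy data for one concept: language -> cognacy class, None meaning "not coded".
# Languages and classes are invented for illustration.
cognacy = {"LangA": "18", "LangB": "18", "LangC": "46", "LangD": None}


def binary_rows(cognacy, uncoded_as_missing=True):
    """Return (columns, rows) of a presence/absence matrix for one concept."""
    classes = sorted({c for c in cognacy.values() if c is not None})
    if not uncoded_as_missing:
        # Alternative: give every uncoded form its own singleton class.
        classes += [f"uncoded_{lang}" for lang, c in cognacy.items() if c is None]
    rows = {}
    for lang, c in cognacy.items():
        if c is None and uncoded_as_missing:
            rows[lang] = "?" * len(classes)  # unknown for every class of this concept
        else:
            own = c if c is not None else f"uncoded_{lang}"
            rows[lang] = "".join("1" if col == own else "0" for col in classes)
    return classes, rows


for as_missing in (True, False):
    columns, rows = binary_rows(cognacy, uncoded_as_missing=as_missing)
    print(columns)
    for lang, row in rows.items():
        print(f"{lang:<8}{row}")
```

With the first encoding the uncoded form contributes only missing data; with the second it adds a new column that no other language shares.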

lexibank / abvd, issue #22