lexibank / abvd

CLDF dataset derived from Greenhill et al.'s "Austronesian Basic Vocabulary Database" from 2020.
https://abvd.eva.mpg.de
Creative Commons Attribution 4.0 International

reality check cognacy - possible extra cognates filled in relatively easily! #22

Closed HedvigS closed 2 months ago

HedvigS commented 2 months ago

I checked the FormTable for two kinds of issues that could be found easily and presented to a human reviewer for improvement of ABVD cognacy. Some forms may be identical but shouldn't belong to the same cognacy class, and vice versa, so human reviewing is necessary. I'm presenting these instances for the ABVD-team to consider.

  1. forms without cognacy that are identical to other forms for the same concept which have assigned cognacy
example:

Concept   Form_1  Cognacy_1  Form_2  Cognacy_2
hand      tangan  18         tangan
left      karuk   83         karuk
legfoot   au      80         au
legfoot   kuku    46         kuku

The forms that could be filled in this way make up 8% of the entire dataset (24903 forms).

EDIT: I didn't cut down duplicate matches appropriately, so the real number is smaller (somewhere between 2,000 and 9,000), and @xrotwang and I get different numbers. If the ABVD-team wants to investigate this, I can spend more time fine-tuning the pattern-finding.

abvd_possible_matches.csv

  2. same form, same concept but different cognacy classes.
example:

Language_ID  Form  Cognacy
134          sunu  1?
163          sunu  106
1353         sunu  4
1368         sunu  1

There are 661 concept-form matchings ('144_toburn' - 'sunu') where there are identical forms assigned to different cognacy classes. If you also include cases of multiple cognacy (e.g. '1, 50') there are 4910 of this kind.

abvd_different_cognate_same_form_excl_multiple.csv
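
For readers who prefer Python, the category 2 check can be sketched roughly as follows. This is only an illustration on toy rows modelled on the 'sunu' example, not the script that produced the counts above. The function name `conflicting_forms`, the tuple layout and the `1_hand` rows are made up for the example. As in the R script, '1?' and '1' are compared as plain strings and therefore count as different classes.

```python
from collections import defaultdict

# Toy rows: (Language_ID, Parameter_ID, Form, Cognacy).
# The 'sunu' rows mirror the example above; the 'tangan' rows are invented.
rows = [
    ("134", "144_toburn", "sunu", "1?"),
    ("163", "144_toburn", "sunu", "106"),
    ("1353", "144_toburn", "sunu", "4"),
    ("1368", "144_toburn", "sunu", "1"),
    ("1", "1_hand", "tangan", "18"),
    ("2", "1_hand", "tangan", "18"),
    ("3", "1_hand", "tangan", None),
]


def conflicting_forms(rows, include_multiple=False):
    """Return {(Parameter_ID, Form): set of Cognacy strings} where more than one string occurs."""
    seen = defaultdict(set)
    for _language, concept, form, cognacy in rows:
        if cognacy is None:
            continue  # uncoded forms belong to category 1, not category 2
        if not include_multiple and "," in cognacy:
            continue  # skip multiple cognacy such as '1, 50'
        seen[(concept, form)].add(cognacy)
    return {key: values for key, values in seen.items() if len(values) > 1}


conflicts = conflicting_forms(rows)
for (concept, form), values in conflicts.items():
    print(concept, form, sorted(values))
```

Whether uncertain judgements like '1?' should be collapsed with '1' before comparing is a decision for a human reviewer; the sketch deliberately does not do it.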

Rscript for finding these instances:

library(tidyverse)
library(cluster)
library(reshape2)
library(stringdist)

forms <- read_csv("https://github.com/lexibank/abvd/raw/ccff2bc86c30b102cd5b95174fafb378ddc0d3eb/cldf/forms.csv", show_col_types = F)

unknown_cognacy <-  forms %>% 
  filter(is.na(Cognacy)) %>% 
  dplyr::select(ID_Var1= ID, Var1 = Form)  

known_cognacy <- forms %>% 
  filter(!is.na(Cognacy)) %>% 
  dplyr::select(Var2 = Form, Var2_Cognacy = Cognacy)  

percentage_unknown <- round(nrow(unknown_cognacy)/ (nrow(known_cognacy) + nrow(unknown_cognacy) ), digits = 2)

known_cognacy <- known_cognacy %>% distinct()
unknown_cognacy <- unknown_cognacy %>% distinct()

#make df to join to in loop
dist_full <- matrix(nrow = 0, ncol = 7) %>% 
  as.data.frame() %>% 
  rename("Var1" = V1, "Var2" = V2, "lv_dist"= V3, "Form_1" = V4, "Cognacy_1" = V5, "Form_2" = V6, "Cognacy_2" = V7) %>% 
  mutate_if(.predicate = is.logical, as.character) %>% 
  mutate(lv_dist = as.numeric(lv_dist))

##
#df to join info on the side of dists df
left <- forms %>%
  dplyr::select(Var1 = ID, Form_1 = Form, Cognacy_1 = Cognacy)
right <- forms %>%
  dplyr::select(Var2 = ID, Form_2 = Form, Cognacy_2 = Cognacy)

#vector of unique concepts to loop over
Parameters_ID_unique_vector <- forms$Parameter_ID %>% unique()

#index to start loop at
index <- 0

#for loop, calculating the lv dist each time for all words within each concept
for(Parameter in Parameters_ID_unique_vector){

  index <- index + 1

  cat(paste0("I'm on ", Parameters_ID_unique_vector[index], ". Which is index ", index, " out of ", length(Parameters_ID_unique_vector), ".\n"))

  forms_spec <- forms %>%
    filter(Parameter_ID == Parameters_ID_unique_vector[index])
  #filter(Parameter_ID == "122_water")

  form_vec <- as.vector(forms_spec$Form)

  names(form_vec) <- forms_spec$ID

  dists <- stringdistmatrix(a = form_vec, b = form_vec, method = "lv",  useNames = "names")

  dists[upper.tri(dists, diag = T)] <- NA

  dists_long <- dists %>%
    reshape2::melt() %>%
    filter(!is.na(value)) %>%
    filter(value <= 2)  %>%
    distinct() %>%
    mutate(Var1 = as.character(Var1)) %>%
    mutate(Var2 = as.character(Var2)) %>%
    rename(lv_dist = value) %>%
    left_join(left, by = "Var1") %>%
    left_join(right, by = "Var2") %>%
    distinct()

  dist_full <- full_join(dist_full, dists_long, by = c("Var1", "Var2", "lv_dist", "Form_1", "Cognacy_1", "Form_2", "Cognacy_2"))  %>%
    distinct()

}

#different cognate same form
different_cognate_same_form_incl_multiple <- forms %>% 
  filter(!is.na(Cognacy)) %>% 
  distinct(Form, Parameter_ID, Cognacy) %>% 
  group_by(Form, Parameter_ID) %>%
  summarise(n = n(), .groups = "drop") %>% 
  filter(n > 1) 

different_cognate_same_form_excl_multiple <- forms %>% 
  filter(!is.na(Cognacy)) %>% 
  filter(!str_detect(Cognacy, ",")) %>% 
  distinct(Form, Parameter_ID, Cognacy) %>% 
  group_by(Form, Parameter_ID) %>%
  summarise(n = n(), .groups = "drop") %>% 
  filter(n > 1) %>% 
  arrange(desc(n))

different_cognate_same_form_excl_multiple %>% 
  write_csv("output/abvd_different_cognate_same_form_excl_multiple.csv", na = "")

cat("There are ", nrow(different_cognate_same_form_excl_multiple), " concept-form matchings ('144_toburn' - 'sunu') where there are identical forms assigned to different cognacy classes. If you also include cases of multiple cognacy (e.g. '1, 50'), there are ", nrow(different_cognate_same_form_incl_multiple), " of this kind.\n", sep = "")

#forms with missing cognacy that could probably be filled in easily
possible_matches <- dist_full %>% 
  filter(is.na(Cognacy_2)) %>% 
  filter(!is.na(Cognacy_1)) %>% 
  filter(lv_dist <= 0)

possible_matches %>% 
  write_csv("output/abvd_possible_matches.csv", na = "")

cat("There are ", nrow(possible_matches), " words where you could easily fill in the cognacy because they are identical to other words which are already filled in for cognacy. For example, 'tangan' for the concept hand is assigned cognacy class 18 in some languages but no cognacy in others. The amount that can be filled in like this are ", round(nrow(possible_matches) / nrow(forms), 2) * 100, "% of the entire dataset.\n", sep = "")
xrotwang commented 2 months ago

This could be a good approach to fill in gaps in ABVD. I couldn't get the same counts, though, when checking. I only count 8,335 cases of non-cognate-coded forms which are identical to cognate-coded forms in the same concept slot (vs your 24903 from case 1.)

Here's my code (it does not use the Cognacy column, but the items from CognateTable):

import collections

from csvw.dsv import UnicodeWriter
from pycldf import Dataset

def check(ds):
    cogs = collections.defaultdict(set)
    for c in ds.objects('CognateTable'):
        cogs[c.cldf.formReference].add(c.cldf.cognatesetReference)

    # Map form, concept pairs to sets of cognateset IDs
    forms = collections.defaultdict(set)
    for form in ds.objects('FormTable'):
        if form.id in cogs:  # A cognate-coded form
            forms[(form.cldf.form, form.cldf.parameterReference)] |= cogs[form.id]

    with UnicodeWriter('res.csv') as w:
        w.writerow(['ID', 'Form', 'Language', 'Concept', 'Ncogs'])
        for form in ds.objects('FormTable'):
            if form.id not in cogs:  # A non-cognate-coded form ...
                if (form.cldf.form, form.cldf.parameterReference) in forms:
                    # ... but there are identical, cognate-coded forms.
                    csids = forms[(form.cldf.form, form.cldf.parameterReference)]
                    w.writerow([form.id, form.cldf.form, form.cldf.languageReference, form.cldf.parameterReference, len(csids)])

if __name__ == '__main__':
    check(Dataset.from_metadata('cldf/cldf-metadata.json'))

The result is a table listing all 9716 forms that are not cognate-coded, but have identical forms for other languages which are, and the number of cognatesets these identical forms are assigned to, and I get:

  5. "Ncogs"

    Type of data:          Number
    Contains null values:  False
    Unique values:         5
    Smallest value:        1,
    Largest value:         5,
    Sum:                   11.409,
    Mean:                  1,174
    Median:                1,
    StDev:                 0,464
    Most common values:    1, (8335x)
                           2, (1103x)
                           3, (245x)
                           4, (32x)
                           5, (1x)

Row count: 9716
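
The summary above appears to be printed with a comma as decimal separator and a period as thousands separator, so "11.409" reads as 11409 and "1,174" as 1.174. A quick arithmetic check of that reading, using only the counts listed under "Most common values" (the variable names are made up for the example):

```python
# Number of uncoded forms per Ncogs value, copied from the summary above.
ncogs_counts = {1: 8335, 2: 1103, 3: 245, 4: 32, 5: 1}

total_forms = sum(ncogs_counts.values())  # should equal the row count
total_ncogs = sum(n * count for n, count in ncogs_counts.items())  # the "Sum" line
mean_ncogs = total_ncogs / total_forms  # the "Mean" line
unambiguous_share = ncogs_counts[1] / total_forms  # forms with exactly one candidate cognate set

print(total_forms, total_ncogs, round(mean_ncogs, 3), round(unambiguous_share, 3))
```

Under that reading the 8,335 figure is the subset of the 9716 matching forms whose identical, cognate-coded counterparts all point to a single cognate set.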
xrotwang commented 2 months ago

Just confirmed my numbers via SQL:

select
  count(distinct f1.cldf_id)
from
  formtable as f1
join
  formtable as f2
on
  f1.cldf_form = f2.cldf_form and
  f1.cldf_parameterReference = f2.cldf_parameterReference
where
  f1.cldf_id not in (select cldf_formReference from cognatetable) and
  f2.cldf_id in (select cldf_formReference from cognatetable)

gives:

$ sqlite3 abvd.sqlite < q.sql 
9716
xrotwang commented 2 months ago

Or more performant and transparent:

select
  ncogs, count(cldf_id)
from (
  select
    f1.cldf_id, count(distinct c.cldf_cognatesetReference) as ncogs
  from
    formtable as f1
  join
    formtable as f2
  on
    f1.cldf_form = f2.cldf_form and
    f1.cldf_parameterReference = f2.cldf_parameterReference
  join
    cognatetable as c
  on
    f2.cldf_id = c.cldf_formReference
  where
    f1.cldf_id not in (select cldf_formReference from cognatetable)
  group by
    f1.cldf_id
)
group by ncogs

yielding

ncogs cldf_id
1 8335
2 1103
3 245
4 32
5 1
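
To see what the grouped query does without downloading the dataset, here is a minimal sketch that runs it against an in-memory SQLite database from Python. The table and column names follow the query above; the seven toy forms, their IDs and the cognate set IDs are invented, and the `order by ncogs` clause is an addition to make the row order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table formtable (cldf_id text, cldf_form text, cldf_parameterReference text);
create table cognatetable (cldf_formReference text, cldf_cognatesetReference text);
insert into formtable values
  ('a-1', 'tangan', 'hand'),
  ('b-1', 'tangan', 'hand'),
  ('c-1', 'tangan', 'hand'),
  ('d-1', 'sunu', 'toburn'),
  ('e-1', 'sunu', 'toburn'),
  ('f-1', 'sunu', 'toburn'),
  ('g-1', 'lima', 'hand');
-- c-1, f-1 and g-1 are left without cognate judgements
insert into cognatetable values
  ('a-1', 'hand-18'),
  ('b-1', 'hand-18'),
  ('d-1', 'toburn-1'),
  ('e-1', 'toburn-4');
""")

query = """
select ncogs, count(cldf_id)
from (
  select f1.cldf_id, count(distinct c.cldf_cognatesetReference) as ncogs
  from formtable as f1
  join formtable as f2
    on f1.cldf_form = f2.cldf_form
   and f1.cldf_parameterReference = f2.cldf_parameterReference
  join cognatetable as c
    on f2.cldf_id = c.cldf_formReference
  where f1.cldf_id not in (select cldf_formReference from cognatetable)
  group by f1.cldf_id
)
group by ncogs
order by ncogs
"""
rows = conn.execute(query).fetchall()
print(rows)
```

In the toy data, 'c-1' matches forms from one cognate set, 'f-1' matches forms from two, and 'g-1' has no coded counterpart and drops out of the join.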
HedvigS commented 2 months ago

Thanks @xrotwang !

I don't fully follow those scripts, but that's okay. I think you're def right that my original approach over-estimates category 1 numbers.

I did figure out one improvement as I was reading through: the original script spits back too many rows for category 1 (possible matches) because it doesn't cut down duplicates correctly. I've made a small adjustment so that it spits out a summary per unknown form instead:

#forms with missing cognacy that could probably be filled in easily
#attached file: abvd_possible_matches.csv (https://github.com/user-attachments/files/16026389/abvd_possible_matches.csv)
possible_matches <- dist_full %>% 
  filter(is.na(Cognacy_2)) %>% 
  filter(!is.na(Cognacy_1)) %>% 
  filter(lv_dist <= 0) %>% 
  group_by(Var2, Form_2) %>% 
  summarise(Cogancy_suggestions = paste0(unique(Cognacy_1), collapse = "; "))

There are far fewer rows that way (2756), and it looks like this:

Var2                   Form_2  Cogancy_suggestions
1-13_back-1            tundun  55
1-81_sharp-1           mangan  64
10-189_who-1           mei     65
100-177_this-1         di      13
100-88_tosqueeze-1     kuku    14
1000-185_we-1          kinta   I
101-140_dry-1          kor     29
1013-185_we-3          kəlau   I
1014-185_we-4          kam     2,65
1018-185_we-3          kir     I
1018-185_we-4          kim     E
102-127_woodsforest-1  ao      15,14
1029-185_we-4          kemem   E42
119-182_i-1            ja      1; 1,21

(I used semi-colon to separate multiple suggestions, commas denote compound or sub-cognacy class.)
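
Anyone reading the CSV programmatically would need to split on both separators. A minimal sketch, assuming the cell format described above (semicolons between suggestions, commas inside one suggestion); `parse_suggestions` is a made-up helper, not part of any script in this thread:

```python
def parse_suggestions(cell):
    """Split one Cogancy_suggestions cell into a list of suggestions,
    each suggestion being a list of cognacy class labels."""
    suggestions = []
    for suggestion in cell.split(";"):
        labels = [label.strip() for label in suggestion.split(",") if label.strip()]
        if labels:
            suggestions.append(labels)
    return suggestions


# Cells taken from the table above.
examples = ["55", "2,65", "1; 1,21"]
parsed = {cell: parse_suggestions(cell) for cell in examples}
for cell, suggestions in parsed.items():
    print(repr(cell), "->", suggestions)
```

A cell with more than one suggestion, like the last one, is a case where the identical coded forms disagree and a reviewer would have to choose.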

abvd_possible_matches.csv

If the ABVD-team wants to go ahead with occurrences of type (1) and/or (2), then I'd be happy to dig more into the differences between my approach and Robert's, other ways of finding these, etc. Right now, I just wanted to present the basic idea here and hear whether anyone wants to proceed with it at all in terms of expert reviewing and implementation.

Category 1 occurrences (probably easy to fill in) seem most important to me, since software like BEAST2 will treat all missing cognacy classes as lots of unique cognacy classes, unless I'm mistaken. Category 2 is also important, but perhaps less so (also fewer occurrences).

xrotwang commented 2 months ago

FYI: Judging from the output, your script now seems to miss some possible matches, e.g. 999-206_ten-1

$ csvgrep -c Form -m"hampuluʔ" abvd-cldf/cldf/forms.csv | csvgrep -c Parameter_ID -m"206" | csvcut -c ID,Form,Cognacy
ID,Form,Cognacy
431-206_ten-1,hampuluʔ,5
999-206_ten-1,hampuluʔ,
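
One possible explanation, not confirmed in this thread: the R script keeps only one triangle of the pairwise distance matrix and then filters in one direction only (Cognacy_1 present, Cognacy_2 missing). If rows end up as Var1 and columns as Var2, an uncoded form that comes after its coded twin in row order would never pass that filter. The sketch below mimics this on the two rows above; the function names are made up, and the assumption that the coded form precedes the uncoded one in forms.csv is untested.

```python
def one_sided_matches(forms):
    """forms: list of (ID, Form, Cognacy). Pair each row (Var1) with earlier rows (Var2) only,
    then keep pairs where Var1 is coded and Var2 is uncoded."""
    matches = []
    for i, (_id1, form1, cog1) in enumerate(forms):      # row    -> "Var1"
        for j, (id2, form2, cog2) in enumerate(forms):   # column -> "Var2"
            if i <= j:
                continue  # upper triangle and diagonal are masked out
            if form1 == form2 and cog1 is not None and cog2 is None:
                matches.append(id2)
    return matches


def two_sided_matches(forms):
    """Order-independent variant: any uncoded form whose string also occurs coded."""
    coded = {form for _id, form, cog in forms if cog is not None}
    return [id_ for id_, form, cog in forms if cog is None and form in coded]


forms = [
    ("431-206_ten-1", "hampuluʔ", "5"),
    ("999-206_ten-1", "hampuluʔ", None),
]
print(one_sided_matches(forms))        # coded form first: pair is dropped
print(one_sided_matches(forms[::-1]))  # uncoded form first: pair is found
print(two_sided_matches(forms))
```

If this is what happens, checking both directions (or dropping the triangle mask) would close the gap, at the cost of more duplicate pairs to collapse.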
SimonGreenhill commented 2 months ago

We do not want to do this. Yes, I realise that many cognates can be easily filled in, and are probably cognate. However, we decided a very long time ago that each cognate judgement would be checked and no automated guessing would be done.

> since software like BEAST2 will treat all missing cognacy classes as lots of unique cognacy classes.

No, it depends on how the nexus file is made.
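
To illustrate the point, here is a rough sketch of two ways a binary cognate matrix could be built from the same data: uncoded forms written as missing data ('?'), or uncoded forms turned into their own singleton classes. This is not the ABVD or BEAST2 pipeline; the function `binary_matrix`, the language names and the toy judgements are all invented for the example.

```python
def binary_matrix(data, uncoded_as_singletons=False):
    """data: {language: {concept: cognate class or None}}.
    Returns (characters, {language: string of '0', '1' and '?'})."""
    if uncoded_as_singletons:
        # Give every uncoded form its own class, unique to that language.
        data = {
            lang: {c: (cls if cls is not None else f"unique_{lang}") for c, cls in concepts.items()}
            for lang, concepts in data.items()
        }
    # One binary character per (concept, cognate class) pair.
    characters = sorted(
        {(c, cls) for concepts in data.values() for c, cls in concepts.items() if cls is not None}
    )
    rows = {}
    for lang, concepts in data.items():
        cells = []
        for concept, cls in characters:
            own = concepts.get(concept)
            cells.append("?" if own is None else ("1" if own == cls else "0"))
        rows[lang] = "".join(cells)
    return characters, rows


data = {
    "A": {"hand": "18", "ten": "5"},
    "B": {"hand": "18", "ten": None},  # form attested but not cognate-coded
    "C": {"hand": "2", "ten": "5"},
}

chars_missing, rows_missing = binary_matrix(data)
chars_single, rows_single = binary_matrix(data, uncoded_as_singletons=True)
print(chars_missing, rows_missing)
print(chars_single, rows_single)
```

In the first coding, language B contributes no information for 'ten'; in the second, it is asserted to differ from A and C. Which of the two a given analysis used depends on the script that wrote the nexus file.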

HedvigS commented 2 months ago

> We do not want to do this. Yes, I realise that many cognates can be easily filled in, and are probably cognate. However, we decided a very long time ago that each cognate judgement would be checked and no automated guessing would be done.

Okay, not doing. I was just following up on discussions with Russell and Mary and just wanted to get it wrapped up. One way of wrapping up is to not do it. Understood!

> > since software like BEAST2 will treat all missing cognacy classes as lots of unique cognacy classes.
>
> No, it depends on how the nexus file is made.

Exactly, I just thought that was how it was done for the analysis of ABVD.