Closed HedvigS closed 2 months ago
This could be a good approach to fill in gaps in ABVD. I couldn't get the same counts, though, when checking. I only count 8,335 cases of non-cognate-coded forms which are identical to cognate-coded forms in the same concept slot (vs your 24903 from case 1.)
Here's my code (it does not use the `Cognacy` column, but the items from `CognateTable`):

    import collections

    from csvw.dsv import UnicodeWriter
    from pycldf import Dataset


    def check(ds):
        # Map form IDs to sets of cognateset IDs
        cogs = collections.defaultdict(set)
        for c in ds.objects('CognateTable'):
            cogs[c.cldf.formReference].add(c.cldf.cognatesetReference)

        # Map form, concept pairs to sets of cognateset IDs
        forms = collections.defaultdict(set)
        for form in ds.objects('FormTable'):
            if form.id in cogs:  # A cognate-coded form
                forms[(form.cldf.form, form.cldf.parameterReference)] |= cogs[form.id]

        with UnicodeWriter('res.csv') as w:
            w.writerow(['ID', 'Form', 'Language', 'Concept', 'Ncogs'])
            for form in ds.objects('FormTable'):
                if form.id not in cogs:  # A non-cognate-coded form ...
                    if (form.cldf.form, form.cldf.parameterReference) in forms:
                        # ... but there are identical, cognate-coded forms.
                        csids = forms[(form.cldf.form, form.cldf.parameterReference)]
                        w.writerow([
                            form.id,
                            form.cldf.form,
                            form.cldf.languageReference,
                            form.cldf.parameterReference,
                            len(csids)])


    if __name__ == '__main__':
        check(Dataset.from_metadata('cldf/cldf-metadata.json'))

The result is a table listing all 9716 forms that are not cognate-coded but have identical, cognate-coded forms for other languages, together with the number of cognatesets these identical forms are assigned to. For the `Ncogs` column I get:

    5. "Ncogs"
        Type of data: Number
        Contains null values: False
        Unique values: 5
        Smallest value: 1,
        Largest value: 5,
        Sum: 11.409,
        Mean: 1,174
        Median: 1,
        StDev: 0,464
        Most common values: 1, (8335x)
                            2, (1103x)
                            3, (245x)
                            4, (32x)
                            5, (1x)
    Row count: 9716

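The same tally can be recomputed with the standard library alone. This is a minimal sketch, assuming the `Ncogs` column written by the script above; `tally_ncogs` is a hypothetical helper name and the sample rows are made up for illustration:

```python
import collections
import csv
import io


def tally_ncogs(rows):
    """Count how many rows carry each Ncogs value."""
    return collections.Counter(int(row['Ncogs']) for row in rows)


# Toy stand-in for res.csv (made-up rows). For the real file, pass
# csv.DictReader(open('res.csv', newline='', encoding='utf-8')) instead.
sample = io.StringIO(
    'ID,Form,Language,Concept,Ncogs\n'
    'a-1,foo,l1,c1,1\n'
    'b-1,bar,l2,c1,2\n'
    'c-1,baz,l3,c2,1\n'
)
counts = tally_ncogs(csv.DictReader(sample))
for ncogs, n in sorted(counts.items()):
    print(ncogs, n)
```

Summing `ncogs * n` over the tally should reproduce the `Sum` reported above, which is a quick consistency check on the output.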
Just confirmed my numbers via SQL:

    select
        count(distinct f1.cldf_id)
    from
        formtable as f1
    join
        formtable as f2
    on
        f1.cldf_form = f2.cldf_form and
        f1.cldf_parameterReference = f2.cldf_parameterReference
    where
        f1.cldf_id not in (select cldf_formReference from cognatetable) and
        f2.cldf_id in (select cldf_formReference from cognatetable)

gives:

    $ sqlite3 abvd.sqlite < q.sql
    9716

Or more performant and transparent:

    select
        ncogs, count(cldf_id)
    from (
        select
            f1.cldf_id, count(distinct c.cldf_cognatesetReference) as ncogs
        from
            formtable as f1
        join
            formtable as f2
        on
            f1.cldf_form = f2.cldf_form and
            f1.cldf_parameterReference = f2.cldf_parameterReference
        join
            cognatetable as c
        on
            f2.cldf_id = c.cldf_formReference
        where
            f1.cldf_id not in (select cldf_formReference from cognatetable)
        group by
            f1.cldf_id
    )
    group by ncogs

yielding
ncogs | cldf_id |
---|---|
1 | 8335 |
2 | 1103 |
3 | 245 |
4 | 32 |
5 | 1 |
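
To see what the grouped query does, it can be run against a tiny in-memory database. This is only a sketch: the table and column names are taken from the SQL above, but the six forms and three cognate judgements are invented toy data, not ABVD rows.

```python
import sqlite3

QUERY = """
select ncogs, count(cldf_id)
from (
    select f1.cldf_id, count(distinct c.cldf_cognatesetReference) as ncogs
    from formtable as f1
    join formtable as f2
        on f1.cldf_form = f2.cldf_form
        and f1.cldf_parameterReference = f2.cldf_parameterReference
    join cognatetable as c
        on f2.cldf_id = c.cldf_formReference
    where f1.cldf_id not in (select cldf_formReference from cognatetable)
    group by f1.cldf_id
)
group by ncogs
"""

conn = sqlite3.connect(':memory:')
conn.executescript("""
create table formtable (cldf_id text, cldf_form text, cldf_parameterReference text);
create table cognatetable (cldf_formReference text, cldf_cognatesetReference text);
""")
conn.executemany('insert into formtable values (?, ?, ?)', [
    ('f1', 'mata', 'eye'),   # coded, cognateset A
    ('f2', 'mata', 'eye'),   # coded, cognateset B
    ('f3', 'mata', 'eye'),   # uncoded, matches f1 and f2: ncogs = 2
    ('f4', 'lima', 'five'),  # coded, cognateset C
    ('f5', 'lima', 'five'),  # uncoded, matches f4: ncogs = 1
    ('f6', 'lima', 'hand'),  # uncoded, same form but other concept: no match
])
conn.executemany('insert into cognatetable values (?, ?)', [
    ('f1', 'A'), ('f2', 'B'), ('f4', 'C'),
])
result = conn.execute(QUERY).fetchall()
print(sorted(result))
```

The uncoded form `f6` drops out because the join requires the same concept as well as the same form, which is the "same concept slot" condition mentioned at the top.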
Thanks @xrotwang !
I don't fully follow those scripts, but that's okay. I think you're def right that my original approach over-estimates category 1 numbers.
I did figure out one improvement as I was reading through: the original script returns too many rows for category 1 (possible matches) because it doesn't remove duplicates correctly. I've made a small adjustment so that it outputs one summary row per unknown form instead:

    #forms with missing cognacy that could probably be filled in easily
    possible_matches <- dist_full %>%
      filter(is.na(Cognacy_2)) %>%
      filter(!is.na(Cognacy_1)) %>%
      filter(lv_dist <= 0) %>%
      group_by(Var2, Form_2) %>%
      summarise(Cogancy_suggestions = paste0(unique(Cognacy_1), collapse = "; "))

[abvd_possible_matches.csv](https://github.com/user-attachments/files/16026389/abvd_possible_matches.csv)

There are far fewer rows that way (2756), and it looks like this:
Var2 | Form_2 | Cogancy_suggestions |
---|---|---|
1-13_back-1 | tundun | 55 |
1-81_sharp-1 | mangan | 64 |
10-189_who-1 | mei | 65 |
100-177_this-1 | di | 13 |
100-88_tosqueeze-1 | kuku | 14 |
1000-185_we-1 | kinta | I |
101-140_dry-1 | kor | 29 |
1013-185_we-3 | kəlau | I |
1014-185_we-4 | kam | 2,65 |
1018-185_we-3 | kir | I |
1018-185_we-4 | kim | E |
102-127_woodsforest-1 | ao | 15,14 |
1029-185_we-4 | kemem | E42 |
119-182_i-1 | ja | 1; 1,21 |
(I used semi-colon to separate multiple suggestions, commas denote compound or sub-cognacy class.)
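
For readers who don't use R, the filter-and-summarise step can be sketched in Python roughly as follows. The dictionary keys mirror the column names in the R snippet; `summarise_suggestions`, the `Var1` values and the toy pairs are hypothetical, and this is not a port of the full script.

```python
def summarise_suggestions(pairs):
    """Join the unique cognacy codes of identical, coded forms per uncoded form.

    Each pair is a dict with the keys used in the R snippet:
    Var2, Form_2, Cognacy_1, Cognacy_2 and lv_dist.
    """
    suggestions = {}
    for p in pairs:
        if p['Cognacy_2'] is not None:   # second form is already coded
            continue
        if p['Cognacy_1'] is None:       # first form offers no suggestion
            continue
        if p['lv_dist'] > 0:             # forms are not identical
            continue
        codes = suggestions.setdefault((p['Var2'], p['Form_2']), [])
        if p['Cognacy_1'] not in codes:  # keep first-seen order, like unique()
            codes.append(p['Cognacy_1'])
    return {key: '; '.join(codes) for key, codes in suggestions.items()}


# Made-up pairs, shaped so that they reproduce the last row of the table above.
toy_pairs = [
    {'Var1': 'x-1', 'Var2': '119-182_i-1', 'Form_2': 'ja',
     'Cognacy_1': '1', 'Cognacy_2': None, 'lv_dist': 0},
    {'Var1': 'y-1', 'Var2': '119-182_i-1', 'Form_2': 'ja',
     'Cognacy_1': '1,21', 'Cognacy_2': None, 'lv_dist': 0},
    {'Var1': 'z-1', 'Var2': '119-182_i-1', 'Form_2': 'ja',
     'Cognacy_1': '1', 'Cognacy_2': None, 'lv_dist': 0},
    {'Var1': 'w-1', 'Var2': '119-182_i-1', 'Form_2': 'ja',
     'Cognacy_1': '7', 'Cognacy_2': None, 'lv_dist': 1},  # not identical, ignored
]
summary = summarise_suggestions(toy_pairs)
print(summary)
```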
If the ABVD team wants to go ahead with occurrences of type (1) and/or (2), I'd be happy to dig further into the differences between my approach and Robert's, other ways of finding these cases, etc. Right now, I just wanted to present the basic idea here and hear whether anyone wants to proceed with it at all in terms of expert review and implementation.
Category 1 occurrences (probably easy to fill in) seem most important to me, since software like BEAST2 will treat all missing cognacy classes as lots of unique cognacy classes, unless I'm mistaken. Category 2 is also important, but perhaps less so (also fewer occurrences).
FYI: Judging from the output, your script now seems to miss some possible matches, e.g. 999-206_ten-1:

    $ csvgrep -c Form -m"hampuluʔ" abvd-cldf/cldf/forms.csv | csvgrep -c Parameter_ID -m"206" | csvcut -c ID,Form,Cognacy
    ID,Form,Cognacy
    431-206_ten-1,hampuluʔ,5
    999-206_ten-1,hampuluʔ,

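The same spot check can be done in Python against the `Cognacy` column. This is a sketch under assumptions: the two rows are the ones shown above, but the `Parameter_ID` value `206_ten` is a guess (the `csvgrep` call only shows that it contains `206`), and `missing_but_matchable` is a hypothetical helper name.

```python
import csv
import io


def missing_but_matchable(rows):
    """Return IDs of uncoded forms that share form and concept with a coded form."""
    rows = list(rows)
    coded = {(r['Form'], r['Parameter_ID']) for r in rows if r['Cognacy']}
    return [r['ID'] for r in rows
            if not r['Cognacy'] and (r['Form'], r['Parameter_ID']) in coded]


# Stand-in for forms.csv; the Parameter_ID values are assumed, not taken from ABVD.
sample = io.StringIO(
    'ID,Form,Parameter_ID,Cognacy\n'
    '431-206_ten-1,hampuluʔ,206_ten,5\n'
    '999-206_ten-1,hampuluʔ,206_ten,\n'
)
matches = missing_but_matchable(csv.DictReader(sample))
print(matches)
```

Any ID returned here that is absent from the R output would be a candidate for a missed match.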
We do not want to do this. Yes, I realise that many cognates can be easily filled in, and are probably cognate. However, we decided a very long time ago that each cognate judgement would be checked and no automated guessing would be done.

> since software like BEAST2 will treat all missing cognacy classes as lots of unique cognacy classes.

No, it depends on how the nexus file is made.
> We do not want to do this. Yes, I realise that many cognates can be easily filled in, and are probably cognate. However, we decided a very long time ago that each cognate judgement would be checked and no automated guessing would be done.

Okay, not doing. I was just following up on discussions with Russell and Mary and just wanted to get it wrapped up. One way of wrapping up is to not do it. Understood!
> > since software like BEAST2 will treat all missing cognacy classes as lots of unique cognacy classes.
>
> No, it depends on how the nexus file is made.

Exactly, I just thought that was how it was done for the analysis of ABVD.
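
To illustrate why the outcome depends on how the nexus file is made, here is a sketch of two possible ways to turn one concept's cognate judgements into binary characters. It does not describe how BEAST2 or any ABVD analysis actually encodes missing data; `encode_concept`, the `unique_` prefix and the toy languages are all hypothetical.

```python
def encode_concept(cognacy_by_language, missing_as_unknown=True):
    """Return (column labels, {language: character string}) for one concept.

    cognacy_by_language maps a language to its cognate class, or None if uncoded.
    """
    coded = sorted({c for c in cognacy_by_language.values() if c is not None})
    uncoded = sorted(l for l, c in cognacy_by_language.items() if c is None)
    # The second option adds one extra singleton column per uncoded form.
    columns = coded if missing_as_unknown else coded + ['unique_' + l for l in uncoded]
    rows = {}
    for lang, c in cognacy_by_language.items():
        if c is None and missing_as_unknown:
            rows[lang] = '?' * len(columns)  # missing data for every class
        else:
            label = c if c is not None else 'unique_' + lang
            rows[lang] = ''.join('1' if col == label else '0' for col in columns)
    return columns, rows


toy = {'lang_a': '1', 'lang_b': '1', 'lang_c': '2', 'lang_d': None}
_, as_unknown = encode_concept(toy)
_, as_singletons = encode_concept(toy, missing_as_unknown=False)
print(as_unknown)
print(as_singletons)
```

With the first option the uncoded form is marked as unknown (`?`) for every class of the concept. With the second it gets a class of its own, which is the behaviour the quoted sentence worries about.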
I checked the FormTable for two kinds of issues that could be found easily and presented to a human reviewer for improvement of ABVD cognacy. Some forms may be identical but shouldn't belong to the same cognacy class, and vice versa, so human reviewing is necessary. I'm presenting these instances for the ABVD-team to consider.
The forms that could be filled in this way amount to 8% of the entire dataset (24903 forms).
abvd_possible_matches.csv
There are 661 concept-form matchings (e.g. '144_toburn' - 'sunu') where identical forms are assigned to different cognacy classes. If you also include cases of multiple cognacy (e.g. '1, 50'), there are 4910 of this kind.
abvd_different_cognate_same_form_excl_multiple.csv
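
The second kind of check can be sketched in Python as below. This is not the R script itself: `conflicting_forms` is a hypothetical helper, the rows and their cognacy values are invented, and treating a code such as '1, 50' as one whole label is an assumption about how multiple cognacy is handled.

```python
import collections


def conflicting_forms(rows, include_multiple=False):
    """Find (concept, form) pairs whose identical forms carry different cognacy codes.

    rows is an iterable of (concept, form, cognacy) tuples; empty cognacy is skipped.
    """
    classes = collections.defaultdict(set)
    for concept, form, cognacy in rows:
        if not cognacy:
            continue
        if ',' in cognacy and not include_multiple:
            continue  # skip multiple-cognacy codes such as '1, 50'
        classes[(concept, form)].add(cognacy)
    return {key: codes for key, codes in classes.items() if len(codes) > 1}


# Made-up rows; only the concept/form pair '144_toburn'/'sunu' appears in the text.
toy_rows = [
    ('144_toburn', 'sunu', '1'),
    ('144_toburn', 'sunu', '2'),
    ('1_hand', 'lima', '1'),
    ('1_hand', 'lima', '1, 50'),
    ('1_hand', 'ima', '1'),
    ('1_hand', 'ima', ''),
]
strict = conflicting_forms(toy_rows)
loose = conflicting_forms(toy_rows, include_multiple=True)
print(strict)
print(loose)
```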
Rscript for finding these instances: