lexibank / uralex

UraLex basic vocabulary dataset
Creative Commons Attribution 4.0 International

Code compound words consistently #5

Open lmaurits opened 5 years ago

lmaurits commented 5 years ago

Currently rainbow (and possibly others, but probably not) is coded such that each compound form gets one cognate set for each part of the compound, in order to represent partial cognacy (e.g. Finnish sateenkaari and Karelian ukonkoari, where kaari and koari are related but sateen ("of rain") and ukon ("of thunder") are not).

This is:

  1. Inconsistent with our explanation of how we code cognates in the included documentation.
  2. Inconsistent with how we have coded other compound words with partial cognacy (e.g. vulture, where partial cognacy between Finnish and Estonian is not represented).
  3. Problematic for phylogenetic inference, because it introduces an exceptionally high number of singleton cognate sets, which can skew rate/age estimates.

rainbow should be recoded so that each form is associated with only one cognate set, and only cognacy in both components of compounds counts.

xrotwang commented 5 years ago

The CLDF format would also support partial cognates - but as far as I know, there's no obvious solution yet for how to binarize these cognates, correct, @LinguList?
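
To show what binarizing means here, the following Python sketch does the naive thing and treats every part-level cognate set as its own presence/absence character. This is only an illustration under assumed, invented set IDs and a hypothetical `binarize` helper. It is not an endorsed solution and not part of CLDF or any existing tool.

```python
def binarize(coding):
    """Turn language -> set of cognate set IDs into a 0/1 matrix.

    Returns the sorted list of cognate set IDs (the binary characters)
    and a dict mapping each language to its row of presence values.
    """
    characters = sorted({s for sets in coding.values() for s in sets})
    matrix = {
        language: [1 if s in sets else 0 for s in characters]
        for language, sets in coding.items()
    }
    return characters, matrix


# Hypothetical part-wise coding of rainbow; the IDs are invented.
coding = {
    "Finnish": {"rain_1", "arc_1"},      # sateen + kaari
    "Karelian": {"thunder_1", "arc_1"},  # ukon + koari
}

characters, matrix = binarize(coding)
print(characters)          # ['arc_1', 'rain_1', 'thunder_1']
print(matrix["Finnish"])   # [1, 1, 0]
print(matrix["Karelian"])  # [1, 0, 1]
```

In this toy case two of the three binary characters are present in only one language, which is the kind of singleton character the issue description is concerned about.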

LinguList commented 5 years ago

We are working on an explicit solution, but it requires manual work, the use of a tool like EDICTOR, and that the coders acquaint themselves with the general ideas here. Given the way cognates have been coded in Uralex so far, as far as I have followed it, this does not seem possible at the moment. But if there is a definite interest from Uralex' side in learning how to do consistent root-cognate coding, I'll gladly share the workflow and can also help set up the tools so that the coders can annotate their data properly. This will involve quite some work for the coders, though.