lexibank / pylexibank

The python curation library for lexibank
Apache License 2.0

Cognate sets from TOB provider are not global #202

Closed SimonGreenhill closed 4 years ago

SimonGreenhill commented 4 years ago

The TOB provider does not make cognate sets global (i.e. the cognate set id 1 is reused across parameters). See https://github.com/lexibank/starostinkaren/blob/master/cldf/cognates.csv

I think we've discussed this in the past, but is there a way we can add a lexibank check for this in future (e.g. cognate set ids must be constrained to one parameter)? But this then means we can't have cognates that go across parameters...
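
A minimal sketch of what such a check could look like, not an existing pylexibank feature. It assumes the cognate rows have already been joined with their forms, so that each row carries both a `Cognateset_ID` and a `Parameter_ID`; the function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def cognatesets_spanning_parameters(rows):
    """Return the cognate set IDs that occur with more than one parameter.

    `rows` is an iterable of dicts with "Cognateset_ID" and "Parameter_ID"
    keys (an assumed, pre-joined view of cognates.csv and forms.csv).
    """
    params = defaultdict(set)
    for row in rows:
        params[row["Cognateset_ID"]].add(row["Parameter_ID"])
    # Keep only the IDs that are shared by several parameters.
    return {cid: sorted(p) for cid, p in params.items() if len(p) > 1}


# Made-up rows: cognate set "1" is reused for both "hand" and "foot".
rows = [
    {"Form_ID": "f1", "Parameter_ID": "hand", "Cognateset_ID": "1"},
    {"Form_ID": "f2", "Parameter_ID": "hand", "Cognateset_ID": "1"},
    {"Form_ID": "f3", "Parameter_ID": "foot", "Cognateset_ID": "1"},
    {"Form_ID": "f4", "Parameter_ID": "foot", "Cognateset_ID": "2"},
]
print(cognatesets_spanning_parameters(rows))
```

A non-empty result is either a locally numbered dataset (the TOB case) or genuine cross-semantic cognacy; the check alone cannot tell the two apart.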

xrotwang commented 4 years ago

Ok, the TOB provider needs to be fixed. But do we want datasets to explicitly opt in to cross-semantic cognacy? I think one could argue for this: since it is so uncommon, the explicit flag could also be useful for people re-using the data. OTOH it's another piece of boilerplate.

SimonGreenhill commented 4 years ago

Just about to make a PR for TOB.

I guess the first question is, do we have any datasets with cross-semantic cognates? (do you know of any @LinguList? I don't).

I think cross-semantic cognates are really uncommon so we should be able to assume that cognates are global, otherwise we flag this somehow.

xrotwang commented 4 years ago

Ok, see #203

LinguList commented 4 years ago

> I guess the first question is, do we have any datasets with cross-semantic cognates? (do you know of any @LinguList? I don't).

Important: we need to make cognate sets global, and the non-global case is the exception.

LinguList commented 4 years ago

That means, you cannot check if they are cross-semantic or not, you can only check if they are not cross-semantic.

LinguList commented 4 years ago

And this is undebatable, since we already have cross-semantic datasets, and we even submitted a paper on cross-semantic cognate detection for partial cognates.

LinguList commented 4 years ago

To make cognate sets safely global, one can always make a new ID by combining the local ID with the parameter's ID.

In lingpy, there's a "renumber" function for this case: you give a string ID and you receive a numerical ID. Making a dataset safely global (if the encoding is local) is then:

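>>> from lingpy import Wordlist
>>> from clldutils.misc import slug
>>> wl = Wordlist(...)
>>> # with two source columns, the function receives the row (x) and the column indices (y)
>>> wl.add_entries('cog', 'concept,cognacy', lambda x, y: slug(x[y[0]])+'-'+str(x[y[1]]))
>>> wl.renumber('cog')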

But that's just an FYI.
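
The same idea without lingpy, as a minimal sketch rather than lingpy's actual implementation: build a string key from the parameter and the local cognate set ID, then hand out a fresh integer for each new key. The function name, the dict keys, and the sample rows are assumptions made for illustration.

```python
def make_cognatesets_global(rows):
    """Replace parameter-local cognate set IDs with globally unique integers.

    Each (Parameter_ID, Cognateset_ID) pair gets its own integer, numbered
    in order of first appearance. The input rows are left unmodified.
    """
    global_ids = {}
    result = []
    for row in rows:
        key = f"{row['Parameter_ID']}-{row['Cognateset_ID']}"
        if key not in global_ids:
            global_ids[key] = len(global_ids) + 1
        result.append({**row, "Cognateset_ID": global_ids[key]})
    return result


# Made-up rows where the local ID "1" is reused under two parameters.
local_rows = [
    {"Form_ID": "f1", "Parameter_ID": "hand", "Cognateset_ID": "1"},
    {"Form_ID": "f2", "Parameter_ID": "hand", "Cognateset_ID": "1"},
    {"Form_ID": "f3", "Parameter_ID": "foot", "Cognateset_ID": "1"},
]
for row in make_cognatesets_global(local_rows):
    print(row["Form_ID"], row["Cognateset_ID"])
```

This is only safe when the source encoding really is local: applied to a dataset with genuine cross-semantic cognate sets, it would split them apart.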

SimonGreenhill commented 4 years ago

Ok -- so perhaps the solution is to put this information in the README e.g. change this:

- **Cognacy:** 171 cognates in 39 cognate sets (11 singletons)

to something like:

- **Cognacy:** 171 cognates in 39 cognate sets (11 singletons). Cognates are {global, cross-semantic}.

LinguList commented 4 years ago

Well, cognates are always global; that's the default I'd opt for, and it is our responsibility to guarantee that in CLDF. The cognate IDs are global, but an annotator can choose not to search for cross-semantic cognates. So, since you can always check whether a cognate ID crosses semantic boundaries, you would say the dataset is cross-semantic if one does, and that it is not otherwise, right?