Closed SimonGreenhill closed 4 years ago
Ok, the TOB provider needs to be fixed. But do we want datasets to explicitly opt in to cross-semantic cognacy? One could argue for this: since cross-semantic cognacy is so uncommon, an explicit flag could also be useful for people re-using the data. OTOH, it's another piece of boilerplate.
Just about to make a PR for TOB.
I guess the first question is, do we have any datasets with cross-semantic cognates? (do you know of any @LinguList? I don't).
I think cross-semantic cognates are really uncommon, so we should be able to assume that cognates are global; otherwise we flag this somehow.
Ok, see #203
Important: we need to make cognate sets global, and the non-global case is the exception.
That means you cannot check whether they are cross-semantic or not; you can only check whether they are not cross-semantic.
And this is not up for debate, since we already have cross-semantic datasets, and we even submitted a paper on cross-semantic cognate detection for partial cognates.
To make cognate sets safely global, one can always build a new ID by combining the cognate ID with parameters.id.
In lingpy, there's a "renumber" function for this case: you give it a string ID and you receive a numerical ID. Making a dataset safely global (if the encoding is local) is then:
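>>> from lingpy import Wordlist
>>> from clldutils.misc import slug  # assumption: slug comes from clldutils
>>> wl = Wordlist(...)
>>> # x is the row, y holds the column indices of 'concept' and 'cognacy'
>>> wl.add_entries('cog', 'concept,cognacy', lambda x, y: slug(x[y[0]]) + '-' + str(x[y[1]]))
>>> wl.renumber('cog')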
But that's just an FYI.
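For anyone without lingpy at hand, the same idea can be sketched in plain Python. This is only an illustration of the approach described above, not lingpy's implementation: the function name `make_global_ids` and the `(concept, local_id)` row layout are made up for the example.

```python
def make_global_ids(rows):
    """Map (concept, local cognate id) pairs to globally unique integer IDs.

    Hypothetical helper: rows is an iterable of (concept, local_id) tuples,
    where local_id is only unique within its concept.
    """
    ids = {}   # (concept, local_id) -> global integer ID
    out = []
    for concept, local_id in rows:
        key = (concept, local_id)
        if key not in ids:
            # First time we see this pair: assign the next free integer.
            ids[key] = len(ids) + 1
        out.append(ids[key])
    return out


# Local ID 1 is reused for "hand" and "foot", but gets distinct global IDs.
rows = [("hand", 1), ("hand", 2), ("foot", 1), ("hand", 1)]
global_ids = make_global_ids(rows)
```

Note that this only works when the source encoding really is local; applied to a dataset that already has cross-semantic cognate sets, it would split them apart.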
Ok -- so perhaps the solution is to put this information in the README e.g. change this:
- **Cognacy:** 171 cognates in 39 cognate sets (11 singletons)
to something like:
- **Cognacy:** 171 cognates in 39 cognate sets (11 singletons). Cognates are {global, cross-semantic}.
Well, cognates are always global; that's the default I'd opt for, and it is our responsibility to guarantee that in CLDF. The cognate IDs are global, but an annotator may choose not to search for cross-semantic cognates. So since you can always check whether a cognate ID crosses semantic boundaries, you would then say that it is cross-semantic if it does, and that it is not in the other case, right?
The TOB provider does not make cognate sets global (i.e. the cognate set id 1 is reused across parameters). See https://github.com/lexibank/starostinkaren/blob/master/cldf/cognates.csv
I think we've discussed this in the past, but is there a way we can add a lexibank check for this in future (e.g. cognate set ids must be constrained to one parameter)? But this would then mean we can't have cognates that go across parameters.
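A rough sketch of what such a check could look like, in plain Python rather than against any actual lexibank API. The function name and the `(cognateset_id, parameter_id)` input layout are assumptions for illustration; in practice the pairs would have to be derived by joining the cognates table with the forms table.

```python
from collections import defaultdict


def cognate_sets_spanning_parameters(rows):
    """Return cognate set IDs that occur with more than one parameter.

    Hypothetical helper: rows is an iterable of
    (cognateset_id, parameter_id) tuples.
    """
    params = defaultdict(set)
    for cogset_id, parameter_id in rows:
        params[cogset_id].add(parameter_id)
    # Keep only the sets that cross a parameter boundary.
    return {c: sorted(p) for c, p in params.items() if len(p) > 1}


rows = [("1", "hand"), ("1", "foot"), ("2", "hand")]
flagged = cognate_sets_spanning_parameters(rows)
```

The check cannot tell a reused local ID apart from a genuine cross-semantic cognate set, so its output is better treated as a warning to review than as an error.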