lexibank / pylexibank

The python curation library for lexibank
Apache License 2.0

Concept ALIAS column #222

Closed: chrzyki closed this issue 4 years ago

chrzyki commented 4 years ago

Please see the discussion here:

https://github.com/lexibank/marrisonnaga/issues/21

Sometimes there is a mismatch between the digitised versions of lists (e.g. as available on STEDT) and the original source material. Using the digitised version makes things easier for the Lexibank workflow, but may result in mismatches such as those outlined in the marrisonnaga issue. @LinguList's proposal for an ALIAS column seems good to me.

Do you have any preferences as to how exactly this should be handled? I'm aware that this is also related to how we handle concept lists in concepticon-data, but since it mainly concerns the mappings of Lexibank datasets, I'm opening the issue here.

xrotwang commented 4 years ago

I don't fully understand. What kind of support is needed in pylexibank? As far as I can tell, functions passed in to add_concepts have access to all data in the concepticon concept list - so isn't this just a question of completing the data in Concepticon?
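
To illustrate the point, here is a minimal sketch of a lookup function that can see every column of a concept row. It deliberately does not import pylexibank: `ConceptRow`, `build_lookup`, `lookup_by_source_label`, the `alias` key and the sample rows are all hypothetical stand-ins, and the objects that `add_concepts` actually passes to such functions may expose their data differently.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptRow:
    """Hypothetical stand-in for one row of a Concepticon concept list."""
    number: str
    english: str
    attributes: dict = field(default_factory=dict)  # any extra columns


def lookup_by_source_label(concept):
    # Prefer an extra column (assumed here to be called "alias") when it is
    # present and non-empty, otherwise fall back to the English gloss.
    return concept.attributes.get("alias") or concept.english


def build_lookup(concepts, lookup_factory):
    # Map whatever key the factory returns to the concept's number.
    return {lookup_factory(c): c.number for c in concepts}


concepts = [
    ConceptRow("1", "hand", {"alias": "hand/arm"}),  # invented sample data
    ConceptRow("2", "water"),
]
lookup = build_lookup(concepts, lookup_by_source_label)
print(lookup)
```

If the extra column is already in the concept list, the dataset script only needs to choose which label to key on.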

chrzyki commented 4 years ago

I agree - I don't think there is any particular need for code in here that handles this, but I was thinking that this might warrant a small discussion concerning 'best practices' for this purpose, i.e. "if need be, try calling the column xyz". The issue might be a better fit for lexibank/lexibank.

xrotwang commented 4 years ago

Or even an issue for concepticon/concepticon-data? But I'm not really sure this kind of special case would require some standardized alternative_label column. We are using all kinds of alternative labels already - numbers, local identifiers, etc.

xrotwang commented 4 years ago

So if anything, I'd say this is a documentation issue for the particular dataset. And dealing with it by putting a comment above the add_concepts line in lexibank_*.py is enough?

xrotwang commented 4 years ago

On second thought, maybe the documentation should go into NOTES.md. Something along the lines of

... while the published concept list uses the labels such-and-such, the actual wordlists use slightly different labels ...

I don't think we have to go all the way and formalize this into a ConceptSpec, which documents lookup in Concepticon.

chrzyki commented 4 years ago

Yes, that sounds good - thanks for the input. @LinguList do you agree with this? If so, we can close this and keep the issue for reference.

LinguList commented 4 years ago

Yes, I fully agree. It is not a pylexibank issue, but a more general question of how to handle data that reaches Lexibank through a secondary source. This happens in several cases. The relations concepticon -> source and concepticon -> digital source are both stable, but we prefer to highlight the relation concepticon -> source. It is therefore useful to document this as a best-practice example, with the (potentially comma-separated) list of alias concepts in the ALIAS column and the intended target concept in GLOSS, ENGLISH, or the corresponding column for any other language.
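
A minimal sketch of how such a comma-separated ALIAS column could be expanded into a lookup table, using only the Python standard library. Only the column names ALIAS and ENGLISH come from this discussion; the ID column, the function name `alias_lookup` and the sample rows are assumptions made for illustration, not part of any existing dataset or of pylexibank.

```python
import csv
import io

# Invented sample concept list (tab-separated); real lists will differ.
CONCEPTS_TSV = (
    "ID\tENGLISH\tALIAS\n"
    "1\thand\thand/arm, arm\n"
    "2\twater\t\n"
)


def alias_lookup(tsv_text):
    """Map the target gloss and every alias to the concept ID."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        labels = [row["ENGLISH"]]
        # ALIAS may be empty or hold several comma-separated labels.
        labels += [a.strip() for a in (row["ALIAS"] or "").split(",") if a.strip()]
        for label in labels:
            # Refuse to silently map one label to two different concepts.
            if label in lookup and lookup[label] != row["ID"]:
                raise ValueError(f"ambiguous label: {label!r}")
            lookup[label] = row["ID"]
    return lookup


lookup = alias_lookup(CONCEPTS_TSV)
print(lookup["arm"])
```

With a table like this, labels from the digitised wordlists and labels from the original source resolve to the same concept. One caveat of this sketch: an alias that itself contains a comma would be split incorrectly, so a different separator might be preferable in that case.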

chrzyki commented 4 years ago

See https://github.com/lexibank/lexibank/pull/238.