lexibank / abvd

CLDF dataset derived from Greenhill et al.'s "Austronesian Basic Vocabulary Database" from 2020.
https://abvd.eva.mpg.de
Creative Commons Attribution 4.0 International

forms.csv Cognacy column has floats/strings, lingpy expects ints #2

Closed chrzyki closed 4 years ago

chrzyki commented 5 years ago
lingpy/basictypes.py in <lambda>(x)
     29         list.__setitem__(self, index, self._type(item))
     30 
---> 31 integer = lambda x: int(x) if x else 0
     32 strings = partial(_strings, str)
     33 ints = partial(_strings, int)

ValueError: invalid literal for int() with base 10: '1,64'

ABVD can't be imported with lingpy because the cognacy column has invalid values, amongst them: 29?, 1,83, etc. Thanks to @konstantinhoffmann for pointing this out.
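
To make the failure concrete, here is a minimal sketch in plain Python (no lingpy needed) that applies the same conversion as the `integer` lambda in the traceback to a handful of values. Three of the sample values are the ones quoted in this issue; the empty string and `12` are added for contrast, and `fails_int_conversion` is a hypothetical helper name.

```python
# Same conversion as the `integer` lambda shown in the traceback above.
integer = lambda x: int(x) if x else 0


def fails_int_conversion(value):
    """Return True if `value` cannot be converted by `integer`."""
    try:
        integer(value)
    except ValueError:
        return True
    return False


# Three values quoted in this issue, plus an empty and a clean one.
samples = ["1,64", "29?", "1,83", "", "12"]
bad = [v for v in samples if fails_int_conversion(v)]
print(bad)
```

Empty values pass because the lambda maps them to 0; only multi-valued (`1,64`) and uncertain (`29?`) entries trip the conversion.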

LinguList commented 5 years ago

Yes, the typical cogid is an integer, for good reasons. However, there's an easy workaround, and you should use int ANYWAY, as abvd has "local" cognates! See our tutorial, where we also refine them explicitly (lingpy tutorial in the Journal of Language Evolution from 2018).

>>> wl = Wordlist.from_cldf(...) # you know how you import here
>>> wl.add_entries('cog', 'concept','cognacy', lambda x,y: x[y[0]]+'-'+x[y[1]])
>>> wl.renumber('cog')

This will create a global cognate id, numeric, from the entry "cog", so then it should work.
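
The idea behind the two calls above can be sketched without lingpy: prefix each local cognate label with its concept, then number the resulting strings. This only illustrates the idea and is not lingpy's implementation; the rows and the helper name `global_cogids` are made up.

```python
def global_cogids(rows):
    """Map (concept, local cognacy) pairs to global integer ids.

    `rows` is a list of (concept, cognacy) string pairs. Identical
    pairs share an id; ids start at 1 in order of first appearance.
    """
    ids = {}
    result = []
    for concept, cognacy in rows:
        key = concept + "-" + cognacy
        if key not in ids:
            ids[key] = len(ids) + 1
        result.append(ids[key])
    return result


# Made-up rows: cognacy "1" under "hand" and under "all" must not be merged.
rows = [("hand", "1"), ("hand", "1"), ("all", "1"), ("hand", "2")]
print(global_cogids(rows))
```

Without the concept prefix, the local label `1` for "hand" and the local label `1` for "all" would end up in the same cognate set.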

LinguList commented 5 years ago

But please, @KonstantinHoffmann, could you show me the code you used to import abvd in the end? I mean, you can easily tweak the cldf import as well, we just changed that.

SimonGreenhill commented 5 years ago

Ahh, I guess it won't load at all because forms.csv has a cognate column (which has non-ints in it, and is therefore raising that error).

LinguList commented 5 years ago

Okay, I'm working on this; instructions on how to circumvent it will follow soon. No bug on either side.

chrzyki commented 5 years ago

Sorry, has been a while since I used that and got confused by the content of the Cognacy column.

SimonGreenhill commented 5 years ago

Hmm, we should make the abvd cldf output use global cognate ids though, right?

xrotwang commented 5 years ago

But the cognacy column is basically what Value is to Form: only relaying the information as in the source. The actual, cleaned up cognacy relations are in the CognateTable, with global Cognateset_IDs.

LinguList commented 5 years ago
from lingpy import Wordlist
from lexibank_abvd import Dataset as ABVDDataset

# load abvd
wl = Wordlist.from_cldf(
    ABVDDataset().cldf_dir.joinpath('cldf-metadata.json'),
    columns=('parameter_id', 'concept_name', 'language_id', 'language_name', 'value', 'form', 'segments', 'language_glottocode', 'concept_concepticon_id', 'cognacy'),
    namespace=(('concept_name', 'concept'), ('language_id', 'doculect'), ('segments', 'tokens'), ('language_glottocode', 'glottolog'), ('concept_concepticon_id', 'concepticon')))

# convert cognacy to a global cogid apt for lingpy; a bit complicated, due to
# missing values and ABVD-specific (per-concept) cognate ids
C, CH = {}, {}  # C: row index -> global cogid, CH: 'concept-cognacy' -> global cogid
cogid = 1
for idx, cognacy in wl.iter_rows('cognacy'):
    if not cognacy:
        C[idx] = 0  # no cognate judgment for this row
    else:
        # only the first of several comma-separated cognate ids is used
        tmp = wl[idx, 'concept'] + '-' + cognacy.split(',')[0]
        if tmp in CH:
            C[idx] = CH[tmp]
        else:
            C[idx] = cogid
            CH[tmp] = cogid
            cogid += 1
wl.add_entries('cogid', C, lambda x: x)
xrotwang commented 5 years ago

Using the cognacy column in forms.csv is ignoring (or duplicating) all the work being done on parsing the raw info here: https://github.com/lexibank/pylexibank/blob/00f86c8cb1990df0ee6e567577daaacd6b880e0a/src/pylexibank/providers/abvd.py#L206-L240

LinguList commented 5 years ago

I guess the Wordlist.from_cldf can also load the "real" cognates. But I'd have to check in the function to see what commands to give for that.

LinguList commented 5 years ago

However, does the lexibank code actually check for commas in the cognates, for uncertain cognates, etc.? This would be an important question, as otherwise, lexibank code should be updated. And also: what should happen with uncertainties, etc.?

LinguList commented 5 years ago

Okay, got it, @KonstantinHoffmann, @chrzyki, also important for our MIT tutorial:

In [6]: wl = Wordlist.from_cldf('cldf/cldf-metadata.json', columns=( 
   ...:                 'parameter_id', 
   ...:                 'concept_name', 
   ...:                 'language_id', 
   ...:                 'language_name', 
   ...:                 'value', 
   ...:                 'form', 
   ...:                 'segments', 
   ...:                 'language_glottocode', 
   ...:                 'concept_concepticon_id', 
   ...:                 'language_latitude', 
   ...:                 'language_longitude', 
   ...:                 'cognacy', 
   ...:                 'cogid_id', 
   ...:                 ), 
   ...:             namespace=( 
   ...:                ('concept_name', 'concept'), 
   ...:                ('language_id', 'doculect'), 
   ...:                ('segments', 'tokens'), 
   ...:                ('language_glottocode', 'glottolog'), 
   ...:                ('concept_concepticon_id', 'concepticon'), 
   ...:                ('language_latitude', 'latitude'), 
   ...:                ('language_longitude', 'longitude'), 
   ...:                ('cognacy', 'cognacy'), 
   ...:                ('cogid_id', 'cog') 
   ...:                )) 
In [5]: wl.renumber('cog')   
LinguList commented 5 years ago

This can be simplified, but I don't have the time to do that, and there's too much variation. So we say: be explicit rather than implicit, and leave the code as is. But @chrzyki, you should include exactly THAT code in our tutorial, if you're working on that, as this also clarifies the interaction.

xrotwang commented 5 years ago

@LinguList yes, uncertainty is handled rather transparently: https://github.com/lexibank/pylexibank/blob/00f86c8cb1990df0ee6e567577daaacd6b880e0a/src/pylexibank/providers/abvd.py#L237-L240

yes, commas are handled: https://github.com/lexibank/pylexibank/blob/00f86c8cb1990df0ee6e567577daaacd6b880e0a/src/pylexibank/providers/abvd.py#L138-L141

xrotwang commented 5 years ago

As far as I'm concerned, tutorials should use data from the CognateTable rather than from the short-cut, non-standard cognacy column in FormTable.

LinguList commented 5 years ago

Okay, then the solution above is in fact the preferred one. I agree with you, @xrotwang, but I would not make reading cognates from cldf the default, as the majority of datasets don't have them either... so I'd leave lingpy as it is, and we should use the lengthy instructions as to which columns to import and how to namespace them explicitly.

xrotwang commented 5 years ago

@LinguList I don't really understand. As far as I can tell, the code you give above still does not look up Cognateset_ID from CognateTable - which it would need to join on Form_ID. This isn't terribly difficult, e.g. using csvkit it's just a csvjoin -c ID,Form_ID forms.csv cognates.csv. We should not promote short-cuts or non-standard conventions in tutorials, when what we want is to lead people to best practices.
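
The join described here can also be sketched in plain Python with the standard `csv` module. The column names `ID`, `Form_ID` and `Cognateset_ID` come from the comment above; the `Form` column, the in-memory sample data and the helper name `join_cognates` are hypothetical stand-ins for the real `forms.csv` and `cognates.csv`.

```python
import csv
import io


def join_cognates(forms_file, cognates_file):
    """Attach Cognateset_ID values from cognate rows to form rows via Form_ID."""
    cogsets = {}
    for row in csv.DictReader(cognates_file):
        cogsets.setdefault(row["Form_ID"], []).append(row["Cognateset_ID"])
    joined = []
    for form in csv.DictReader(forms_file):
        # A form may belong to zero, one, or several cognate sets.
        joined.append((form["ID"], form["Form"], cogsets.get(form["ID"], [])))
    return joined


# Made-up sample data; with real files, pass open file handles instead.
forms = io.StringIO("ID,Form\nf1,lima\nf2,rima\nf3,tangan\n")
cognates = io.StringIO("ID,Form_ID,Cognateset_ID\nc1,f1,hand-1\nc2,f2,hand-1\n")
for row in join_cognates(forms, cognates):
    print(row)
```

Unlike the default inner join of csvjoin, this sketch keeps forms that have no cognate judgment and gives them an empty list, which makes missing data explicit.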

LinguList commented 5 years ago

LingPy should do it. I have to admit, I find that part of the lingpy code highly cryptic, but @Anaphory wrote it specifically by merging the different tables, as far as I can tell. What I do NOT know, and where I'd appreciate if we could work on this, is to which degree the code can read the partial cognates.

LinguList commented 5 years ago

In fact cogid_id is the normal namespace that @Anaphory gave for the conversion, so if you pass this as argument, LingPy pulls that column, and it is already linked.

Given that we still have some time for the tutorial, it may be good to check these things in LingPy, as we can then make a new version also on pip that actually conforms to a behavior with which we would be content?

LinguList commented 5 years ago

https://github.com/lingpy/lingpy/blob/89a7cd8e3fae71c807191ce1b1d355e48a39d1ae/lingpy/basic/wordlist.py#L1138-L1144

xrotwang commented 5 years ago

Ah, ok. Have to admit to not looking at the LingPy code either. But if tables are merged somewhere, that explains your code snippets somewhat. I still don't see Cognateset_ID or cognatesetReference anywhere, which would be the standard column to map to cog_id.

LinguList commented 5 years ago

Oh, it is done in fact, but quite cryptic to read. Gereon gave namespaces like: concept_* for the parameters table, and then the value there, language_* for languages, and cogid_* for the cognates. So when I pull cogid_id as a column, I pull exactly Cognateset_ID, as this is referenced as ID in the cognates.csv, right?

LinguList commented 5 years ago

Or maybe not. @KonstantinHoffmann or @chrzyki could you try to replace 'cogid_id' in the code by cogid_cognateset_id to see if the ID from cognatesets.csv is actually pulled? And is this global in abvd, or global per default, @xrotwang?

LinguList commented 5 years ago

Yep, I just confirmed it: if you want the cognateset_id, you need to use cogid_cognateset_id. So the example above needs to be replaced. I'll make a PR for lingpy that facilitates the import with namespaces later.

xrotwang commented 5 years ago

cognatesets.csv is optional. In fact, ABVD does not have it, see https://github.com/lexibank/abvd/tree/master/cldf So the correct thing to do is use cognatesetReference or Cognateset_ID from CognateTable - which by default is called cognates.csv.

chrzyki commented 5 years ago

Thanks for investigating and discussing, @xrotwang and @LinguList - anything you need me to do? (sounded like you already tested what you wanted to see tested above, @LinguList)

SimonGreenhill commented 5 years ago

Hmm. Would it be better to have a cldf exporter (perhaps living in pycldf) that can generate outputs for other programs? - cldf.to-lingpy(), cldf.to-nexus()?

xrotwang commented 5 years ago

@SimonGreenhill Maybe - maybe - in a separate package. But generally, this approach doesn't scale. And for LingPy in particular, I'd argue that CLDF should become its default input format - although I admit that it's mostly my fault that this is not the case already ...

LinguList commented 5 years ago

Thanks for investigating and discussing, @xrotwang and @LinguList - anything you need me to do? (sounded like you already tested what you wanted to see tested above, @LinguList)

Yes, please make sure to describe this function in the tutorial for our draft, so that it is well explained. If you can actually check the very function, this may also be helpful, as it is always good if more than one person checks a piece of code.

LinguList commented 5 years ago

CLDF should become its default input format - although I admit that it's mostly my fault that this is not the case already ...

There are of course complications for this, but I am now testing the cldf import frequently, although I store immediate results in wordlists, as specifically the OUTPUT is not fully supported yet. And what is useful here in lingpy is the fact that it's just giving you all flexibility to add random columns to the simple tsv, which I expect may need to be addressed explicitly also in cldf export and import.

SimonGreenhill commented 5 years ago

ok, but we still need to give ABVD global cognates to follow lexibank guidelines.

@LinguList - can lingpy handle lexibank style global cognates of the form "all-1", "hand-2", etc?

LinguList commented 5 years ago

There's a simple command, called "renumber", usage is:

Wordlist.renumber('column')

By default, it takes the strings from column 'column' and makes them numeric (global), creating a new column with name 'column'+'id'.

LinguList commented 5 years ago

Essentially what my code example above contains:

In [5]: wl.renumber('cog')  

so 'cog' creates new column 'cogid'

LinguList commented 5 years ago

Ah, important: singletons that are "none" are also considered problematic. Although lingpy now accepts '0' as a cognate id that is unprocessed, I'd recommend to really name different cognates distinctly to be on the safe side here. Otherwise, you have 'hand-none', 'hand-none', and the renumbering will treat it as cognate...
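
One way to avoid this is to give every unjudged form its own label before renumbering, assuming missing judgments show up as the literal string 'none' (as in the example above). This is a sketch only; the helper name `distinct_singletons` and the label scheme are hypothetical.

```python
def distinct_singletons(labels, missing="none"):
    """Give each label whose cognate part equals `missing` a unique suffix.

    `labels` are strings such as 'hand-1' or 'hand-none'. Labels with a
    real judgment are returned unchanged.
    """
    out = []
    counter = 0
    for label in labels:
        concept, _, cog = label.rpartition("-")
        if cog == missing:
            counter += 1
            label = "{}-{}{}".format(concept, missing, counter)
        out.append(label)
    return out


labels = ["hand-1", "hand-none", "hand-none", "all-none"]
print(distinct_singletons(labels))
```

After this step, no two unjudged forms share a label, so a later renumbering cannot group them into one cognate set.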

SimonGreenhill commented 4 years ago

Closing as this is no longer valid.