Closed chrzyki closed 4 years ago
Yes, the typical cogid is an integer, for good reasons. However, there's an easy workaround, and you should use int ANYWAY, as ABVD has "local" cognates! See our tutorial, where we also refine them explicitly (LingPy tutorial in the Journal of Language Evolution from 2018).
>>> wl = Wordlist.from_cldf(...) # you know how you import here
>>> wl.add_entries('cog', 'concept,cognacy', lambda x, y: x[y[0]]+'-'+x[y[1]])
>>> wl.renumber('cog')
This will create a global cognate id, numeric, from the entry "cog", so then it should work.
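To make the idea concrete, here is a minimal sketch in plain Python (no lingpy required) of what the two steps amount to: prefix each local cognate ID with its concept, then map the resulting strings to integers. The sample rows and the helper name globalize are made up for illustration; this is not LingPy's actual implementation.

```python
def globalize(rows):
    """Map (concept, local_cognacy) pairs to global integer cognate IDs."""
    seen = {}      # "concept-cognacy" string -> global integer ID
    result = []
    for concept, cognacy in rows:
        key = concept + '-' + cognacy
        if key not in seen:
            # first time we meet this concept/cognacy pair: assign the next ID
            seen[key] = len(seen) + 1
        result.append(seen[key])
    return result


# hypothetical sample data: local ID "1" is reused for two different concepts
rows = [('hand', '1'), ('hand', '1'), ('hand', '2'), ('foot', '1')]
cogids = globalize(rows)
print(cogids)
```

The point of the concept prefix is that the local ID "1" under "foot" must not end up in the same set as the local ID "1" under "hand".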
But please, @KonstantinHoffmann, could you show me the code you used to import abvd in the end? I mean, you can easily tweak the cldf import as well, we just changed that.
Ahh, I guess it won't load at all because forms.csv has a cognate column (which has non-ints in it, and is therefore raising that error).
Okay, I'm working on this; instructions on how to circumvent it will follow soon. No bug on either side.
Sorry, has been a while since I used that and got confused by the content of the Cognacy column.
Hmm, we should make the abvd cldf output use global cognate ids though, right?
But the cognacy column is basically what Value is to Form: only relaying the information as in the source. The actual, cleaned-up cognacy relations are in the CognateTable, with global Cognateset_IDs.
from lingpy import *
from lexibank_abvd import Dataset as ABVDDataset

# load abvd
wl = Wordlist.from_cldf(
    ABVDDataset().cldf_dir.joinpath('cldf-metadata.json'),
    columns=('parameter_id', 'concept_name', 'language_id', 'language_name', 'value', 'form', 'segments', 'language_glottocode', 'concept_concepticon_id', 'cognacy'),
    namespace=(('concept_name', 'concept'), ('language_id', 'doculect'), ('segments', 'tokens'), ('language_glottocode', 'glottolog'), ('concept_concepticon_id', 'concepticon')))

# convert cognacy to a global cogid apt for lingpy; a bit complicated,
# due to missing values and ABVD-specific cognate values
C, CH = {}, {}  # C: row index -> cogid, CH: "concept-cognacy" -> cogid
cogid = 1
for idx, cognacy in wl.iter_rows('cognacy'):
    if not cognacy:
        # missing cognate judgment: 0 marks the entry as unprocessed
        C[idx] = 0
    else:
        # only the first of several comma-separated cognate IDs is used
        tmp = wl[idx, 'concept'] + '-' + cognacy.split(',')[0]
        if tmp in CH:
            C[idx] = CH[tmp]
        else:
            C[idx] = cogid
            CH[tmp] = cogid
            cogid += 1
wl.add_entries('cogid', C, lambda x: x)
Using the cognacy column in forms.csv is ignoring (or duplicating) all the work being done on parsing the raw info here:
https://github.com/lexibank/pylexibank/blob/00f86c8cb1990df0ee6e567577daaacd6b880e0a/src/pylexibank/providers/abvd.py#L206-L240
I guess Wordlist.from_cldf can also load the "real" cognates. But I'd have to check in the function to see what commands to give for that.
However, does the lexibank code actually check for commas in the cognates, for uncertain cognates, etc.? This would be an important question, as otherwise the lexibank code should be updated. And also: what should happen with uncertainties, etc.?
Okay, got it, @KonstantinHoffmann, @chrzyki, also important for our MIT tutorial:
In [6]: wl = Wordlist.from_cldf('cldf/cldf-metadata.json', columns=(
...: 'parameter_id',
...: 'concept_name',
...: 'language_id',
...: 'language_name',
...: 'value',
...: 'form',
...: 'segments',
...: 'language_glottocode',
...: 'concept_concepticon_id',
...: 'language_latitude',
...: 'language_longitude',
...: 'cognacy',
...: 'cogid_id',
...: ),
...: namespace=(
...: ('concept_name', 'concept'),
...: ('language_id', 'doculect'),
...: ('segments', 'tokens'),
...: ('language_glottocode', 'glottolog'),
...: ('concept_concepticon_id', 'concepticon'),
...: ('language_latitude', 'latitude'),
...: ('language_longitude', 'longitude'),
...: ('cognacy', 'cognacy'),
...: ('cogid_id', 'cog')
...: ))
In [5]: wl.renumber('cog')
This can be simplified, but I don't have the time to do that, and there's too much variation, so we say: be explicit rather than implicit, and leave the code as is. But @chrzyki, you should include exactly THAT code in our tutorial, if you're working on that, as this also clarifies the interaction.
@LinguList yes, uncertainty is handled rather transparently: https://github.com/lexibank/pylexibank/blob/00f86c8cb1990df0ee6e567577daaacd6b880e0a/src/pylexibank/providers/abvd.py#L237-L240
yes, commas are handled: https://github.com/lexibank/pylexibank/blob/00f86c8cb1990df0ee6e567577daaacd6b880e0a/src/pylexibank/providers/abvd.py#L138-L141
As far as I'm concerned, tutorials should use data from the CognateTable rather than from the short-cut, non-standard cognacy column in FormTable.
Okay, then the solution above is in fact the preferred one. I agree with you, @xrotwang, but I would not make reading cognates from CLDF the default, as the majority of datasets don't have them either... so I'd leave lingpy as is, and we should use the lengthy instructions as to which columns to import and how to namespace them explicitly.
@LinguList I don't really understand. As far as I can tell, the code you give above still does not look up Cognateset_ID from CognateTable - which it would need to join on Form_ID. This isn't terribly difficult, e.g. using csvkit it's just a csvjoin -c ID,Form_ID forms.csv cognates.csv. We should not promote short-cuts or non-standard conventions in tutorials, when what we want is to lead people to best practices.
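For readers without csvkit, the same kind of join can be sketched with Python's standard library. The tiny in-memory tables below are invented for illustration; only the column names ID, Form_ID and Cognateset_ID follow the CLDF names mentioned in this thread.

```python
import csv
import io

# hypothetical stand-ins for forms.csv and cognates.csv
forms_csv = "ID,Parameter_ID,Form\nf1,hand,lima\nf2,hand,rima\nf3,foot,kaki\n"
cognates_csv = "ID,Form_ID,Cognateset_ID\nc1,f1,hand-1\nc2,f2,hand-1\n"


def join_cognates(forms_file, cognates_file):
    """Attach Cognateset_ID to each form row by joining forms.ID = cognates.Form_ID."""
    form_to_cogset = {
        row['Form_ID']: row['Cognateset_ID']
        for row in csv.DictReader(cognates_file)
    }
    joined = []
    for row in csv.DictReader(forms_file):
        # forms without a cognate judgment keep an empty Cognateset_ID
        row['Cognateset_ID'] = form_to_cogset.get(row['ID'], '')
        joined.append(row)
    return joined


joined = join_cognates(io.StringIO(forms_csv), io.StringIO(cognates_csv))
for row in joined:
    print(row['ID'], row['Form'], row['Cognateset_ID'])
```

Two simplifications to keep in mind: this sketch keeps forms that have no cognate judgment (a left join, whereas csvjoin without extra options drops unmatched rows), and it assumes each form belongs to at most one cognate set, so a second judgment for the same form would overwrite the first.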
LingPy should do it. I have to admit, I find that part of the lingpy code highly cryptic, but as far as I can tell, @Anaphory wrote it specifically to merge the different tables. What I do NOT know, and would appreciate if we could work on, is to which degree the code can read the partial cognates.
In fact cogid_id is the normal namespace that @Anaphory gave for the conversion, so if you pass this as argument, LingPy pulls that column, and it is already linked.
Given that we still have some time for the tutorial, it may be good to check these things in LingPy, as we can then make a new version also on pip that actually conforms to a behavior with which we would be content?
Ah, ok. Have to admit to not looking at the LingPy code either. But if tables are merged somewhere, that explains your code snippets somewhat. I still don't see Cognateset_ID or cognatesetReference anywhere, which would be the standard column to map to cog_id.
Oh, it is done in fact, but quite cryptic to read. Gereon gave namespaces like concept_* for the parameters table (followed by the column name there), language_* for languages, and cogid_* for the cognates. So when I pull cogid_id as a column, I pull exactly Cognateset_ID, as this is referenced as ID in the cognates.csv, right?
Or maybe not. @KonstantinHoffmann or @chrzyki could you try to replace 'cogid_id' in the code by cogid_cognateset_id to see if the ID from cognatesets.csv is actually pulled? And is this global in abvd, or global per default, @xrotwang?
Yep, I just confirmed it: if you want the cognateset_id, you need to use cogid_cognateset_id. So the example above needs to be replaced. I'll make a PR for lingpy that facilitates the import with namespaces later.
cognatesets.csv is optional. In fact, ABVD does not have it, see https://github.com/lexibank/abvd/tree/master/cldf
So the correct thing to do is use cognatesetReference or Cognateset_ID from CognateTable - which by default is called cognates.csv.
Thanks for investigating and discussing, @xrotwang and @LinguList - anything you need me to do? (sounded like you already tested what you wanted to see tested above, @LinguList)
Hmm. Would it be better to have a cldf exporter (perhaps living in pycldf) that can generate outputs for other programs? - cldf.to-lingpy(), cldf.to-nexus()?
@SimonGreenhill Maybe - maybe - in a separate package. But generally, this approach doesn't scale. And for LingPy in particular, I'd argue that CLDF should become its default input format - although I admit that it's mostly my fault that this is not the case already ...
> Thanks for investigating and discussing, @xrotwang and @LinguList - anything you need me to do? (sounded like you already tested what you wanted to see tested above, @LinguList)
Yes, please make sure to describe this function in the tutorial for our draft, so that it is well explained. If you can actually check the very function, this may also be helpful, as it is always good if more than one person checks a piece of code.
> CLDF should become its default input format - although I admit that it's mostly my fault that this is not the case already ...
There are of course complications for this, but I am now testing the CLDF import frequently, although I store immediate results in wordlists, as specifically the OUTPUT is not fully supported yet. And what is useful here in lingpy is the fact that it gives you all the flexibility to add random columns to the simple tsv, which I expect may need to be addressed explicitly also in CLDF export and import.
ok, but we still need to give ABVD global cognates to follow lexibank guidelines.
@LinguList - can lingpy handle lexibank style global cognates of the form "all-1", "hand-2", etc?
There's a simple command, called "renumber", usage is:
Wordlist.renumber('column')
Per default, it takes the strings from column 'column' and makes them numeric (global), creating a new column with name 'column'+'id'.
Essentially what my code example above contains:
In [5]: wl.renumber('cog')
so 'cog' creates new column 'cogid'
Ah, important: singletons that are "none" are also considered problematic. Although lingpy now accepts '0' as a cognate id that is unprocessed, I'd recommend to really name different cognates distinctly to be on the safe side here. Otherwise, you have 'hand-none', 'hand-none', and the renumbering will treat them as cognate...
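The pitfall can be illustrated with a small, lingpy-independent sketch: a renumbering that gives every missing judgment its own ID, so that two 'hand-none' entries do not collapse into one cognate set. The function name renumber_safe and the set of "missing" markers are assumptions made for this example; this is not what Wordlist.renumber does internally.

```python
def renumber_safe(labels, missing=('', 'none')):
    """Assign integer IDs to cognate labels; missing judgments become singletons."""
    seen = {}      # label -> integer ID for real cognate sets
    next_id = 1
    ids = []
    for label in labels:
        # the part after the last hyphen is the local cognate marker
        local = label.rsplit('-', 1)[-1].lower()
        if local in missing:
            # never reuse an ID for a missing judgment
            ids.append(next_id)
            next_id += 1
        elif label in seen:
            ids.append(seen[label])
        else:
            seen[label] = next_id
            ids.append(next_id)
            next_id += 1
    return ids


labels = ['hand-1', 'hand-none', 'hand-none', 'hand-1']
ids = renumber_safe(labels)
print(ids)
```

With a naive string-to-integer mapping, both 'hand-none' rows would share one ID and be treated as cognate; here they end up in separate singleton sets, while the two 'hand-1' rows still share one.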
Closing as this is no longer valid.
ABVD can't be imported with lingpy because the cognacy column has invalid values, amongst them: "29?", "1,83", etc. Thanks to @konstantinhoffmann for pointing this out.
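As a rough illustration of what such values require before they can become integer cogids, here is a minimal parsing sketch: split on commas and treat a trailing question mark as a doubt marker. Both the function name parse_cognacy and that reading of "?" are assumptions for this example; the parsing actually used for ABVD lives in the pylexibank code linked earlier in the thread.

```python
def parse_cognacy(raw):
    """Split a raw cognacy string into (set_label, is_doubtful) pairs."""
    parsed = []
    for part in raw.split(','):
        part = part.strip()
        if not part:
            continue  # skip empty pieces, e.g. from a trailing comma
        doubtful = part.endswith('?')
        parsed.append((part.rstrip('?'), doubtful))
    return parsed


print(parse_cognacy('29?'))
print(parse_cognacy('1,83'))
```

A value such as "1,83" yields two cognate set memberships for one form, which is exactly why a single integer column cannot hold it without a decision on which membership to keep.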