If we agree in principle on this workflow, I'd hand over the code and the details and turn to finalizing the individual CLDF datasets.
Just a quick note: there seem to be both an old and a new version of the script in this branch.
I took the liberty of deleting the old one – hopefully in a way that gives us a contiguous git history (and a readable diff).
Yes, @johenglisch, you are right. I should have just renamed the old one instead, sorry!
Perfect, I just saw that you did exactly that, thanks!
I also fixed some places in the code that referenced the old Python file.
I'm a bit worried about the `Wordlist` line right here:
I remember that in Annika's dataset, loading all datasets into memory at once like that completely ate up all the RAM (and that was just a subset of what Lexibank Analysed is dealing with). That's why we ended up loading the data one contribution at a time:
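Roughly this pattern (just a sketch to illustrate the idea; the paths and the `add_language` stand-in are placeholders, not the actual code from that dataset):

```python
from pycldf import Dataset


def add_language(language):
    """Stand-in for the writer's per-language step (placeholder)."""
    print(language.id)


# placeholder paths; in practice these come from the downloaded raw/ directories
metadata_files = [
    "raw/dataset-a/cldf/cldf-metadata.json",
    "raw/dataset-b/cldf/cldf-metadata.json",
]

for path in metadata_files:
    ds = Dataset.from_metadata(path)  # load ONE contribution at a time
    for language in ds.objects("LanguageTable"):
        add_language(language)
    # nothing from this dataset is kept around, so memory use stays bounded per contribution
```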
Yes, it can be done independently, I just did not know how to tweak the `_add_language` function for my needs, since this is just loading the data and then outputting it. So modification suggestions or follow-up PRs are very welcome ;)
The question, @johenglisch, is whether we run into memory problems in the end. But right now, the original code reads -- if I understand it correctly -- the data multiple times (LexiCore, ClicsCore), although theoretically this would not be needed (?).
Okay, I just did a quick experiment (which didn't turn out to be that quick in the end (<_<)" ). I ran the program on the whole Lexibank set instead of just the dev sample (assuming that's what this script is intended for eventually?):
$ cp etc/lexibank.csv etc/lexibank-dev.csv
$ cldfbench download lexibank_lexibank_analysed.py
$ cldfbench lexibank.makecldf lexibank_lexibank_analysed.py
On the first try, the script ran into an exception:
[...]
File "./lexibank_lexibank_analysed.py", line 385, in cmd_makecldf
writer.add_concept(
File "ENV/lib/python3.8/site-packages/pylexibank/cldf.py", line 327, in add_concept
raise ValueError('Concepticon ID / Gloss mismatch %s != %s' % (
ValueError: Concepticon ID / Gloss mismatch KID != YOUNG GOAT (KID)
I just quickly wrapped the line in a Gotta Catch 'em All (a catch-all `try`/`except`) and moved on. Also, I manually dereferenced the `Wordlist` object and triggered the gc, to at least give the program a chance:
args.log.info("GARBAGE COLLECT")
wl = None
import gc
args.log.info("COLLECTED {} THINGS".format(gc.collect()))
Results:
- After building the `wl` object, memory usage was at 13.5 gigs.
- When the script started reading the `LexiCore` dataset a second time, my system (24 gigs of RAM) started swapping.
- At some point (during `ClicsCore`) swap was full and I had to Ctrl+C the program…

Long story short: I don't know what that `Wordlist` object does, but the garbage collector clearly does not like it…
I understand that `Wordlist` offers a convenient API, but maybe this is the point in time where we try to implement some of its functionality on top of CLDF SQLite?
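Something along these lines, maybe (a rough sketch on top of pycldf's SQLite support; the `FormTable`/`cldf_*` names below are my guess at the generated schema and would need to be checked):

```python
import sqlite3

from pycldf import Dataset
from pycldf.db import Database

# one-time conversion of a CLDF dataset to SQLite (placeholder path;
# the CLI equivalent is `cldf createdb`)
ds = Dataset.from_metadata("raw/some-dataset/cldf/cldf-metadata.json")
Database(ds, fname="some-dataset.sqlite").write_from_tg()

# query instead of holding a Wordlist object in memory; the table/column
# names are my assumption about pycldf's generated schema
con = sqlite3.connect("some-dataset.sqlite")
rows = con.execute(
    "SELECT cldf_languageReference, cldf_form "
    "FROM FormTable WHERE cldf_parameterReference = ?",
    ("1234_mother",),  # placeholder concept/parameter ID
).fetchall()
con.close()
```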
I am not against using this opportunity to go for cldf-sqlite for cl-toolkit's Wordlist.
As to memory usage: I think we only need to read ONE wordlist now. So I would opt for redoing the code so that it does not read them four times (!).
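Roughly like this (only a sketch; `metadata_paths` and the per-collection handlers are placeholders, and the exact `Wordlist` constructor arguments would need to be checked):

```python
from pycldf import Dataset
from cltoolkit import Wordlist


def handle_lexicore(wordlist):
    """Placeholder for the LexiCore-specific processing."""


def handle_clicscore(wordlist):
    """Placeholder for the ClicsCore-specific processing."""


# placeholder for the per-dataset metadata files
metadata_paths = ["raw/dataset-a/cldf/cldf-metadata.json"]

# read every dataset from disk exactly once ...
wl = Wordlist(datasets=[Dataset.from_metadata(p) for p in metadata_paths])

# ... and reuse the same Wordlist for each collection, instead of building
# (and re-reading) a fresh Wordlist for LexiCore, ClicsCore, etc.
for handle_collection in (handle_lexicore, handle_clicscore):
    handle_collection(wl)
```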
As for concepticon errors: they are due to the fact that we need to update some datasets to concepticon 3.0.
My suggestion would be:
Note, btw, that for this use case, with >100 datasets, the sqlite workflow would require us to convert all those datasets to sqlite beforehand, right? This may also not be the best solution.
But what we could think about is a solution where we do what `Wordlist` is doing, but automate the procedure in such a way that we dump several cldf-lexibank datasets into one big sqlite for reuse.
With such a command (in pylexibank?), we'd have all we need for aggregation, and aggregation is the major selling point of CLDF/Lexibank. To extract more fine-grained datasets, one would then use cldfbench to convert or add information to the aggregated sqlite and write it to cldf.
So this sqlite file would essentially be created in the download command of lexibank here.
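As a very rough sketch of what that download step could do (the `DATASETS` list and the table layout are just placeholders, not a finished design):

```python
import sqlite3

from pycldf import Dataset

# placeholder list of (dataset ID, metadata path) pairs, e.g. derived from etc/lexibank.csv
DATASETS = [
    ("dataset-a", "raw/dataset-a/cldf/cldf-metadata.json"),
    ("dataset-b", "raw/dataset-b/cldf/cldf-metadata.json"),
]

con = sqlite3.connect("lexibank.sqlite")
con.execute(
    "CREATE TABLE IF NOT EXISTS forms "
    "(dataset TEXT, id TEXT, language TEXT, concept TEXT, form TEXT)"
)
for dsid, path in DATASETS:
    ds = Dataset.from_metadata(path)
    # iter_rows should make values accessible via CLDF property names (to be double-checked)
    con.executemany(
        "INSERT INTO forms VALUES (?, ?, ?, ?, ?)",
        (
            (dsid, row["id"], row["languageReference"], row["parameterReference"], row["form"])
            for row in ds.iter_rows(
                "FormTable", "id", "languageReference", "parameterReference", "form"
            )
        ),
    )
    con.commit()
con.close()
```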
Important: I just added to `raw/sources.bib` ALL of the currently 116 datasets that are in LexiCore (a subset of those is also in ClicsCore and ProtoCore), so we now have the supposedly final set of data. The sources are ALSO available in our sheet on Google, which we only need to update later. This means we now have an authoritative source for each dataset (which can please be checked and enhanced).
What I realized: for VanuatuVoices I don't know the status, as there is no sources.bib file and no version; should we drop it for now?
I will merge this now. My next step would be to try and propose (I hope I find time on Thursday) an alternative procedure in which we load the data for the 100+ datasets just one time. Once that is done, I'd be very happy to hand this over to the others for a) fixing my errors, and b) maybe thinking of ways to switch to sqlite handling. But for now, I am confident a computer with 32 GB of RAM can handle this in 30 minutes.
@xrotwang and @johenglisch and @chrzyki, please check this PR, which adds a lexibank file and makes this primarily a lexibank dataset that also computes additional features. For dev purposes, I reduced the list of data to be included to only 5 exemplary datasets (`etc/lexibank-dev.csv`), but it should now also run with all of the data. With this example dataset, which adds (1) broad concepts from CLICS communities for fuzzy semantic search and (2) sound classes and CV templates for narrowing down the search for specific concepts, like words starting with M and meaning "mother", lexibank should allow for rather successful filtering in the lexibank app.
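Just to make the intended filtering concrete, here is a toy example (the column names are made up for illustration and are not the actual columns of the dataset):

```python
def find_candidates(rows, broad_concept, initial_class):
    """Toy filter: e.g. broad_concept='MOTHER', initial_class='M' for m-initial words."""
    return [
        row for row in rows
        if row["Broad_Concept"] == broad_concept            # (1) fuzzy semantic grouping
        and row["Sound_Classes"].startswith(initial_class)  # (2) sound-class narrowing
    ]
```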