lexibank / lexibank-analysed

Study on lexibank data (presenting the lexibank dataset).
Creative Commons Attribution 4.0 International

add additional data to lexibank app #41

Closed. LinguList closed this 1 year ago.

LinguList commented 1 year ago

@xrotwang and @johenglisch and @chrzyki, please check this PR, which adds a lexibank file and makes this primarily a lexibank dataset that also computes additional features. For dev purposes, I reduced the list of data to be included to only 5 exemplary datasets (etc/lexibank-dev.csv), but it should now also run with all of the data.

With this example dataset, which adds (1) broad concepts from CLICS communities for fuzzy semantic search and (2) sound classes and CV templates for narrowing down the search for specific concepts, like words starting with M and meaning "mother", lexibank should allow for rather successful filtering in the lexibank app.
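
A toy illustration of the kind of filter these two additions are meant to support (not code from this PR; the column names and example values are made up):

rows = [
    {"Form": "mama", "Broad_Concept": "MOTHER", "Sound_Classes": "M A M A", "CV_Template": "CVCV"},
    {"Form": "ina", "Broad_Concept": "MOTHER", "Sound_Classes": "I N A", "CV_Template": "VCV"},
]

# "words starting with M and meaning 'mother'": filter on the first sound class
# and on the broad concept instead of on raw orthography
hits = [
    r for r in rows
    if r["Broad_Concept"] == "MOTHER" and r["Sound_Classes"].split()[0] == "M"
]
print([r["Form"] for r in hits])  # ['mama']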

LinguList commented 1 year ago

If we agree in principle on this workflow, I'd hand over the code and the details and turn to finalizing the individual CLDF datasets.

johenglisch commented 1 year ago

Just a quick note: there seem to be both an old and a new version of the script in this branch.

I took the liberty of deleting the old one – hopefully in a way that gives us a contiguous git history (and a readable diff).

LinguList commented 1 year ago

Yes, @johenglisch, you are right. I should just have renamed the old one instead, sorry!

LinguList commented 1 year ago

Perfect, I just saw that you did exactly that, thanks!

johenglisch commented 1 year ago

I also fixed some places in the code that referenced the old Python file.

I'm a bit worried about the Wordlist line right here:

https://github.com/lexibank/lexibank-analysed/blob/b2d72ac08c2720b17459849827cc40dd5307aaf8/lexibank_lexibank_analysed.py#L357-L359

I remember from Annika's dataset that loading all datasets into memory at once like that completely ate up all RAM (and that was just a subset of what Lexibank Analysed is dealing with). That's why we ended up loading the data one contribution at a time:

https://github.com/lexibank/bodyobjectcolexifications/blob/fab87042c4a14cf1bb0023ee0394c3f2b314204e/cldfbench_tjukabodyobject.py#L217-L218
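
For comparison, a rough sketch of that per-contribution approach (assuming cl-toolkit's Wordlist accepts a list of pycldf datasets; metadata_paths is a placeholder for however the metadata files are collected):

from pycldf import Dataset
from cltoolkit import Wordlist

for metadata_path in metadata_paths:  # one cldf-metadata.json per contribution
    wl = Wordlist([Dataset.from_metadata(metadata_path)])
    for language in wl.languages:
        ...  # hand languages/forms over to the writer
    # wl goes out of scope here, so only one contribution is in memory at a time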

LinguList commented 1 year ago

Yes, it can be done independently; I just did not know how to tweak the _add_language function for my needs, since this is just loading the data and then outputting it. So modification suggestions or follow-up PRs are very welcome ;)

LinguList commented 1 year ago

The question, @johenglisch, is whether we run into memory problems in the end. But right now, the original code reads -- if I understand it correctly -- the data multiple times (LexiCore, ClicsCore), although theoretically this would not be needed (?).

johenglisch commented 1 year ago

Okay, I just did a quick experiment (that didn't turn out that quick in the end (<_<)" ). I ran the program on the whole Lexibank set instead of just the dev sample (assuming that's what this script is intended for eventually?):

$ cp etc/lexibank.csv etc/lexibank-dev.csv
$ cldfbench download lexibank_lexibank_analysed.py
$ cldfbench lexibank.makecldf lexibank_lexibank_analysed.py

On the first try, the script ran into an exception:

[...]
  File "./lexibank_lexibank_analysed.py", line 385, in cmd_makecldf
    writer.add_concept(
  File "ENV/lib/python3.8/site-packages/pylexibank/cldf.py", line 327, in add_concept
    raise ValueError('Concepticon ID / Gloss mismatch %s != %s' % (
ValueError: Concepticon ID / Gloss mismatch KID != YOUNG GOAT (KID)

I just quickly wrapped the line in a Gotta Catch 'em All and moved on. Also, I manually dereferenced the Wordlist object and triggered the gc, to at least give the program a chance:

args.log.info("GARBAGE COLLECT")
wl = None
import gc
args.log.info("COLLECTED {} THINGS".format(gc.collect()))

Results:

Long story short: I don't know what that Wordlist object does but the garbage collector clearly does not like it…

xrotwang commented 1 year ago

I understand that Wordlist offers a convenient API, but maybe this is the point in time where we try to implement some of its functionality on top of CLDF SQLite?
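
A minimal sketch of what that could look like with pycldf's SQLite support (paths and the query are only illustrative; the cldf_ column prefixes follow pycldf's convention, and cldf createdb is the CLI equivalent):

from pycldf import Dataset
from pycldf.db import Database

ds = Dataset.from_metadata("cldf/cldf-metadata.json")
db = Database(ds, fname="lexibank.sqlite3")
db.write_from_tg()  # dump the CLDF tables into the sqlite file

# queries then run against sqlite instead of an in-memory Wordlist
rows = db.query("SELECT cldf_id, cldf_form FROM FormTable LIMIT 5")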

LinguList commented 1 year ago

I am not against using this opportunity to go for cldf-sqlite for cl-toolkit's Wordlist.

LinguList commented 1 year ago

As to memory usage: I think we only need to read ONE wordlist now. So I would opt for re-doing the code to not read them four times (!).
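
A sketch of what "only one wordlist" could mean in code (the raw/ layout and glob pattern for collecting the metadata files are assumptions):

import pathlib

from pycldf import Dataset
from cltoolkit import Wordlist

raw_dir = pathlib.Path("raw")  # assumed location of the checked-out datasets
datasets = [
    Dataset.from_metadata(path)
    for path in sorted(raw_dir.glob("*/cldf/cldf-metadata.json"))
]
wl = Wordlist(datasets)  # built once, then reused for LexiCore, ClicsCore, ...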

LinguList commented 1 year ago

As for the Concepticon errors: they are due to the fact that we need to update some datasets to Concepticon 3.0.

LinguList commented 1 year ago

My suggestion would be:

  1. re-do the code to read all the data into a wordlist only once (that SHOULD work)
  2. for any access to the data, think about implementing an sqlite3 facility for cl-toolkit on top of the newly created CLDF

LinguList commented 1 year ago

Note, btw, that for this use case, with >100 datasets, the sqlite workflow would require us to convert all those datasets to sqlite first, right? This may also not be the best solution.

But what we could think about is a solution where we do what Wordlist is doing, but automate the procedure in such a way that we dump several cldf-lexibank datasets into one big sqlite for reuse.
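
One way such an aggregation could look (sketch only: the table layout is made up, and it assumes pycldf's iter_rows aliases the requested CLDF properties):

import pathlib
import sqlite3

from pycldf import Dataset

# assumed mapping of dataset IDs to their cldf-metadata.json files under raw/
metadata_paths = {
    p.parent.parent.name: p
    for p in pathlib.Path("raw").glob("*/cldf/cldf-metadata.json")
}

con = sqlite3.connect("lexibank-aggregated.sqlite3")
con.execute(
    "CREATE TABLE IF NOT EXISTS forms "
    "(dataset TEXT, id TEXT, language TEXT, concept TEXT, form TEXT)"
)
for dataset_id, metadata_path in metadata_paths.items():
    ds = Dataset.from_metadata(metadata_path)
    con.executemany(
        "INSERT INTO forms VALUES (?, ?, ?, ?, ?)",
        (
            (dataset_id, row["id"], row["languageReference"],
             row["parameterReference"], row["form"])
            for row in ds.iter_rows(
                "FormTable", "id", "languageReference", "parameterReference", "form")
        ),
    )
con.commit()
con.close()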

LinguList commented 1 year ago

With such a command (in pylexibank?), we'd have all we need for aggregation, and aggregation is the major selling point of CLDF/Lexibank. To extract more fine-grained datasets, one would then use cldfbench to convert or add information to the aggregated sqlite and write it to cldf.

LinguList commented 1 year ago

So this sqlite file would essentially be created in the download command of lexibank here.
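
Structurally, that hook might look something like this (a sketch against cldfbench's Dataset API; the actual class lives in lexibank_lexibank_analysed.py and does much more, and the aggregation body is elided here):

import pathlib
import sqlite3

from cldfbench import Dataset as BaseDataset

class Dataset(BaseDataset):
    dir = pathlib.Path(__file__).parent
    id = "lexibank-analysed"

    def cmd_download(self, args):
        # ... check out the individual lexibank datasets into raw/ as before ...
        # then dump them into one sqlite file for later reuse (schema as in the
        # aggregation sketch above)
        con = sqlite3.connect(str(self.raw_dir / "lexibank.sqlite3"))
        con.execute(
            "CREATE TABLE IF NOT EXISTS forms "
            "(dataset TEXT, id TEXT, language TEXT, concept TEXT, form TEXT)"
        )
        con.commit()
        con.close()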

LinguList commented 1 year ago

Important: I just added to raw/sources.bib ALL of the currently 116 datasets that are in LexiCore (a subset of those is also in ClicsCore and ProtoCore), so we now have the supposedly final set of data, and the sources are ALSO available in our sheet on Google, which we only need to update later. This means we now have an authoritative source for each dataset (which can please be checked and enhanced).

What I realized: for VanuatuVoices I don't know the status, as there is no sources.bib file and no version; should we drop it for now?

LinguList commented 1 year ago

I will merge this now. My next step would be to try and propose (I hope I find time on Thursday) an alternative procedure in which we just load the data for the 100+ datasets one time. Once that is done, I'd be very happy to hand this over to the others for a) fixing my errors, and b) maybe thinking of ways to switch to sqlite handling. But for now, I am confident a computer with 32 GB of RAM can handle this in 30 minutes.