lexibank / lexibank-analysed

Study on lexibank data (presenting the lexibank dataset).
Creative Commons Attribution 4.0 International

Use `cldfzenodo` to download data #33

Closed: johenglisch closed this 6 months ago

johenglisch commented 3 years ago

Note: cldfzenodo seems to only download the cldf data itself, not the entire bench. cldfbench cldfmake seems to work just fine, but I don't know if this could break any other commands.

xrotwang commented 3 years ago

You can also download the complete repos, using Record.download, rather than download_dataset. But getting just the CLDF should work in these cases. (It might fail if there's more than one CLDF dataset in the same repository, which is the case for lexibank/hindukush - but that's not part of LexiCore, I think)
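
To make the two modes concrete, here is a minimal sketch. It assumes `record` is a `cldfzenodo` `Record` and that `download` and `download_dataset` each accept a destination directory; the `fetch` helper and its `whole_bench` flag are hypothetical names for illustration, not part of the library.

```python
from pathlib import Path


def fetch(record, dest, whole_bench=True):
    """Download a Zenodo record into `dest` and return that directory.

    `record` is assumed to behave like a cldfzenodo Record, i.e. to offer
    `download(dest)` (complete repository) and `download_dataset(dest)`
    (CLDF data only). Both call signatures are assumptions of this sketch.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    if whole_bench:
        # Complete repository, including the cldfbench script.
        record.download(dest)
    else:
        # CLDF data only; may be ambiguous if a repository ships
        # more than one CLDF dataset.
        record.download_dataset(dest)
    return dest
```

Defaulting to the whole bench would match the "more robust" option; switching the flag off would keep the download as small as possible.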

johenglisch commented 3 years ago

Well, if downloading the whole bench is more robust, then we might as well do it – reduces the chances of surprises later on.

xrotwang commented 3 years ago

Yes, it's still "smaller" than cloning the full repos - and we exploit the lexibank convention of naming the Wordlist metadata file cldf-metadata.json anyway.
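
Relying on that naming convention could look roughly like the sketch below. The function name `find_wordlist_metadata` is made up for illustration, and the "exactly one match" rule is an assumption that guards against repositories containing several CLDF datasets.

```python
from pathlib import Path

METADATA_NAME = "cldf-metadata.json"  # lexibank naming convention


def find_wordlist_metadata(bench_dir):
    """Return the path of the single cldf-metadata.json below `bench_dir`.

    Raises ValueError if no file or more than one file is found, so an
    ambiguous download is reported instead of silently picking one.
    """
    matches = sorted(Path(bench_dir).rglob(METADATA_NAME))
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one {METADATA_NAME} in {bench_dir}, "
            f"found {len(matches)}"
        )
    return matches[0]
```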

xrotwang commented 3 years ago

If we wanted to be on the safe side - regarding versions - also when lexibank.csv is updated, we could compare size of Zenodo downloads with size of the local directories. The size of the zipfiles is available from the OAI records.
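
A rough sketch of such a size check, using only the standard library. The helper names are hypothetical, and `expected_size` is assumed to be a byte count that is actually comparable with an unpacked directory; the size of a zipfile as reported by Zenodo will usually be smaller because of compression, so it would have to be translated first.

```python
from pathlib import Path


def dir_size(path):
    """Total size in bytes of all regular files below `path`."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())


def needs_redownload(local_dir, expected_size):
    """True if the local copy is missing or its size differs.

    Equal sizes do NOT prove that the content is identical; this is
    only a cheap first check.
    """
    local_dir = Path(local_dir)
    if not local_dir.is_dir():
        return True
    return dir_size(local_dir) != expected_size
```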

johenglisch commented 3 years ago

I don't know… Sizes feel a bit unstable. If the sizes are different, we know that we need to re-download, but if the sizes are the same, it might very well be that someone just switched some IDs around or fixed a typo. Chances are slim but if it does happen finding out why people get different output on different computers might turn into a wild goose chase…

xrotwang commented 3 years ago

Yes, a global "reload all" switch might be the better option here.
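
Something along these lines, as a sketch only. The `--reload-all` flag and both function names are hypothetical and not taken from the existing cldfbench script; the assumption is that each dataset lives in a directory named after its ID.

```python
import argparse
from pathlib import Path


def datasets_to_download(dataset_ids, target_dir, reload_all=False):
    """Return the dataset IDs that should be fetched.

    Without `reload_all`, only datasets that have no local directory
    yet are returned; with it, every dataset is returned.
    """
    target_dir = Path(target_dir)
    if reload_all:
        return list(dataset_ids)
    return [d for d in dataset_ids if not (target_dir / d).is_dir()]


def build_parser():
    """Command-line parser with a global reload switch (hypothetical)."""
    parser = argparse.ArgumentParser(description="Download lexibank datasets")
    parser.add_argument(
        "--reload-all",
        action="store_true",
        help="re-download every dataset, even if a local copy exists",
    )
    return parser
```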

johenglisch commented 3 years ago

I guess one could look for a different piece of information (inside .zenodo.json for instance) and compare that to the record. But I don't know if that's actually worth the effort – people don't tend to re-download the data all that often…
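
For illustration, such a comparison might be sketched as below. Which field to compare is an open question; the `version` key used here is purely an assumption, as are the function name and the idea that the caller already has the corresponding value from the Zenodo record.

```python
import json
from pathlib import Path


def local_matches_record(bench_dir, remote_value, key="version"):
    """Compare one field of the local .zenodo.json with a remote value.

    Returns False if the file or the key is missing, so that a missing
    piece of information leads to a re-download rather than a silent
    pass. The default key "version" is an assumption of this sketch.
    """
    path = Path(bench_dir) / ".zenodo.json"
    if not path.is_file():
        return False
    with path.open(encoding="utf-8") as f:
        local = json.load(f)
    return key in local and local[key] == remote_value
```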

xrotwang commented 3 years ago

And even if they do, it's not GBs of data ...

johenglisch commented 3 years ago

Btw, is there anything left to do for this PR or can it be merged?

xrotwang commented 3 years ago

I think it's good to be merged, but was a bit hesitant because the "cloned repositories" method might be easier when we are still in a phase where things may need corrections here and there?

LinguList commented 3 years ago

I liked the cloned repositories in our normal run, also for developmental reasons. However, we have 100 datasets for testing now, any future addition would also have to be checked and can be done with our workflow, so for the workflow of playing with the data, I do not see a problem so far. What might be good is to run the workflow once for all data with plots, to see if this works well?

LinguList commented 3 years ago

It would be good to decide, ideally early next week, whether we make this PR a part of the new version or not. I'd opt for releasing the current version as version 0.1 first (I added the .zenodo.json already), and later take this as an updated version.

xrotwang commented 3 years ago

Sounds like a good plan

LinguList commented 3 years ago

I don't find the repo on zenodo, although it is online. Do we have to wait here now?

xrotwang commented 3 years ago

? which repos?

LinguList commented 3 years ago

lexibank-analysed

LinguList commented 3 years ago

I wanted to make the 0.1 release after having made this one public.

xrotwang commented 3 years ago

ah, ok. will wire it up.

LinguList commented 3 years ago

If you can make the release, that would be perfect. I'd then insert the link in the paper and finalize the paper on Monday (figures are already done, but not yet inserted, references as well).

xrotwang commented 3 years ago

https://github.com/lexibank/lexibank-analysed/releases/tag/v0.2

v0.2 - because v0.1 had invalid .zenodo.json :)

LinguList commented 3 years ago

Sorry, I was so sure that this .zenodo.json was right this time :(

LinguList commented 2 years ago

We should merge this PR now, so we can proceed!

FredericBlum commented 6 months ago

@LinguList This was never merged, and now there seem to be some conflicts. If we still want to update this, I could write the changes here into a more recent version of the cldfbench script.

LinguList commented 6 months ago

We should merge first via cmd line, I'd try to do so later. It is but one file.

chrzyki commented 6 months ago

Conflict resolved in https://github.com/lexibank/lexibank-analysed/pull/33/commits/405c40022576b65075321031a812a4fddfbe25f4.

LinguList commented 6 months ago

Thanks!!!