Closed johenglisch closed 6 months ago
You can also download the complete repos, using Record.download, rather than download_dataset. But getting just the CLDF should work in these cases. (It might fail if there's more than one CLDF dataset in the same repository, which is the case for lexibank/hindukush - but that's not part of LexiCore, I think)
Well, if downloading the whole bench is more robust, then we might as well do it – reduces the chances of surprises later on.
Yes, it's still "smaller" than cloning the full repos - and we exploit the lexibank convention of naming the Wordlist metadata file cldf-metadata.json anyway.
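To make the naming convention concrete, here is a minimal sketch of locating the Wordlist metadata in an unpacked download. The helper name find_wordlist_metadata and the directory layout are hypothetical; only the file name cldf-metadata.json comes from the discussion above. More than one match is reported as an error rather than resolved by guessing.

```python
from pathlib import Path


def find_wordlist_metadata(root):
    """Return the single cldf-metadata.json found below `root`.

    Hypothetical helper: relies only on the file-naming convention
    mentioned in the thread, not on any cldfzenodo or cldfbench API.
    """
    matches = sorted(Path(root).rglob("cldf-metadata.json"))
    if not matches:
        raise FileNotFoundError(f"no cldf-metadata.json below {root}")
    if len(matches) > 1:
        # Several candidates: refuse to pick one silently.
        found = ", ".join(str(m) for m in matches)
        raise ValueError(f"more than one cldf-metadata.json: {found}")
    return matches[0]
```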
If we wanted to be on the safe side - regarding versions - also when lexibank.csv is updated, we could compare size of Zenodo downloads with size of the local directories. The size of the zipfiles is available from the OAI records.
I don't know… Sizes feel a bit unstable. If the sizes are different, we know that we need to re-download, but if the sizes are the same, it might very well be that someone just switched some IDs around or fixed a typo. Chances are slim but if it does happen finding out why people get different output on different computers might turn into a wild goose chase…
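The concern about sizes can be illustrated with a small sketch: two copies of a file in which IDs were swapped have the same byte size but different content, so only a content digest tells them apart. The function names dir_size and dir_digest are made up for illustration, and the sketch compares two local directories only; whether a suitable checksum can be obtained from the Zenodo record is not established in this thread.

```python
import hashlib
from pathlib import Path


def dir_size(root):
    """Total size in bytes of all regular files below `root`."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def dir_digest(root):
    """SHA-256 over relative paths and file contents, in sorted path order."""
    root = Path(root)
    h = hashlib.sha256()
    for p in sorted(q for q in root.rglob("*") if q.is_file()):
        h.update(p.relative_to(root).as_posix().encode("utf-8"))
        h.update(b"\0")  # separator between path and content
        h.update(p.read_bytes())
    return h.hexdigest()
```

If two directories have equal dir_size but different dir_digest, the data changed without changing in size, which is exactly the case a size check would miss.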
Yes, a global "reload all" switch might be the better option here.
I guess one could look for a different piece of information (inside .zenodo.json for instance) and compare that to the record. But I don't know if that's actually worth the effort – people don't tend to re-download the data all that often…
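For what it's worth, such a check could look roughly like the sketch below. Which field to compare is an assumption: "version" is used purely as a placeholder and may well be absent from a given .zenodo.json. The helper name local_matches_record is hypothetical, and the value from the record is passed in as a plain string rather than fetched through any particular API.

```python
import json
from pathlib import Path


def local_matches_record(repo_dir, record_value, key="version"):
    """Compare one field of a local .zenodo.json with a value from the record.

    Hypothetical helper; `key` is an assumed field name. Returns False when
    the file or the field is missing, so the caller re-downloads in doubt.
    """
    path = Path(repo_dir) / ".zenodo.json"
    if not path.exists():
        return False
    with path.open(encoding="utf-8") as f:
        local = json.load(f)
    return key in local and local[key] == record_value
```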
And even if they do, it's not GBs of data ...
Btw, is there anything left to do for this PR or can it be merged?
I think it's good to be merged, but was a bit hesitant because the "cloned repositories" method might be easier when we are still in a phase where things may need corrections here and there?
I liked the cloned repositories in our normal run, also for developmental reasons. However, we have 100 datasets for testing now, any future addition would also have to be checked and can be done with our workflow, so for the workflow of playing with the data, I do not see a problem so far. What might be good is to run the workflow once for all data with plots, to see if this works well?
It would be good to decide, ideally early next week, whether we make this PR a part of the new version or not. I'd opt for releasing the current version as version 0.1 first (I added the .zenodo.json already), and later take this as an updated version.
Sounds like a good plan
I don't find the repo on zenodo, although it is online. Do we have to wait here now?
? which repos?
lexibank-analysed
I wanted to make the 0.1 release after having made this one public.
ah, ok. will wire it up.
If you can make the release, that would be perfect. I'd then insert the link in the paper and finalize the paper on Monday (figures are already done, but not yet inserted, references as well).
https://github.com/lexibank/lexibank-analysed/releases/tag/v0.2
v0.2 - because v0.1 had invalid .zenodo.json :)
Sorry, I was so sure that this .zenodo.json was right this time :(
We should merge this PR now, so we can proceed!
@LinguList This was never merged, and now there seem to be some conflicts. If we still want to update this, I could write the changes here into a more recent version of the cldfbench script.
We should merge first via cmd line, I'd try to do so later. It is but one file.
Thanks!!!
Note: cldfzenodo seems to only download the cldf data itself, not the entire bench. cldfbench cldfmake seems to work just fine, but I don't know if this could break any other commands.