lexibank / lexibank-analysed

Study on lexibank data (presenting the lexibank dataset).
Creative Commons Attribution 4.0 International

refactored the commands to cldfbench subcommands #22

Closed xrotwang closed 3 years ago

johenglisch commented 3 years ago

Looks good, but I'm wondering why the project should be switching to cldfbench in the first place?

I always thought a cldfbench is a template for a single CLDF project, i.e. for reproducibly converting raw/ data into cldf/ data.

The lexibank study, however, collects different datasets from all around github and does statistics and plotting on them, which seems like a different use case.

Unless the plan is to compile the findings into a CLDF dataset later on? Or are those cldfbench commands supposed to work for other cldfbenches, too (which would be hindered by the fact that the list of datasets is hard-coded in the lexibank/data folder)?

LinguList commented 3 years ago

As far as I can see, the advantage of making this a plugin for cldfbench is that the infrastructure is unified. We have quite a few examples where cldfbench's CLI is extended by a given CLDF dataset, so if we introduce this rather straightforward command-line syntax to people working with cldfbench, they may adapt to it more easily.

One problem we have to discuss is the namespace. I wanted to switch to "lexibank1" or similar, addressing this as a specific release of lexibank, but using only "lexibank" will clash with "pylexibank".

If we go for "lexibank1" (which I think sounds better than "lexibank-study"), we could later do something similar for "lexibank2" if we make a new release, similar to clics1, clics2, clics3.

xrotwang commented 3 years ago

While the main use case of cldfbench is indeed the creation of CLDF datasets, it also provides a framework to create cli commands, with uniform access to catalog data. That said, I think we actually should create a CLDF dataset here, containing the computed features.
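To illustrate the "framework to create cli commands" point, here is a minimal sketch of the module shape that, as far as I recall, cldfbench expects from a subcommand: a `run(args)` function plus an optional `register(parser)`. The `--collection` option, its default, and the `main` helper are hypothetical and only stand in for cldfbench's own dispatcher, so the sketch runs with the standard library alone.

```python
import argparse


def register(parser):
    # Hypothetical option; the real commands in this repository may differ.
    parser.add_argument(
        "--collection",
        default="lexicore",
        help="name of the dataset collection to analyse (assumed)",
    )


def run(args):
    # A real command would read the datasets and compute features here.
    return "computing features for collection: {}".format(args.collection)


def main(argv=None):
    # Stand-in for the cldfbench dispatcher, which would normally call
    # register() and run() for us after discovering the module.
    parser = argparse.ArgumentParser(prog="cldfbench lexibank.features")
    register(parser)
    return run(parser.parse_args(argv))


if __name__ == "__main__":
    print(main())
```

In an actual plugin, the package containing such modules would be advertised through the `cldfbench.commands` entry point rather than run via `main`.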

LinguList commented 3 years ago

Ah, that would indeed be cool! Then the commands we have for making the two sub-datasets, lexicore and clics, would instead be called by cldfbench makecldf, right? And the download would put the datasets into some raw directory?

xrotwang commented 3 years ago

In fact, I think, datasets/ should be replaced with raw/ and all the plotting be based on the computed CLDF output.

johenglisch commented 3 years ago

And those generated json files could go into etc/.

LinguList commented 3 years ago

Now that you suggest it, this makes complete sense to me. Actually very cool and this would also be much more consistent.

LinguList commented 3 years ago

I mean the general idea: this is a CLDF Structure Dataset. The download would fetch the CLDF datasets into raw, and makecldf would then load all the data and compute the features for each collection. The collections are defined by the information in the lexibank.tsv file, which we would also place in raw, and which would likewise be updated from collabutils.
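A rough sketch of the "depending on the collection" step, assuming lexibank.tsv is a tab-separated file with a `Dataset` column and one column per collection that is non-empty when the dataset belongs to it. The column names `LexiCore` and `ClicsCore`, the `x` marker, and the sample rows are assumptions for illustration, not the actual file contents.

```python
import csv
import io


def datasets_by_collection(tsv_text, collections=("LexiCore", "ClicsCore")):
    """Group dataset IDs by the collections they are marked for."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {name: [] for name in collections}
    for row in reader:
        for name in collections:
            # Any non-empty cell counts as membership (assumed convention).
            if (row.get(name) or "").strip():
                result[name].append(row["Dataset"])
    return result


# Made-up sample in the assumed layout.
SAMPLE = (
    "Dataset\tLexiCore\tClicsCore\n"
    "dataset_a\tx\tx\n"
    "dataset_b\tx\t\n"
)

if __name__ == "__main__":
    print(datasets_by_collection(SAMPLE))
```

A makecldf step could then loop over each collection's dataset list, load the corresponding CLDF data from raw, and compute the features per collection.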

This may even be an idea for the future CLICS dataset workflow, if we can think of a good way to represent the network-data in CLDF.

xrotwang commented 3 years ago

lexibank.tsv could also go into etc, I guess, since it has different provenance than the datasets in raw: it's configuration data, rather than input data.

LinguList commented 3 years ago

Makes sense. Question is: should we merge now, and then work on the new way to code this up?

LinguList commented 3 years ago

And if so, do you agree with lexibank1 or do you have another suggestion for new names that would be useful for the future of lexibank data, taking more versions into account?

xrotwang commented 3 years ago

Yes, merge first, then refactor the commands. And I actually liked "lexibank-study" - or maybe "lexibank-analysis"?

LinguList commented 3 years ago

Lexibank-analysis maybe. I use "study" as a synonym for "paper", since study does not imply that the paper has appeared. But if we make this a dataset, it would be good to emphasize this, and "analysis" would do so. Future updates would then just have a new version, I assume?

xrotwang commented 3 years ago

We could call it "lexibank-analysed", maybe.

LinguList commented 3 years ago

Yep. That sounds good to me! And for now (though things change, of course) we would assume that a future lexibank version 2, which contains more features and the like, would simply be released as a new version of this dataset, right? So future analyses would all go here.

xrotwang commented 3 years ago

Yes. This could grow into whatever can be derived from the lexical datasets.