xrotwang closed 3 years ago
As far as I can see, the advantage of making this a plugin for cldfbench is that the infrastructure is unified. We have quite a few examples where cldfbench's CLI is extended by a given CLDF dataset, so if we introduce this rather straightforward command-line syntax to people working with cldfbench, they may adapt to it more easily.
One problem we have to discuss is the namespace. I wanted to switch to "lexibank1" or similar, addressing this as a specific release of lexibank, but using only "lexibank" will clash with "pylexibank".
If we go for "lexibank1" (which I think sounds better than "lexibank-study"), we could later do something similar for "lexibank2" if we make a new release, similar to clics1, clics2, clics3.
While the main use case of cldfbench is indeed the creation of CLDF datasets, it also provides a framework to create CLI commands, with uniform access to catalog data.
That said, I think we actually should create a CLDF dataset here, containing the computed features.
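As a rough illustration of the command framework mentioned above: to my understanding, a cldfbench subcommand is a plain module exposing `register(parser)` and `run(args)`, made discoverable through a `cldfbench.commands` entry point. The sketch below uses only the standard library so it runs on its own; the command name, the `--collection` option and the message are hypothetical and not taken from the actual repository.

```python
import argparse


def register(parser):
    """Add this command's options to the (sub)parser handed in by the CLI."""
    # Hypothetical option: which sub-collection the command should work on.
    parser.add_argument(
        "--collection",
        default="lexicore",
        choices=["lexicore", "clics"],
        help="sub-collection to analyse (illustrative values)",
    )


def run(args):
    """Called with the parsed arguments; a real command would plot or compute here."""
    message = "would plot features for collection: {}".format(args.collection)
    print(message)
    return message


# Stand-alone usage without cldfbench: build a parser by hand and call run().
# The prog name is hypothetical.
parser = argparse.ArgumentParser(prog="cldfbench lexibank-analysed.plot")
register(parser)
run(parser.parse_args(["--collection", "clics"]))
```

Inside cldfbench, the parser and the parsed `args` would be supplied by the framework rather than built by hand as in the last three lines.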
Ah, that would indeed be cool! Then the commands we have to make the two sub-datasets lexicore and clics would be called instead by cldfbench makecldf, right? And the download would download the datasets to some raw directory?
In fact, I think datasets/ should be replaced with raw/, and all the plotting should be based on the computed CLDF output. And those generated JSON files could go into etc/.
Now that you suggest it, this makes complete sense to me. Actually very cool and this would also be much more consistent.
I mean the general idea: this is a CLDF Structure Dataset. The download would download CLDF datasets into raw, and makecldf would accordingly load all data and compute the features, depending on the collection, which is based on the information available in the lexibank.tsv file, which we also place in raw and which would likewise be updated from collabutils.
This may even be an idea for the future CLICS dataset workflow, if we can think of a good way to represent the network-data in CLDF.
lexibank.tsv could also go into etc, I guess, since it has a different provenance than the datasets in raw: it's configuration data, rather than input data.
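To make the "configuration data" role concrete, here is a minimal sketch of how a makecldf step might pick datasets per collection from a lexibank.tsv-like file. The column names (`Dataset`, `LexiCore`, `ClicsCore`), the "non-empty cell means member" convention and the sample rows are all assumptions for illustration; the real file layout may well differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for etc/lexibank.tsv.
SAMPLE = "Dataset\tLexiCore\tClicsCore\nexample-a\tx\t\nexample-b\tx\tx\n"


def datasets_by_collection(tsv_text, collections=("LexiCore", "ClicsCore")):
    """Group dataset IDs by the collections they are flagged for."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        for name in collections:
            # Assumed convention: any non-empty cell marks membership.
            if (row.get(name) or "").strip():
                grouped[name].append(row["Dataset"])
    return dict(grouped)


print(datasets_by_collection(SAMPLE))
```

The feature computation would then loop over the datasets of the chosen collection, reading each one from raw.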
Makes sense. Question is: should we merge now, and then work on the new way to code this up?
And if so, do you agree with lexibank1 or do you have another suggestion for new names that would be useful for the future of lexibank data, taking more versions into account?
Yes, merge, then refactor the commands. And I actually liked "lexibank-study" - or maybe "lexibank-analysis"?
Lexibank-analysis maybe. I use "study" as a synonym for "paper", since study does not imply that the paper has appeared. But if we make this a dataset, it would be good to emphasize this, and "analysis" would do so. Future updates would then just have a new version, I assume?
We could call it "lexibank-analysed", maybe.
Yep. That sounds good to me! And for now (but things change, of course) we would assume that a future lexibank version 2, which contains more features and the like, would simply get a new version number here, right? So future analyses would all go here.
Yes. This could grow into whatever can be derived from the lexical datasets.
Looks good, but I'm wondering why the project should be switching to cldfbench in the first place?
I always thought a cldfbench is a template for a single CLDF project, i.e. for reproducibly converting raw/ data into cldf/ data. The lexibank study, however, collects different datasets from all around GitHub and does statistics and plotting on them, which seems like a different use case.
Unless the plan is to compile the findings into a CLDF dataset later on? Or are those cldfbench commands supposed to work for other cldfbenches, too (which would be hindered by the fact that the list of datasets is hard-coded in the lexibank/data folder)?