Refactor cldf creation such that it can be called from a cldfbench repos

xrotwang commented 3 years ago

Moving the CLDF to grambank/grambank should be done by turning it into a cldfbench-enabled repository. Thus, CLDF creation should be implemented as a makecldf method, using functionality from pygrambank.

xrotwang commented 3 years ago

@SimonGreenhill I'm in the process of switching the CLDF creation to a cldfbench-based process. Now, this process needs access to two other repositories: glottobank/Grambank and grambank/grambank.wiki. Right now, I have this working by prompting the user for the locations of clones of these repos. But it would seem like this is a perfect use case for git submodules. A .gitmodules file like this

[submodule "Grambank"]
    path = raw/Grambank
    url = https://github.com/glottobank/Grambank.git
[submodule "grambank.wiki"]
    path = raw/grambank.wiki
    url = https://github.com/grambank/grambank.wiki

might do the trick.

Should we have a go - or could this be too intimidating?
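To see what such a file declares, here is a minimal sketch that reads the snippet above with Python's standard configparser. The helper name read_submodules is made up for illustration; nothing in pygrambank or cldfbench is assumed to work this way.

```python
import configparser

# The .gitmodules content proposed above, inlined so the sketch is self-contained.
GITMODULES = """\
[submodule "Grambank"]
    path = raw/Grambank
    url = https://github.com/glottobank/Grambank.git
[submodule "grambank.wiki"]
    path = raw/grambank.wiki
    url = https://github.com/grambank/grambank.wiki
"""


def read_submodules(text):
    """Map each submodule's checkout path to its clone URL (hypothetical helper)."""
    # Interpolation is switched off so that a '%' in a URL would not be misread.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {parser[name]["path"]: parser[name]["url"] for name in parser.sections()}


submodules = read_submodules(GITMODULES)
for path, url in submodules.items():
    print(path, "<-", url)
```

Both clones end up below raw/, which is where cldfbench expects the "raw" data.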

xrotwang commented 3 years ago

So far, I really like the submodules setup. It also seems to make sense semantically: The CLDF repos is our tool to make release versions of the Grambank data - and it does this by fetching data from HEAD of glottobank/Grambank and grambank/grambank.wiki.
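Fetching from HEAD could be sketched roughly like this, assuming git is on the PATH and the submodules are declared as in the .gitmodules snippet earlier in this thread. The function names are hypothetical; --init and --remote are standard options of git submodule update.

```python
import subprocess


def submodule_update_command(remote=True):
    """Build the git command that initialises and updates all submodules."""
    cmd = ["git", "submodule", "update", "--init"]
    if remote:
        # --remote moves each submodule to the tip of its remote-tracking branch
        # instead of the commit recorded in the CLDF repository.
        cmd.append("--remote")
    return cmd


def update_submodules(repo_dir, remote=True):
    """Run the update inside the clone of the CLDF repository at repo_dir."""
    subprocess.run(submodule_update_command(remote), cwd=repo_dir, check=True)


# Only print the command here; update_submodules() needs a real clone to run in.
print(" ".join(submodule_update_command()))
```

Calling update_submodules with remote=False would instead check out exactly the commits pinned in the CLDF repository, which may be the better choice when reproducing an earlier release.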

xrotwang commented 3 years ago

The only downside I could see is that people may want to clone the CLDF repos - but might not have access to glottobank/Grambank. But then, the CLDF repos is only there to put together releases for Zenodo - so it wouldn't have any interesting history or other items that would only be available through a clone - rather than an export or archived release on Zenodo.

xrotwang commented 3 years ago

@HedvigS You might want to look at this issue as well. I don't know how often you might want to / have to run the CLDF creation in the future, but if you do, you'll be impacted by this.

xrotwang commented 3 years ago

Reading up a bit more on git submodules, it seems as if most of the disadvantages people typically mention don't apply to us. In our case, the repos we pull in via submodules would be read-only clones. The only potential confusion I can see emerges from having two clones of glottobank/Grambank on disk. But it may also turn out to be beneficial to have one "working" copy of this repos - often checked out to a branch for a particular PR - and one implicit copy, as a submodule of the CLDF repos.

SimonGreenhill commented 3 years ago

hmm, I think this makes sense. A couple of questions:

1. Is it possible to work without the submodule if the user doesn't have access to one of the submods? (Is that a concern? Anyone who is making the CLDF should have access to the wiki etc. as well, right?)
2. What if I wanted to change something in the wiki and see it come through in the cldf? I wouldn't be able to make local changes to my wiki clone and run this, but would have to get the change accepted into grambank-wiki first? (I don't know if this would ever happen, but the 'read only' nature of submodules would mean it can't, right?)

xrotwang commented 3 years ago

Ad 1.: I'm not sure what "work" would mean here. The CLDF repository has one single job AFAICT: Shoving grambank releases to Zenodo. With the cldfbench setup, it will acquire a second job: Creating the CLDF dataset. The first job is not something anyone but us is going to do. The second one requires access to the submodules, i.e. doesn't make sense without such access.

Ad 2.: Oh, submodules are not read-only by nature. But being able to treat them as such makes some typical problems go away. So you can totally make changes in the wiki locally - and then either discard them (via checkout) or commit and push if you have the permissions.
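Since the second job only makes sense with access to the submodules, a small pre-flight check along these lines could tell a user early that a checkout is missing. This is just a sketch: the paths are the ones from the .gitmodules snippet above, and missing_raw_repos is a hypothetical helper, not part of pygrambank or cldfbench.

```python
import pathlib
import tempfile

# Checkout paths as proposed in the .gitmodules snippet earlier in this thread.
RAW_REPOS = ("raw/Grambank", "raw/grambank.wiki")


def missing_raw_repos(dataset_dir):
    """Return the submodule paths that are absent or still empty."""
    root = pathlib.Path(dataset_dir)
    missing = []
    for rel in RAW_REPOS:
        path = root / rel
        # A submodule that was never initialised shows up as an empty directory.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


# Demo with a throwaway directory in which only one of the two clones is populated.
with tempfile.TemporaryDirectory() as tmp:
    grambank = pathlib.Path(tmp) / "raw" / "Grambank"
    grambank.mkdir(parents=True)
    (grambank / "README.md").write_text("placeholder")
    print(missing_raw_repos(tmp))  # ['raw/grambank.wiki']
```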

SimonGreenhill commented 3 years ago

ok, sounds good :)

xrotwang commented 3 years ago

@HedvigS are you ok with this? If so, I'd push the changes to pygrambank and to this repos, and CLDF creation will then be done with a new (but simpler) command.

xrotwang commented 3 years ago

@HedvigS thoughts?

HedvigS commented 3 years ago

I'm sorry I didn't see this last week. I trust you to make good calls here.

HedvigS commented 3 years ago

If I'm honest, I don't really understand what interest anyone but us would have in using most/all commands of pygrambank. I don't fully understand what users we are expecting.

All I need is to be able to screen PRs for the behind the scenes repos (glottobank/Grambank) and make certain updates related to the ms and first release (wherever those things are kept). Right now, we also need to be able to push changes to the website (because of wiki updates) and this smaller change involved in this PR to parameters.csv without affecting the Values themselves. Ideally in future releases, that shouldn't be de-coupled at all but all done at once and all changes wait until the next version.

I understand how the proper way is to archive things with Zenodo, but for convenience I would appreciate if we also kept the cldf GitHub repos. I don't understand really what you're talking about with two glottobank/Grambank.

Should we just schedule a meeting to talk it through? I'm getting quite confused, and I don't fully get what purpose all of this structure is meant to serve.

xrotwang commented 3 years ago

Only the cldf command is affected by this change, and it is replaced by something equivalent. Also, going back to the old behaviour should be simple, because you have installed pygrambank from a clone, so you could just check out the commit before my changes, ok?

HedvigS commented 3 years ago

Yes, I understood that only cldf was affected. I just didn't fully follow the basis of that, other than that sub-modules apparently are neat. I'm sure it's a great reason, I just don't follow it. I also still don't understand who the imagined users are, so that makes it tricky to follow along with changes.

I don't want to make use of the old behaviour if that's no longer the way things are. I'd rather know how to use the new behaviour, or for now kindly ask that someone else runs the necessary commands so that the changes in glottobank/Grambank #1193 and #36 are implemented in grambank/grambank-cldf parameters.csv.

HedvigS commented 3 years ago

For example, I don't understand "people may want to clone the CLDF repos - but might not have access to glottobank/Grambank". Why would they need access to glottobank/Grambank?

Why would anyone besides essentially us three want to cldf-render from glottobank/Grambank? Isn't glottobank/Grambank strictly going to be non-public, a "behind the scenes" repos?

HedvigS commented 3 years ago

I'm not saying this isn't right, I'm just saying I don't understand it so I can't really evaluate what's going on.

I don't want to use the old way of doing things if that is now superseded. Please either help me do what I need to do now, or point me to instructions for how to go about doing it the new way. If that isn't possible, I guess I will go with plan B and check out an earlier version of the repos and do it the old way.

xrotwang commented 3 years ago

The essential part here is not the use of submodules, but the use of cldfbench to trigger the CLDF creation. This is useful, because

- that's how we do that for many more datasets, so more people in the DLCE IT group will know how it works,
- the cldfbench makecldf command includes metadata in a standardized way into the CLDF dataset, thus making the data more similar to other datasets.

Since these advantages only apply to the "creating CLDF for a proper release" use case, it would seem totally justified for you to stick with the old behaviour.

xrotwang commented 3 years ago

Oh, and using submodules seems just the most "natural" way to provide the "raw" data where cldfbench expects it - see https://github.com/cldf/cldfbench/#workflow .

HedvigS commented 3 years ago

Okay, right.

So, for now the best thing for the ms analysis and the wiki updates to the clld website is to use an older version of the repos?

xrotwang commented 3 years ago

Or you let me do it.

Btw.: Getting wiki updates into the clld website can only be done by me, anyway.
