grambank / pygrambank

Apache License 2.0
4 stars 1 forks source link

zip value table #96

Closed johenglisch closed 2 years ago

johenglisch commented 2 years ago

Companion PR to https://github.com/glottobank/grambank-cldf/pull/124 in a combined effort to fix https://github.com/glottobank/grambank-cldf/issues/123

xrotwang commented 2 years ago

If we passed args.writer in https://github.com/glottobank/grambank-cldf/blob/17cb9cc1b4563a2c9ffd4c867788859780c6a476/cldfbench_grambank.py#L32 and appended rows to args.writer.objects, we could delegate the Dataset.write call to cldfbench.

johenglisch commented 2 years ago

Hm, that feels odd because I just noticed today that pygrambank doesn't actually depend on cldfbench… It's probably not a big deal because cldf.create is only called from a bench but my gut still doesn't like it.

Come to think of it, we could also do the opposite: Have cldf.create return the dictionary of CLDF tables and let the cldfbench do the write call. That would feel mildly less coupled imo.

xrotwang commented 2 years ago

I think I was just too lazy to move the complete cldf.create function into the cmd_makecldf method (probably also because testing for code in a cldfbench isn't straightforward). But ideally, that's what we should do, i.e. the cldfbench would depend on pygrambank to programmatically access the raw data - and not the other way round.

johenglisch commented 2 years ago

Hm, I guess I would kind of lean to the opposite there as well:

In a perfect world, I would want all I/O (a.k.a. the stuff that's a PITA to test anyways) to happen directly in the CLDFbench and only pass generic Python data structures to the processing code inside pygrambank.cldf.

(This would mean splitting up the create function into smaller bits that can be called from a for sheet in original_sheets/*.csv loop.)

xrotwang commented 2 years ago

I guess that's what I meant: https://github.com/grambank/pygrambank/blob/53ab400a98c2dc2cb6242217cf9d498d85befc39/src/pygrambank/cldf.py#L102-L349 should be part of (not called from) cldfbench_grambank.py.

xrotwang commented 2 years ago

Then stuff like the add_provenance calls would become obsolete, too. So cldf.create as part of the cldfbench would be somewhat shorter, too.

xrotwang commented 2 years ago

Regarding testing: The idea here is that the only (or at least main) responsibility of a cldfbench.Dataset is creating the CLDF data. So most of the testing can be delegated to running cldf validate on the resulting dataset - or adding checks to something like https://github.com/cldf-datasets/imtvault/blob/main/test.py

johenglisch commented 2 years ago

So, to formulate a plan:

  1. I think we can just go and merge these PRs to achieve the immediate goal of getting the zipped value tables working.

  2. I should go through pygrambank.create and put everything into its proper place, so we can avoid having to fix issues with cross-package PRs like this in the future.

xrotwang commented 2 years ago

Yes, sounds good.