ropenscilabs / deposits

R Client for access to multiple data repository services
https://docs.ropensci.org/deposits/

Convert DCMI metadata to host-specific form prior to upload #39

Closed mpadge closed 1 year ago

mpadge commented 1 year ago

DONE:

TODO:

nothing else

mpadge commented 1 year ago

Current task:

Interesting and relevant blog post: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/, plus https://efdn.notion.site/Pilot-Study-1bf3e3be6bf34a2eb8156ddf98d3fa67

mpadge commented 1 year ago
* [ ]  Develop system for generalising the representation of metadata structures.

That is now JSON schemas, for which I made this "jsonschema" repo. It's around 100 times faster than ropensci/jsonvalidate, and more customisable and flexible, which is going to be important here. But it bundles a C++ library for JSON parsing that includes code which is not CRAN-compliant. In particular, it uses a lot of pragmas to suppress various kinds of compiler warnings, and such suppression is not accepted on CRAN. Yet removing those pragmas generates a heap of warnings, which is also not acceptable on CRAN.

So after all of that work, it might be necessary, at least for now, to revert to using jsonvalidate instead.

mpadge commented 1 year ago
* [ ]  Develop system for generalising translation between metadata structures

This seems like one of the best candidates: https://www.mdpi.com/2076-3417/11/24/11978, or alternatively https://github.com/marksparkza/json-translation-vocabulary. There are links from json-schema.org to "schema-to-schema" systems, but none of these currently look promising.

JSON Schemas with Semantic Annotations Supporting Data Translation

This is a promising effort, but is really geared towards translating values within items of JSON objects, rather than translating to potentially different object structures. All of their examples focus on translating units of measurement for numeric properties, with no indication anywhere that they conceived of translation between different structures. It might be possible to extend their system to include these kinds of structural translations, but with their suggested vocabulary restricted to properties and values, it would effectively amount to re-inventing an entire system using theirs as inspiration. And that seems more work than ought to be necessary here. Their key statement is:

> It was not considered the “object” property, since the annotation method is used for annotating data carrying nodes, similar to the previous works for XML Schema. It was also not considered upper-level elements because this may lead to problems if the structure of the ontology and the schema differs. Therefore, the complete semantics is defined on the leaf level, using expressive annotation paths and group identifiers.

Verdict

:heavy_multiplication_x:

json-translation-vocabulary

The vocabulary can be summarised as:

  1. Source - do not transform
  2. Concat - concatenate multiple array types
  3. Sep - separate a string into multiple strings
  4. Filter - applied to both strings and objects
  5. Cast - convert between base JSON types

Those are useful, but a lot of the operations needed here would then amount to complex Filter-and-Concat-type stuff, plus the need to apply conditional logic such as enums. It's also another case where the entities envisioned to be translated are individual values, so it still doesn't really envision structural translations.
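To make that limitation concrete, here is a minimal sketch of the five operations in Python. It is based only on the five-item summary above, not on the vocabulary's actual specification, so all function names, signatures, and the example field names are assumptions made for illustration.

```python
# Hypothetical Python stand-ins for the five summarised operations.

def source(value):
    """Source: pass the value through unchanged."""
    return value

def concat(*arrays):
    """Concat: join several arrays into one."""
    merged = []
    for array in arrays:
        merged.extend(array)
    return merged

def sep(text, separator=";"):
    """Sep: split one string into several, dropping empty pieces."""
    return [part.strip() for part in text.split(separator) if part.strip()]

def filter_keys(obj, keep):
    """Filter (object form): keep only the named properties."""
    return {key: value for key, value in obj.items() if key in keep}

def cast(value, to):
    """Cast: convert between (some) base JSON types."""
    converters = {"string": str, "integer": int, "number": float}
    return converters[to](value)

# Hypothetical DCMI-style input record.
dcmi = {"title": "My data", "creator": "A. Author; B. Author", "created": "2022"}

# Hypothetical host-specific form. The nesting under "metadata" and the
# wrapping of each name in its own object are structural steps that none
# of the five operations expresses; they are written by hand here.
host = {
    "metadata": {
        "title": source(dcmi["title"]),
        "creators": [{"name": name} for name in sep(dcmi["creator"])],
        "year": cast(dcmi["created"], "integer"),
    }
}
```

Each operation acts on a single value, while the change of shape has to be added around it, which is the gap described above.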

Verdict

:question: Use this as inspiration, especially as a benchmark for how to keep translation vocabularies simple, and build a more extensible system from that.

mpadge commented 1 year ago

The above commit uses the JSON translation files in inst/extdata/<service> to simplify metadata translation down to the single file committed there. That is only 150 lines, and will ultimately replace previous files:

so just over 1/3 the size, and far simpler to understand and maintain. Most translation logic has now been ported off to the external JSON files.

mpadge commented 1 year ago

All done now. The name of this issue is a misnomer, as most of the above commits served to restructure the entire way that both DCMI and scheme-specific metadata are represented, translated, and validated. It's now all done via the json schemas in inst/extdata, with the codebase being almost entirely general, and free of service-specific routines.
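As a rough illustration of why that keeps the code general, here is a sketch in Python of a translator driven entirely by an external specification. The specification format, the `path` and `wrap` keys, and the field names are all invented for this example; they are not the format of the files in inst/extdata, and the package itself is written in R.

```python
import json

# Hypothetical per-service translation spec, standing in for a JSON file.
# "path" says where a DCMI term lands in the host structure; "wrap" says
# that each item should be wrapped in an object under the given key.
SPEC = json.loads("""
{
  "title":       {"path": ["metadata", "title"]},
  "description": {"path": ["metadata", "description"]},
  "creator":     {"path": ["metadata", "creators"], "wrap": "name"}
}
""")

def translate(dcmi, spec):
    """Translate a DCMI-style dict using only the rules in `spec`."""
    out = {}
    for term, rule in spec.items():
        if term not in dcmi:
            continue  # term absent from the input: nothing to translate
        value = dcmi[term]
        if "wrap" in rule:
            items = value if isinstance(value, list) else [value]
            value = [{rule["wrap"]: item} for item in items]
        # Walk (and create) the nested objects named in the path.
        node = out
        for key in rule["path"][:-1]:
            node = node.setdefault(key, {})
        node[rule["path"][-1]] = value
    return out

result = translate(
    {"title": "My data", "creator": ["A. Author", "B. Author"], "rights": "CC0"},
    SPEC,
)
```

Supporting another service in this sketch would mean writing another specification rather than another function. Terms with no rule, such as `rights` here, are simply not carried over.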

One final

TODO DONE:

mpadge commented 1 year ago

One of the potential solutions that was investigated here was to use jq as the translation engine between the various JSON metadata structures, using some kind of translation vocabulary such as https://github.com/marksparkza/json-translation-vocabulary/blob/main/vocabulary.json. This is possible, but would require lots of shell scripts like this:

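# Check whether metadata.json has a non-null "description" field
HAS_DESC=$(jq '.description // empty' metadata.json)
if [[ -n "$HAS_DESC" ]]; then
  # -r strips the JSON quotes from the string value
  DESC=$(jq -r '.description' metadata.json)
  TXT="## Description\n${DESC}\n"
fi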

That would have been truly independent of implementation language, but would also have ended up more complex and harder to maintain than the solution ultimately adopted, which was to code equivalent routines in R.
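For comparison, the same conditional step written as a function in a general-purpose language is short. The sketch below uses Python purely for illustration (the package's actual routines are in R), and the function name is hypothetical.

```python
import json

def description_section(metadata):
    """Return a Markdown description section, or "" if there is none.

    Mirrors jq's `.description // empty`, which yields nothing when the
    field is missing, null, or false.
    """
    desc = metadata.get("description")
    if desc is None or desc is False:
        return ""
    return f"## Description\n{desc}\n"

with_desc = description_section(json.loads('{"description": "Rainfall data"}'))
without_desc = description_section(json.loads('{"title": "No description"}'))
```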