`Extractor` specifying type of output (meta)-data

marda-alliance / metadata_extractors_schema

Archive of MaRDA Metadata Extractors Schema. See datatractor/schema for the current repository.

https://github.com/datatractor/schema

MIT License

`Extractor` specifying type of output (meta)-data #26

Closed PeterKraus closed 11 months ago

PeterKraus commented 1 year ago

Following up from the ELN Roundtable, where an important point was raised: there should be a way to force the Extractor to return only metadata.

As I understand it, this breaks down into three parts:

- allowing downstream to filter Extractors by whether they return meta-only or meta+data,
- allowing Extractor writers to specify both meta-only and meta+data usages in a simple way,
- defining what the meta part really is, which is most likely out of scope for this WG.

The short and simple way to do this would be to extend the usage schema to indicate whether a given entry returns meta-only or meta+data. However, this might require two usages for each Extractor.
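As a rough sketch of what that extension could look like from the downstream side, assuming each usage entry carries a label for what it returns: the `OutputKind` labels, the `Usage` fields, and the example commands below are all hypothetical and are not taken from the actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class OutputKind(Enum):
    """Hypothetical labels for what a usage entry returns."""
    META_ONLY = "meta-only"
    META_AND_DATA = "meta+data"


@dataclass(frozen=True)
class Usage:
    """One usage entry of an Extractor (illustrative fields, not the real schema)."""
    extractor_id: str
    command: str
    output: OutputKind


def filter_usages(usages, kind):
    """Return only the usage entries that produce the requested kind of output."""
    return [u for u in usages if u.output is kind]


# Made-up entries: the first extractor needs two usages to cover both cases.
usages = [
    Usage("example-extractor", "example-extract --meta-only {input}", OutputKind.META_ONLY),
    Usage("example-extractor", "example-extract {input}", OutputKind.META_AND_DATA),
    Usage("other-extractor", "other-extract {input}", OutputKind.META_AND_DATA),
]

meta_only = filter_usages(usages, OutputKind.META_ONLY)
```

The first extractor appears twice, once per output kind, which is the duplication mentioned above.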

Tagging @steffenbrinckmann

ml-evs commented 11 months ago

Closed by #33

ml-evs commented 8 months ago

Riffing on this: now that we have the API harness that can deal with pandas/xarray objects, it would be nice to allow extractors to specify generic packages that must be installed to understand the outputs. Currently we just have a couple of defined "common" formats, but being able to say that you need pandas/numpy/xarray or w/e would probably be helpful. This could also be extended to cover the idea of returning raw JSON.
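A minimal sketch of how a harness might check such a declaration, assuming the extractor definition lists top-level package names under a key like `output_requires` (that key and the example definition are hypothetical). The standard-library `json` module stands in for pandas/xarray here so the snippet runs without extra installs.

```python
from importlib.util import find_spec


def missing_packages(required):
    """Return the top-level package names in `required` that cannot be found.

    Only top-level names are handled; dotted names would make find_spec
    import the parent package first.
    """
    return [name for name in required if find_spec(name) is None]


# Hypothetical extractor definition; "output_requires" is an assumed key.
extractor = {
    "id": "example-extractor",
    "output_requires": ["json", "a_package_that_is_not_installed"],
}

missing = missing_packages(extractor["output_requires"])
if missing:
    print(f"Cannot interpret outputs of {extractor['id']}; missing: {missing}")
```

A downstream tool could use a check like this to warn early, or to fall back to a plainer output format when the richer one cannot be loaded.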

ml-evs commented 6 months ago

Just thinking about this in terms of concrete next steps for schemas to support/encourage: one set that might map quite nicely onto some of our existing filetypes is the Allotrope simple models for certain techniques (which are semantic JSON schemas), e.g., powder XRD: https://gitlab.com/allotrope-public/asm/-/blob/main/json-schemas/adm/x-ray-powder-diffraction/REC/2021/12/x-ray-powder-diffraction.embed.schema.json I'm not sure how widely these are used in academia atm, but there's definitely industry buy-in. There are lots of gaps (e.g., this XRD schema doesn't define peaks...), but it might be a starting point (TGA, for example, has peaks defined: https://gitlab.com/allotrope-public/asm/-/blob/main/json-schemas/adm/thermogravimetric-analysis/REC/2021/12/thermogravimetric-analysis.schema.json)