marda-alliance / metadata_extractors_registry

Archive. See Datatractor Yard, below:
https://github.com/datatractor/yard
MIT License

`FileType`: Include example files for the filetype. #9

Closed PeterKraus closed 1 year ago

PeterKraus commented 1 year ago

The most important thing in our "schema" was perhaps the link to an example file (the instrument name and file extension alone often don't describe much). Perhaps one could also consider adding it here.

To make this possible, I once started a "chemical files registry" here, where I also use a YAML schema similar to yours: https://github.com/kjappelbaum/chemical-files-registry/blob/master/fileDescriptions/analyticalMethods/thermogravimetricAnalysis/ta-txt/description.yml.

I didn't have any time to work on this, but the idea was to collect example files and link to them (and the filetype schema) from the parser registry.

_Originally posted by @kjappelbaum in https://github.com/marda-alliance/metadata_extractors_schema/issues/2#issue-1501880835_

ml-evs commented 1 year ago

Circling back around to this, in terms of design/deployment we could just create another GitHub repo with LFS enabled, dump some files in there and then use the raw GitHub links in the registry. The idea is that extractors can be validated against the example files in the CI (eventually including validation against the schema returned by the extractor).
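As a rough illustration of the "raw GitHub links" idea, the sketch below builds such a link from its parts. The `raw.githubusercontent.com/<owner>/<repo>/<ref>/<path>` layout is the usual one for raw file links; the helper name `raw_github_url` and the owner, repository and path in the usage note are hypothetical placeholders, and whether LFS-tracked files resolve to their full contents through such links is not checked here.

```python
from urllib.parse import quote


def raw_github_url(owner: str, repo: str, ref: str, path: str) -> str:
    """Build a raw.githubusercontent.com link for one file in a repository.

    `ref` is a branch, tag or commit; pinning a commit keeps the link stable.
    """
    # Percent-encode spaces and similar characters, but keep "/" as separator.
    safe_ref = quote(ref, safe="/")
    safe_path = quote(path.lstrip("/"), safe="/")
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{safe_ref}/{safe_path}"
```

For example, `raw_github_url("example-org", "example-files", "main", "ta-txt/sample 1.txt")` yields a link ending in `/main/ta-txt/sample%201.txt`, which a registry entry could then store.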

There is perhaps a slight semantic difference between example files (demonstrative, pedagogical and not exhaustive) and test cases (awkward on purpose, ideally exhaustive), though, and we should clarify this before asking for example files. This matters especially where an extractor can parse only a subset of a file type's features, and maybe errors or fails if other features are present (we have a place for extractors to provide these caveats, but automatically validating them will be tricky).

PeterKraus commented 1 year ago

As for your first point, I think the registry might be the correct place to store all of the example/test files. When I add a new Filetype, it would be helpful to be able to do it in a single PR (i.e. also attach the example files), rather than having to do it in two places. I can have a look at how LFS works next week; I have never used it before.

As for the semantic difference, I think it might be helpful to sort the "test files" into folders (one per Filetype), and the CI in the api repo can be set up to run all test files in the Filetype folder for each Filetype supported by the Extractor. We can handle the cases of an Extractor not supporting a certain test file by marking those individual cases as expected failures (xfail), and figure out a more robust solution later.
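A minimal sketch of the folder-per-Filetype layout described above, using only the standard library. The directory layout, the `FileCase` tuple, the `known_failures` set and the `extract` callable are all assumptions made for illustration; they are not taken from the registry or the api repo. In a real CI run the collected cases could instead be fed to a test runner's parametrisation and expected-failure markers.

```python
from pathlib import Path
from typing import Callable, Iterable, NamedTuple


class FileCase(NamedTuple):
    filetype: str         # name of the Filetype folder (hypothetical layout)
    path: Path            # test file inside that folder
    expect_failure: bool  # True if the extractor is known not to support it


def collect_cases(root, supported: Iterable[str], known_failures=frozenset()):
    """Gather every file under <root>/<filetype>/ for each supported Filetype."""
    cases = []
    for filetype in sorted(supported):
        folder = Path(root) / filetype
        if not folder.is_dir():
            continue  # no test files registered for this Filetype yet
        for path in sorted(p for p in folder.iterdir() if p.is_file()):
            key = f"{filetype}/{path.name}"
            cases.append(FileCase(filetype, path, key in known_failures))
    return cases


def run_cases(cases, extract: Callable[[Path], object]):
    """Run the extractor on each case and return only the surprising outcomes."""
    problems = []
    for case in cases:
        try:
            extract(case.path)
            passed = True
        except Exception:
            passed = False
        if passed and case.expect_failure:
            problems.append((case, "unexpectedly passed"))
        elif not passed and not case.expect_failure:
            problems.append((case, "failed"))
    return problems
```

Keeping the expected failures in an explicit set means an unsupported test file is reported only when its status changes, which matches the "x-fail now, more robust solution later" approach.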

PeterKraus commented 1 year ago

Closed in #30.