marda-alliance / metadata_extractors_schema

Archive of MaRDA Metadata Extractors Schema. See datatractor/schema for the current repository.
https://github.com/datatractor/schema
MIT License

`Extractor:attributes:usage` needs documentation and schema #19

Closed PeterKraus closed 1 year ago

PeterKraus commented 1 year ago

During the last office hours, we have settled on a first draft for the two variants of usage:

This functionality is currently:

ml-evs commented 1 year ago

Do we prefer a single string e.g., cli: executable {{ file_type }} {{ file_path }} {{ output_file }} or

usage:
    - method: cli
      command: executable {{ file_type }} {{ file_path }} {{ output_file }}
    - method: python
      command: project.module.function( {{ file_type }}, {{ file_path }} )

?

I can see benefits to both. The former would allow for structural validation via regexps directly (might be a slightly tasty regexp with conditionals...); the latter lets you filter usage more easily by method without needing to parse the string. It's not yet clear to me how to define the metaschema for {{ file_path }} etc. in our current approach. Perhaps we need a regexp that looks inside {{ }} and checks that the value is one of an enumerated list? Then we just bung all of that in the description of the field?
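As a rough illustration of the "look inside {{ }} and check against an enumerated list" idea, here is a minimal Python sketch. The allowed names are an assumption taken only from the examples in this thread, and `unknown_placeholders` is a hypothetical helper, not part of any schema tooling.

```python
import re

# Assumption: the enumerated list contains only the names used in this thread.
ALLOWED_PLACEHOLDERS = {"file_type", "file_path", "output_file"}

# Captures whatever sits between a "{{" and the next "}}", minus padding spaces.
PLACEHOLDER_RE = re.compile(r"\{\{\s*(.*?)\s*\}\}")


def unknown_placeholders(command: str) -> list[str]:
    """Return every placeholder in `command` that is not in the allowed list."""
    return [
        name
        for name in PLACEHOLDER_RE.findall(command)
        if name not in ALLOWED_PLACEHOLDERS
    ]
```

An empty result means every placeholder is recognised. Because the capture runs up to the next `}}`, an unbalanced placeholder is swallowed into a longer, unrecognised name and gets reported as well.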

ml-evs commented 1 year ago

Either way, I think we can make use of LinkML's structured patterns here (although the braces will be confusing), so we could literally write (in the Python case) a pattern like: ^python:{project}.{module}.{function}({arguments})$ with separate sub-patterns for project (I guess actually package?), module and function that ban dodgy chars and allow arbitrary nesting within module etc. I'm not sure how these get converted to pydantic models but can investigate depending on what we decide above.
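To make the sub-pattern idea concrete, here is a sketch using plain Python `re` composition rather than LinkML itself. It assumes the prefix is meant to be `python:`, that each name is an ordinary Python identifier, and that the argument list contains no nested parentheses. All constant names are hypothetical.

```python
import re

# Assumption: each name is a plain Python identifier (no "dodgy" characters).
IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
PACKAGE = IDENT
MODULE = rf"{IDENT}(?:\.{IDENT})*"  # allows arbitrary nesting: a.b.c
FUNCTION = IDENT
ARGUMENTS = r"[^()]*"  # assumption: no nested parentheses in the arguments

PYTHON_USAGE_RE = re.compile(
    rf"^python:(?P<package>{PACKAGE})\.(?P<module>{MODULE})"
    rf"\.(?P<function>{FUNCTION})\((?P<arguments>{ARGUMENTS})\)$"
)

match = PYTHON_USAGE_RE.match(
    "python:project.module.function({{ file_type }}, {{ file_path }})"
)
```

The named groups give back the individual parts, so the same expression could serve both for validation and for pulling out the function to import.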

PeterKraus commented 1 year ago

I think the dict(method=..., command=...) syntax is cleaner than shoving it all into one string.
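For illustration, a short sketch of how the dict-style entries could be filtered by method without parsing any strings. The entries mirror the examples earlier in this thread, and `commands_for` is a hypothetical helper.

```python
# Usage entries in the dict(method=..., command=...) form, as plain Python data.
usage = [
    {
        "method": "cli",
        "command": "executable {{ file_type }} {{ file_path }} {{ output_file }}",
    },
    {
        "method": "python",
        "command": "project.module.function({{ file_type }}, {{ file_path }})",
    },
]


def commands_for(entries: list[dict], method: str) -> list[str]:
    """Return the command templates registered for the given method."""
    return [entry["command"] for entry in entries if entry["method"] == method]


cli_commands = commands_for(usage, "cli")
```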

Also, I'm not super set on {{ }} parameters. We might as well go with $-prefixed strings. Currently, I can think of the following 4 functions we should support, in light of the feature levels we're discussing (#13):

PeterKraus commented 1 year ago

Following office hours on the 5th of May, we've decided to use the following templating strings for version 0.2:

This also implies a new Extractor:attribute, supported_output_filetypes, which is intended for fairly generic types (json, NetCDF, pickle, csv, text...).