`Extractor:attributes:usage` needs documentation and schema

PeterKraus commented 1 year ago

During the last office hours, we have settled on a first draft for the two variants of usage:

usage: "cli: executable {{ file_type }} {{ file_path }} {{ output_file }}" for executing an Extractor from the shell, with the obvious meanings of the parameters, returning the data as a file, and
usage: "python: project.module.function( {{ file_type }}, {{ file_path }} )" for executing an Extractor from Python, returning the data as an in-memory object.

This functionality is currently:

undocumented: there is no explanation in the Extractor schema of how to compose these usage strings,
unvalidated: the Registry machinery has no way to validate the syntax of the usage string,
inconsistent: I would suggest renaming the {{ file_path }} pattern to {{ input_file }} for consistency with {{ output_file }}.

ml-evs commented 1 year ago

Do we prefer a single string e.g., cli: executable {{ file_type }} {{ file_path }} {{ output_file }} or

usage:
    - method: cli
      command: executable {{ file_type }} {{ file_path }} {{ output_file }}
    - method: python
      command: project.module.function( {{ file_type }}, {{ file_path }} )

?

I can see benefits to both. The former would allow for structural validation via regexps directly (might be a slightly tasty regexp with conditionals...), the latter lets you filter usage more easily by method without needing to parse the string. It's not yet clear to me how to define the metaschema for {{ file_path }} etc. in our current approach. Perhaps we need a regexp that looks inside {{ }} and checks that the value is one of an enumerated list? Then we just bung all of that in the description of the field?
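
As a rough illustration of the "look inside {{ }}" idea, here is a minimal Python sketch. The allowed names are simply the three placeholders from the first draft above, and the helper names (`unknown_placeholders`, `has_unbalanced_braces`) are hypothetical; none of this exists in the Registry code.

```python
import re

# Assumption: the enumerated list is the three placeholders from the first draft.
ALLOWED_PLACEHOLDERS = {"file_type", "file_path", "output_file"}

# Matches "{{ name }}", with optional whitespace inside the braces.
PLACEHOLDER_RE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def unknown_placeholders(command: str) -> list[str]:
    """Return placeholder names used in `command` that are not in the allowed list."""
    return [
        name
        for name in PLACEHOLDER_RE.findall(command)
        if name not in ALLOWED_PLACEHOLDERS
    ]


def has_unbalanced_braces(command: str) -> bool:
    """Return True if any braces remain after removing well-formed placeholders."""
    leftover = PLACEHOLDER_RE.sub("", command)
    return "{" in leftover or "}" in leftover
```

The second helper would catch a command such as `f( {{ file_type }, {{ file_path }} )`, where one placeholder is missing a closing brace.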

ml-evs commented 1 year ago

Either way, I think we can make use of LinkML's Structured Patterns here (although the braces will be confusing), so we could literally write (in the Python case) a pattern like: ^python:{project}.{module}.{function}({arguments})$ with separate sub-patterns for project (I guess actually package?), module and function that ban dodgy chars and allow arbitrary nesting within module etc. I'm not sure how these get converted to pydantic models but can investigate depending on what we decide above.
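
To make the sub-pattern idea concrete, here is a sketch in plain Python `re` rather than LinkML itself. The sub-pattern names and the exact character rules are assumptions for illustration; how LinkML would express or convert them is not covered here.

```python
import re

# Assumption: packages, modules and functions are plain Python identifiers.
IDENTIFIER = r"[A-Za-z_][A-Za-z0-9_]*"

# Hypothetical sub-patterns, interpolated into the full pattern below.
SUBPATTERNS = {
    "package": IDENTIFIER,
    "module": rf"{IDENTIFIER}(?:\.{IDENTIFIER})*",  # allows arbitrary nesting
    "function": IDENTIFIER,
    "placeholder": r"\{\{ \w+ \}\}",
}

PYTHON_USAGE_RE = re.compile(
    r"^python: {package}\.{module}\.{function}"
    r"\(\s*{placeholder}(?:\s*,\s*{placeholder})*\s*\)$".format(**SUBPATTERNS)
)


def is_valid_python_usage(usage: str) -> bool:
    """Return True if `usage` matches the composed pattern."""
    return PYTHON_USAGE_RE.match(usage) is not None
```

With these rules, `python: project.module.function( {{ file_type }}, {{ file_path }} )` passes, while a string with a stray `;` in the function name or a half-closed placeholder does not.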

PeterKraus commented 1 year ago

I think the dict(method=..., command=...) syntax is cleaner than shoving it all into one string.

Also, I'm not super set on {{ }} parameters. We might as well go with $-prefixed strings. Currently, I can think of the following 4 functions we should support, in light of the feature levels we're discussing (#13):

Input filetype. Currently {{ file_type }}, but might as well be $filetype or $input_filetype; linked to Extractor:attributes:supported_filetypes
Input path. Currently {{ file_path }}, but might as well be $input_path, $input_file or something like that.
Output path. Currently {{ output_file }}, which is already inconsistent with the input path specification; something like $output_path or $output_file should work well. Implies Extractor:attributes:output_format is defined, so that we know how to at least open the provided file.
Output filetype. This can be useful if an Extractor can provide several different outputs, such as what @wardlt showed Scythe is already doing; also related to Extractor:attributes:output_format. Should be consistent with the input specification, or at least prefixed, e.g. $output_filetype.
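
For what it's worth, $-prefixed parameters map directly onto `string.Template` from the Python standard library. A small sketch, using the candidate names from the list above; the command string, the file names and the `placeholder_names` helper are made up for illustration.

```python
from string import Template

# Hypothetical usage command written with $-prefixed parameters.
COMMAND = "executable $input_filetype $input_path $output_path"


def placeholder_names(command: str) -> list[str]:
    """List the $-prefixed names in `command`, in order of appearance."""
    names = []
    for match in Template.pattern.finditer(command):
        name = match.group("named") or match.group("braced")
        if name:
            names.append(name)
    return names


# substitute() raises KeyError if any parameter is left without a value.
rendered = Template(COMMAND).substitute(
    input_filetype="csv",
    input_path="data/in.csv",
    output_path="out.json",
)
```

One trade-off: `$` is also meaningful to the shell, so a cli command would have to be rendered before it is handed to a shell.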

PeterKraus commented 1 year ago

Following office hours on the 5th of May, we've decided to use the following templating strings for version 0.2:

{{ input_type }} specifies input FileType, which must be one of the supported_filetypes of this Extractor
{{ input_path }} specifies the location of the resource to be extracted; language here is intentionally vague to allow parsing of files, folders, URLs etc. at a later stage
{{ output_path }} specifies the location where the extracted (meta)-data is saved (on disk, for now) following a successful extraction
{{ output_type }} specifies the output filetype, for Extractors supporting multiple output filetypes (listed in supported_output_filetypes)

This also implies a new Extractor:attribute of supported_output_filetypes, which is intended for rather "generic types" (json, NetCDF, pickle, csv, text...)
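
A minimal sketch of how a consumer might fill in the four v0.2 templating strings and enforce the constraints described above. The function name `render_usage`, its signature and the filetype names in the usage note are assumptions, not part of the schema or the Registry.

```python
import re

PLACEHOLDER_RE = re.compile(r"\{\{\s*(\w+)\s*\}\}")

# The four templating strings agreed for version 0.2.
V02_PLACEHOLDERS = {"input_type", "input_path", "output_path", "output_type"}


def render_usage(command, values, supported_filetypes, supported_output_filetypes=()):
    """Fill the templating strings in `command`; raise ValueError on bad input."""
    used = set(PLACEHOLDER_RE.findall(command))
    unknown = used - V02_PLACEHOLDERS
    if unknown:
        raise ValueError(f"unknown templating strings: {sorted(unknown)}")
    missing = used - set(values)
    if missing:
        raise ValueError(f"no value given for: {sorted(missing)}")
    # input_type must be one of the Extractor's supported_filetypes.
    if "input_type" in used and values["input_type"] not in supported_filetypes:
        raise ValueError(f"unsupported input type: {values['input_type']!r}")
    # output_type must be one of the Extractor's supported_output_filetypes.
    if "output_type" in used and values["output_type"] not in supported_output_filetypes:
        raise ValueError(f"unsupported output type: {values['output_type']!r}")
    return PLACEHOLDER_RE.sub(lambda m: str(values[m.group(1)]), command)
```

For example, with a made-up filetype `example-type`, rendering `executable {{ input_type }} {{ input_path }} {{ output_path }}` with `input_path="a.dat"` and `output_path="a.json"` gives `executable example-type a.dat a.json`. The sketch does no shell quoting, so paths containing spaces would need extra care.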

marda-alliance / metadata_extractors_schema

`Extractor:attributes:usage` needs documentation and schema #19