Closed PeterKraus closed 1 year ago
Do we prefer a single string, e.g. cli: executable {{ file_type }} {{ file_path }} {{ output_file }}
or
usage:
  - method: cli
    command: executable {{ file_type }} {{ file_path }} {{ output_file }}
  - method: python
    command: project.module.function( {{ file_type }}, {{ file_path }} )
?
I can see benefits to both. The former would allow for structural validation via regexps directly (might be a slightly tasty regexp with conditionals...), the latter lets you filter usage more easily by method without needing to parse the string. It's not yet clear to me how to define the metaschema for {{ file_path }} etc. in our current approach; perhaps we have to have a regexp that looks inside {{ }} and checks that the value is one of an enumerated list? Then we just bung all of that in the description of the field?
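To make the "regexp that looks inside {{ }}" idea concrete, here is a minimal sketch in plain Python. The names ALLOWED_PLACEHOLDERS and check_placeholders are hypothetical, and the enumerated list is just the three placeholders used in the examples above, not a settled set.

```python
import re

# Hypothetical enumerated list of allowed placeholder names, taken from the
# examples in this thread; the final list is still under discussion.
ALLOWED_PLACEHOLDERS = {"file_type", "file_path", "output_file"}

# Matches a well-formed {{ name }} placeholder and captures the name.
PLACEHOLDER_RE = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def check_placeholders(command: str) -> list[str]:
    """Return a list of problems found in the placeholders of `command`."""
    problems = []
    for name in PLACEHOLDER_RE.findall(command):
        if name not in ALLOWED_PLACEHOLDERS:
            problems.append(f"unknown placeholder: {name}")
    # Any brace left over after removing well-formed placeholders is
    # malformed, e.g. "{{ file_type }" with a missing closing brace.
    leftover = PLACEHOLDER_RE.sub("", command)
    if "{" in leftover or "}" in leftover:
        problems.append("malformed placeholder braces")
    return problems
```

An empty list means the command passed; the same check works for either the single-string or the method/command form, since it only looks at the command part.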
Either way, I think we can make use of LinkML's structured patterns here (although the braces will be confusing), so we could literally write (in the Python case) a pattern like: ^python:{project}.{module}.{function}({arguments})$ with separate sub-patterns for project (I guess actually package?), module and function that ban dodgy chars and allow arbitrary nesting within module etc. I'm not sure how these get converted to pydantic models but can investigate depending on what we decide above.
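As a rough illustration of the sub-pattern idea, the sketch below composes one regexp from named pieces using plain Python string formatting. It only imitates the interpolation step and does not use LinkML itself; the sub-pattern definitions (what counts as a "dodgy" character, no nested parentheses in the arguments) are assumptions, and it validates only the command part, without the python: prefix.

```python
import re

# Hypothetical sub-patterns mirroring the {package}/{module}/{function}
# pieces discussed above; the character rules are assumptions.
IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
SUBPATTERNS = {
    "package": IDENT,
    "module": rf"{IDENT}(?:\.{IDENT})*",  # allows arbitrary nesting
    "function": IDENT,
    "arguments": r"[^()]*",  # anything except parentheses
}

# Interpolate the sub-patterns by hand; literal dots and parentheses in the
# outer pattern are escaped so they are matched verbatim.
PYTHON_COMMAND_RE = re.compile(
    r"^{package}\.{module}\.{function}\({arguments}\)$".format(**SUBPATTERNS)
)


def is_valid_python_command(command: str) -> bool:
    """Check that `command` looks like package.module.function(arguments)."""
    return PYTHON_COMMAND_RE.match(command) is not None
```

Note that this version requires at least one module level between package and function, so a function living directly in the top-level package would be rejected; whether that is desirable is an open question.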
I think the dict(method=..., command=...) syntax is cleaner than shoving it all into one string.
Also, I'm not super set on {{ }} parameters. We might as well go with $-prefixed strings. Currently, I can think of the following 4 parameters we should support, in light of the feature levels we're discussing (#13):
- {{ file_type }}, but might as well be $filetype or $input_filetype; linked to Extractor:attributes:supported_filetypes.
- {{ file_path }}, but might as well be $input_path, $input_file or something like that.
- {{ output_file }}, which is already inconsistent with the input path specification; something like $output_path or $output_file should work well. Implies Extractor:attributes:output_format is defined, so that we know how to at least open the provided file.
- A selector for cases where an Extractor can provide several different outputs, such as what @wardlt showed Scythe is already doing; also related to Extractor:attributes:output_format. Should be consistent with the input specification, or at least prefixed, e.g. $output_filetype.
Following office hours on the 5th of May, we've decided to use the following templating strings for version 0.2:
- {{ input_type }} specifies the input FileType, which must be one of the supported_filetypes of this Extractor.
- {{ input_path }} specifies the location of the resource to be extracted; language here is intentionally vague to allow parsing of files, folders, URLs etc. at a later stage.
- {{ output_path }} specifies the location where the extracted (meta)data is saved (on disk, for now) following a successful extraction.
- {{ output_type }}: for Extractors supporting multiple output filetypes (in supported_output_filetypes), this templating string allows for choosing one of them.
This also implies a new Extractor:attribute of supported_output_filetypes, which is intended for rather "generic types" (json, NetCDF, pickle, csv, text...).
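For illustration, filling in the four version 0.2 templating strings could look roughly like the sketch below. The function render_usage and its signature are hypothetical and not part of the registry code; the only rules it enforces are the ones stated above, namely that input_type must be in supported_filetypes and output_type in supported_output_filetypes.

```python
import re

# The four templating strings agreed on for version 0.2.
TEMPLATE_NAMES = {"input_type", "input_path", "output_path", "output_type"}
PLACEHOLDER_RE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_usage(template, values, supported_filetypes,
                 supported_output_filetypes=()):
    """Fill the {{ ... }} templating strings in `template` from `values`."""
    # input_type must be one of the Extractor's supported_filetypes.
    if "input_type" in values and values["input_type"] not in supported_filetypes:
        raise ValueError(f"unsupported input_type: {values['input_type']}")
    # output_type must be one of the Extractor's supported_output_filetypes.
    if ("output_type" in values
            and values["output_type"] not in supported_output_filetypes):
        raise ValueError(f"unsupported output_type: {values['output_type']}")

    def substitute(match):
        name = match.group(1)
        if name not in TEMPLATE_NAMES:
            raise ValueError(f"unknown templating string: {name}")
        if name not in values:
            raise ValueError(f"no value given for: {name}")
        return str(values[name])

    return PLACEHOLDER_RE.sub(substitute, template)
```

This sketch does no shell quoting, so paths containing spaces would need extra care before the rendered string is passed to a shell.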
During the last office hours, we have settled on a first draft for the two variants of usage:
- usage: "cli: executable {{ file_type }} {{ file_path }} {{ output_file }}" for executing an Extractor from shell, with the obvious meanings of the parameters, returning the data as a file, and
- usage: "python: project.module.function( {{ file_type }}, {{ file_path }} )" for executing an Extractor from python, returning the data as an in-memory object.
This functionality is currently:
- undocumented: there is no guidance in the Extractor schema on how to compose these usage strings,
- unvalidated: the Registry machinery does not have a way to validate the usage string for syntax-correctness,
- inconsistent: we should rename the {{ file_path }} pattern to {{ input_file }} for consistency with {{ output_file }}.
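As a starting point for the missing syntax check, a single-string usage could be split into its method and command parts as sketched below. The names Usage, KNOWN_METHODS and parse_usage are hypothetical; the only assumption taken from the draft above is that a usage string has the form "method: command" with method being cli or python.

```python
from typing import NamedTuple


class Usage(NamedTuple):
    method: str
    command: str


# The two variants from the first draft above.
KNOWN_METHODS = ("cli", "python")


def parse_usage(usage: str) -> Usage:
    """Split a 'method: command' usage string and reject unknown methods."""
    # Split on the first colon only, so colons inside the command survive.
    method, sep, command = usage.partition(":")
    method, command = method.strip(), command.strip()
    if not sep or method not in KNOWN_METHODS:
        raise ValueError(f"usage must start with one of {KNOWN_METHODS}: {usage!r}")
    if not command:
        raise ValueError(f"usage has an empty command: {usage!r}")
    return Usage(method, command)
```

The command part returned here could then be checked further, for example for well-formed {{ }} patterns.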