`Extractor:` some `attributes` should be required.

PeterKraus commented 1 year ago

The following Extractor:attributes should be set to be compulsory in the schema:

* `supported_filetypes`: obviously, as otherwise it's not an Extractor. The length of the list should be >= 1.
* `license`: for now, allowing SPDX and URI/URL. I think for the eventual frontend, everything that's not an SPDX identifier might get a "custom" tag, to allow easy filtering by license.
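
As a rough illustration of the two proposed rules, here is a minimal Python sketch, not the schema's actual validation. It assumes a registry entry is available as a plain `dict`; the names `validate_extractor`, `license_tag`, `is_url`, and `KNOWN_SPDX_IDS` are hypothetical, and the SPDX set is a tiny stand-in for the full SPDX license list.

```python
from urllib.parse import urlparse

# Hypothetical stand-in: a real check would consult the full SPDX license list.
KNOWN_SPDX_IDS = {"MIT", "Apache-2.0", "BSD-3-Clause", "GPL-3.0-only"}


def is_url(value: str) -> bool:
    """Return True if value looks like an http(s) URL with a host."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def license_tag(license_value: str) -> str:
    """Tag anything that is not a known SPDX identifier as 'custom'."""
    return "spdx" if license_value in KNOWN_SPDX_IDS else "custom"


def validate_extractor(entry: dict) -> list[str]:
    """Collect error messages for the two proposed required attributes."""
    errors = []

    # supported_filetypes: required, and the list must have length >= 1.
    filetypes = entry.get("supported_filetypes")
    if not isinstance(filetypes, list) or len(filetypes) < 1:
        errors.append("supported_filetypes must be a list with at least one entry")

    # license: required, and must be an SPDX identifier or a URI/URL.
    license_value = entry.get("license")
    if not isinstance(license_value, str) or not license_value:
        errors.append("license is required")
    elif license_value not in KNOWN_SPDX_IDS and not is_url(license_value):
        errors.append("license must be an SPDX identifier or a URL")

    return errors
```

An entry such as `{"supported_filetypes": ["example-filetype"], "license": "MIT"}` would pass with an empty error list, while an empty `dict` would fail both checks. The "custom" tag is only computed here for filtering, it does not make an entry invalid.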

The `source_repository` and especially `usage` entries are also good candidates for required attributes. However, requiring the first one would close the project to closed-source projects, and we might figure out a cleaner installation mechanism in the API than building from source (see https://github.com/marda-alliance/metadata_extractors_api/pull/5). Similarly, `usage` is a key attribute, but its syntax is not yet 100% defined (see #19).

ml-evs commented 1 year ago

> The following Extractor:attributes should be set to be compulsory in the schema:
> * `supported_filetypes`: obviously, as otherwise it's not an Extractor. length of the list should be >= 1

Agreed, though we might get an awkward situation where an extractor wants to report that it supports loads of file types that we don't have in the registry, in which case they might leave the field empty rather than adding arbitrary strings?

> * `license`: for now, allowing SPDX and URI/URL. I think for the eventual frontend, everything that's not a SPDX might get a "custom" tag, to allow easy filtering by license.

:+1:

> The source_repository and especially usage entries are also good candidates for required attributes. However, requiring the first one would close the project to closed-source projects,

The idea of executing some random closed-source code is a bit scary, but perhaps MaRDA could provide a nice sandbox for closed-source projects. For example, if they provide a Docker image with a single prebuilt executable, the MaRDA runner should somehow build and run it in a context where it has, e.g., no access to the internet or the user's file system. This feels like shaky ground though. (You could make the same arguments about unvetted open-source code, of course, in terms of data being exfiltrated.)

> usage is a key attribute, but its syntax is not yet 100% defined (see #19).

I imagined that this "semi-autonomous" usage would fit in as another feature level? E.g., you can provide an extractor that has no `usage`, at which point the registry entry is just an "advert". I think this would be fine (and would give us a much more populated registry, hopefully...).
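
To make the feature-level idea concrete, a minimal Python sketch follows. It assumes a registry entry is a plain `dict`; the function name `feature_level` and the label `"runnable"` are hypothetical, and only the "advert" label comes from the comment above.

```python
def feature_level(entry: dict) -> str:
    """Return a hypothetical feature-level label for a registry entry.

    An entry with no (or empty) `usage` is only an "advert"; one that
    defines `usage` is labelled "runnable" here for illustration.
    """
    usage = entry.get("usage")
    return "runnable" if usage else "advert"
```

Under this sketch, an entry without `usage` would still be accepted into the registry, just at the lowest level.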

PeterKraus commented 1 year ago

I don't think we should exclude proprietary software that can be installed in a reproducible way (e.g. docker / snap or whatever is fashionable these days) without having a good reason to do so. The installation should eventually follow a similar syntax to `usage`, whatever we settle on, with the `source_repository` being really just metadata.

I think having a `usage` keyword might be a feature level on its own. I'll mention it in that issue.

marda-alliance / metadata_extractors_schema

`Extractor:` some `attributes` should be required. #18