marda-alliance / metadata_extractors_schema

Archive of MaRDA Metadata Extractors Schema. See datatractor/schema for the current repository.
https://github.com/datatractor/schema
MIT License

Schema hints for file type detection #45

Closed ml-evs closed 4 months ago

ml-evs commented 10 months ago

This is my use case. If someone uploads an arbitrary file to my ELN and I have a whole registry of tools to process it, the ELN still needs to figure out which tool to use. Identifying the FileType would give you that connection. Otherwise, I need to rely on the source (e.g. the user) to tell me the type.

To apply a tool, the ELN needs to figure out the FileType one way or another. This is why you ask for a FileType identifier, right? Maybe it is difficult, but if you agree that it is a valid use case, why wait for the next MaRDA WG to figure it out? I am not sure how additional information would reduce the usefulness.

Let's say we are not using the registry to identify FileTypes. The tools in the registry still need to somehow tell what their intended input FileType is. And it ought to be more specific than JSON, HDF5, CSV, etc. Why not describe the FileType by characteristics that would help to identify a file's type?

_Originally posted by @markus1978 in https://github.com/marda-alliance/metadata_extractors_schema/issues/9#issuecomment-1403190937_

This issue can be used to track ongoing discussion in #9 whilst separating out the implementation details. We definitely want to support something like this in the future.

ml-evs commented 8 months ago

Another very simple approach would be to add common file extensions to each filetype registry entry, so at least we can have a best guess at which extractor might work. I might make a PR about this today.
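
As a rough illustration of the extension-based best guess, here is a minimal Python sketch. The registry contents, the filetype IDs, and the `guess_filetypes` helper are all hypothetical and are not taken from the actual schema or registry; the point is only that a suffix lookup can return zero, one, or several candidate filetypes.

```python
from pathlib import Path

# Hypothetical registry: filetype ID -> common file extensions.
# These IDs and extensions are illustrative assumptions, not real registry entries.
FILETYPE_EXTENSIONS = {
    "example-csv": [".csv"],
    "example-hdf5": [".h5", ".hdf5"],
    "example-json": [".json"],
}


def guess_filetypes(filename):
    """Return every filetype ID whose extension list contains the file's suffix."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match on the last suffix
    return [
        filetype
        for filetype, extensions in FILETYPE_EXTENSIONS.items()
        if suffix in extensions
    ]
```

An empty list means no extension matched, and a list with several entries means the extension alone is ambiguous, so the caller would still have to ask the user or try each candidate extractor.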

PeterKraus commented 4 months ago

Partially implemented in the PR above. Magic bits and mime types (and other mechanisms) will have to wait.
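
For reference, the deferred mechanisms could look roughly like the following Python sketch. It assumes that "magic bits" refers to the leading signature bytes of a file, and that MIME types would be guessed from the filename with the standard-library `mimetypes` module. The `MAGIC_SIGNATURES` table and both helper functions are hypothetical and not part of the schema.

```python
import mimetypes

# Hypothetical table of leading byte signatures -> container format label.
# The HDF5 and ZIP signatures are the well-known ones; the labels are illustrative.
MAGIC_SIGNATURES = {
    b"\x89HDF\r\n\x1a\n": "hdf5",
    b"PK\x03\x04": "zip",
}


def sniff_magic(header):
    """Return the label of the first signature that the header bytes start with."""
    for signature, label in MAGIC_SIGNATURES.items():
        if header.startswith(signature):
            return label
    return None  # unknown signature; fall back to other hints


def guess_mime(filename):
    """Guess a MIME type from the filename only, using the stdlib mapping."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime
```

In practice one would read the first few bytes of the upload (for example `open(path, "rb").read(8)`) and pass them to `sniff_magic`. Note that a signature only identifies the container format, which is coarser than the FileType discussed above, so it can narrow the candidates but not pick the extractor on its own.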