Added schema snippets that help to match files to file types.

markus1978 commented 1 year ago

Added a few attributes and specialised classes that would allow matching a specific file to a file type.

This is the kind of file type metadata that would help to automatically find a described file type for a concrete file at hand. These are all ideas that we implemented in NOMAD in one way or another to find the right extractor for a given file.

markus1978 commented 1 year ago

This is my use case. If someone uploads an arbitrary file to my ELN and I have a whole registry of tools to process it, the ELN still needs to figure out which tool to use. Identifying the FileType would give you the connection. Otherwise, I need to rely on the source (e.g. user) to tell me the type.

To apply a tool, the ELN needs to figure out the FileType one way or another. This is why you ask for a FileType identifier, right? Maybe it is difficult, but if you agree that it is a valid use case, why wait for the next MaRDA WG to figure it out? I am not sure how additional information would reduce the usefulness.

Let's say we are not using the registry to identify FileTypes. The tools in the registry still need to somehow tell what their intended input FileType is. And it ought to be more specific than JSON, HDF5, csv, etc. Why not describe the FileType by characteristics that would help to identify a file's type?

NicolasCARPi commented 1 year ago

Isn't the mime type + extension + fileinfo sufficient to figure out a file type? This is how it's done in the PHP world: https://github.com/symfony/mime/blob/6.2/MimeTypes.php
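
For comparison, a rough Python analogue of the extension part of that lookup might look like the sketch below. It uses only the standard library's `mimetypes` module, which guesses from the file name and does not inspect contents the way fileinfo does. The file names are made up for illustration.

```python
import mimetypes


def guess_general_type(file_name):
    """Guess a MIME type from the file extension alone (no content inspection)."""
    mime, _encoding = mimetypes.guess_type(file_name)
    return mime


# Two hypothetical files from different instruments share one general type.
print(guess_general_type("xrd_scan.json"))
print(guess_general_type("battery_cycling.json"))
```

Both names resolve to the same general type, so on its own this only answers the "is it JSON, XML, or csv" part of the question.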

markus1978 commented 1 year ago

The general class of file types you can figure out like this, e.g. HDF5, JSON, XML, csv, ASCII. But this information alone only gets you halfway if there are a couple of dozen tools for HDF5, each of which only understands a very specific "flavour". If your tool only understands a certain HDF5 "flavour" and won't run on every HDF5 file, specifying its input file type as HDF5 would be wrong, wouldn't it? I think the term FileType should be understood in that narrower sense.
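
A minimal sketch of the "flavour" problem, using JSON rather than HDF5 so that it needs only the standard library. The two documents and the list of characteristic keys are invented for illustration; they are not taken from the PR or from NOMAD.

```python
import json


def has_keys(document, required_keys):
    """Return True if the parsed JSON is an object containing every required top-level key."""
    return isinstance(document, dict) and all(key in document for key in required_keys)


# Two hypothetical JSON "flavours"; both would be reported as application/json.
xrd_file = json.loads('{"instrument": "xrd", "two_theta": [10.0, 10.1], "counts": [5, 7]}')
cycling_file = json.loads('{"instrument": "cycler", "cycles": [{"capacity": 1.2}]}')

# Assumed characteristic keys of the hypothetical XRD flavour.
XRD_KEYS = ["two_theta", "counts"]

print(has_keys(xrd_file, XRD_KEYS))
print(has_keys(cycling_file, XRD_KEYS))
```

A tool that declares only "JSON" as its input would be offered both files. A tool that also declares its characteristic keys can be matched to the one it actually understands.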

markus1978 commented 1 year ago

I guess there are three scopes you can go for:

1. just name the input file type
2. describe the input file type enough to match tools to files
3. fully specify the allowed input file type with schemas and everything

My intention with this PR was to go for 2.

PeterKraus commented 1 year ago

Thanks for your input.

My intention with this WG was to go for 1. and eventually work our way towards 3. using undiscovered unicorn magic coming from another MaRDA WG.

If you guys think 2. would be useful for your ELNs right now, I'm on board in principle. However: I don't know how to do this, and I don't have the capacity right now to do this, so I will need your help.

@markus1978 could you mock up a quick example of how these extra FileType attributes would work? I have trouble visualising it with what you proposed in this PR at the moment. Especially when dealing with JSON or ASCII, as you mentioned. It would then be great if @NicolasCARPi and/or any other ELN developer could chime in whether that implementation of this idea is helpful for them, so that we can cover as many use cases in the initial design as practicable.

kjappelbaum commented 1 year ago

> It would then be great if @NicolasCARPi and/or any other ELN developer could chime in whether that implementation of this idea is helpful for them, so that we can cover as many use cases in the initial design as practicable.

The query approach is also what we use in practice. In our drag-and-drop fields we have some logic that first checks the extension, then performs a query that is indicative of a given file type, and then calls a specific parser.
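
A minimal sketch of such an extension-then-query dispatch, assuming a hand-written table of parsers. The parser functions, the `PARSERS` table and the `"wavelength"` query are hypothetical and are not taken from any of the ELNs mentioned in this thread.

```python
import json
import os


def parse_generic_json(text):
    """Hypothetical fallback parser for any JSON file."""
    return {"parser": "generic", "data": json.loads(text)}


def parse_spectrum_json(text):
    """Hypothetical parser for a JSON flavour that carries a 'wavelength' field."""
    return {"parser": "spectrum", "data": json.loads(text)}


# (extension, query, parser): all entries are made-up examples.
# The first entry whose extension and query both match wins, so order matters.
PARSERS = [
    (".json", lambda text: '"wavelength"' in text, parse_spectrum_json),
    (".json", lambda text: True, parse_generic_json),
]


def pick_parser(file_name, text):
    """Check the extension first, then run the cheap content query."""
    extension = os.path.splitext(file_name)[1].lower()
    for ext, query, parser in PARSERS:
        if ext == extension and query(text):
            return parser
    return None


parser = pick_parser("sample.json", '{"wavelength": [400, 401]}')
print(parser.__name__)
```

Putting the most specific query first and a catch-all last is one simple way to limit false positives; other orderings or scoring schemes are equally possible.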

markus1978 commented 1 year ago

I can understand why 1. and 3. are more attractive: 1. is the easy one and 3. is the soundest from an engineering point of view, while 2. is the ugly, fuzzier one that has to deal with false positives and the like. But NOMAD used 2. to match 12 million unknown files to their respective extractors.

No hard feelings if you decide to define the scope differently and reject this. This super-quick-and-dirty PR is only for communicating ideas. For this and other stuff that is not in the WG's scope, maybe there should be a registry extension mechanism within the WG's scope?

Just to get the ideas across, here is some very non-OO, highly unoptimised pseudocode that uses file type information to find the right extractor for a given file at file_path.

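import re

import magic  # python-magic, an actual library to figure out the mime type of files

MAX_CONTENT_MATCH_SIZE = 4096  # placeholder value: how much of a text file to read

def get_extractor(file_path):
    # registry, TextFileType, StructuredFileType, get_structured_format_reader and
    # xpath are placeholders for whatever the registry and schema end up providing.
    mime = magic.from_file(file_path, mime=True)
    for tool in registry.tools:
        file_type = tool.file_type

        # Cheapest checks first: mime type, then file name pattern.
        if mime not in file_type.mimetypes:
            continue

        if not re.search(file_type.filenames, file_path):
            continue

        if isinstance(file_type, TextFileType):
            # Only read the beginning of the file to keep matching cheap.
            with open(file_path, errors="replace") as f:
                some_content = f.read(MAX_CONTENT_MATCH_SIZE)
            if not re.search(file_type.content, some_content):
                continue

        if isinstance(file_type, StructuredFileType):
            reader = get_structured_format_reader(mime)
            data = reader.read(file_path)
            if not all(key in data for key in file_type.characteristic_keys):
                continue
            # characteristic_key_values is assumed to be a dict of key -> expected value.
            expected = file_type.characteristic_key_values
            if any(data.get(key) != value for key, value in expected.items()):
                continue
            if not xpath(file_type.query, data):
                continue

        return tool  # first tool whose file type passes every check

    return None  # no tool matched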

PeterKraus commented 1 year ago

Thanks a lot @markus1978, I think I get how this would work now.

Would "we" (MaRDA WG) provide this get_extractor function (or a "reference implementation thereof"), or would we simply define the FileType schema that provides the information and leave these implementation details completely for downstream teams including folks like the three of you?

ml-evs commented 1 year ago

Great discussion! If we can help people do 2. with what we standardize then I don't see any reason to leave it out. My concern was that it (to me) requires some "meta-standard" to be useful, on how to specify the query and the tools that are required to execute it (e.g., xpath, regex, JSON keys, etc.) -- if there are already many projects doing this in different ways then chances are our version of the meta-standard would not be that useful. Perhaps I am overcomplicating it though?

I can see that this is where @markus1978's original inheritance ideas would work nicely, as e.g., the inherited structured file type for JSON could interpret query as nested field names, whereas the equivalent for XML could be xpath and text files could just use a regex. I would still try to introduce this via composition rather than inheritance but would need some more time to think about it.
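
One possible shape for the composition variant, as a rough sketch only. The `FileType` dataclass, its `matcher` field and the three query functions are hypothetical names, not part of the schema in this PR. The XML matcher relies on the limited XPath subset supported by the standard library's `xml.etree.ElementTree`.

```python
import json
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable


def json_query(query, text):
    """Interpret query as dot-separated nested field names, e.g. 'meta.instrument'."""
    node = json.loads(text)
    for name in query.split("."):
        if not isinstance(node, dict) or name not in node:
            return False
        node = node[name]
    return True


def xml_query(query, text):
    """Interpret query as an ElementTree path (a limited XPath subset)."""
    return ET.fromstring(text).find(query) is not None


def text_query(query, text):
    """Interpret query as a regular expression."""
    return re.search(query, text) is not None


@dataclass
class FileType:
    name: str
    query: str
    matcher: Callable[[str, str], bool]  # composed in, not inherited

    def matches(self, text):
        return self.matcher(self.query, text)


# Hypothetical file types: one class, three interpretations of "query".
json_type = FileType("example-json", "meta.instrument", json_query)
xml_type = FileType("example-xml", "./meta/instrument", xml_query)
text_type = FileType("example-text", r"^# instrument:", text_query)

print(json_type.matches('{"meta": {"instrument": "xrd"}}'))
```

With this layout a single `FileType` class covers all three formats, and supporting another format means adding a matcher function rather than a subclass. Whether that is preferable to inheritance is exactly the open design question raised above.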

Somehow this implies a MaRDA FileType Detector client API on top of everything we have previously discussed; I'm not sure deploying a service for file type detection that implements this client is within scope of this WG (at least in the next 9 months!), but it feels like this is what is required to make this actually useful?

ml-evs commented 10 months ago

Closing for more discussion in #45; this PR can then be revisited with the latest schema updates.

marda-alliance / metadata_extractors_schema

Added schema snippets that help to match files to file types. #9