v3 schema-revision: fileInstance.schema.json

lzehl commented 4 years ago

The fileInstance.schema.json will be used to identify a single file that is part of a research product version. Note: For some repository hosts (e.g. EBRAINS), the files can be detected semi-automatically within the KG system, as can their grouping into the corresponding hierarchical file bundle structure of the repository.

According to the current documentation, this schema will have the following properties:

@type [expects: constant ("https://openminds.ebrains.eu/core/fileInstance"), count: 1]
@id [expects: free text, count: 1]
content [expects: free text, count: 0 - 1] (this property has to be added "manually" in all cases)
contentType [expects: contentType.schema.json, count: 1]
hash [expects: hash.schema.json, count: 1] (this schema has to be added "manually" if the KG does not automatically detect the file)
isPartOf [expects: fileBundle.schema.json, count: 1 - N] (conceptual groupings need to be added "manually")
name [expects: free text, count: 1]
resourceLocator [expects: string (format: uri), count: 0 - 1]
specificRole [expects: controlledTerm.schema.json, count: 0 - 1]
storageSize [expects: quantitativeValue.schema.json, count: 1]
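
For illustration, the property list above could translate into an instance like the one below. This is only a sketch: the identifiers, the file name, and every example.org URL are made-up placeholders, the @type IRI is assumed to be named after this schema, and linked schemas are shown as plain @id references rather than their real content.

```python
import json

# Hypothetical fileInstance document; every value below is a placeholder.
file_instance = {
    "@type": "https://openminds.ebrains.eu/core/fileInstance",  # assumed IRI
    "@id": "https://example.org/files/0001",
    "name": "recording_01.csv",
    "contentType": {"@id": "https://example.org/contentTypes/text-csv"},
    "hash": {"@id": "https://example.org/hashes/0001"},
    "isPartOf": [{"@id": "https://example.org/fileBundles/raw-data"}],
    "storageSize": {"@id": "https://example.org/quantities/0001"},
    # Optional properties (count 0 - 1) may simply be left out.
    "resourceLocator": "https://example.org/data/recording_01.csv",
}

# Properties listed above with a minimum count of 1.
REQUIRED = ["@type", "@id", "contentType", "hash", "isPartOf", "name", "storageSize"]


def missing_required(doc):
    """Return the required property names that are absent from doc."""
    return [key for key in REQUIRED if key not in doc]


print(json.dumps(file_instance, indent=2))
print("missing:", missing_required(file_instance))
```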

Note for controlledTerm.schema.json for...

specificRole (specificRoleFileUsage): expects JSON-LDs for screenshot, icon, preview, + ?

apdavison commented 4 years ago

We need a property named hash or digest, which contains a digest of the file contents to check for corruption or over-writing of the file contents (see https://en.wikipedia.org/wiki/File_verification)
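
As a rough sketch of what generating such a digest could look like, the snippet below uses Python's standard hashlib. The function name, the default algorithm, and the chunk size are assumptions made for this example, not something decided in this thread.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file at path, reading it in chunks."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Comparing the returned value with a previously stored digest would reveal whether the file contents were corrupted or overwritten.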

lzehl commented 4 years ago

@apdavison agreed that this would be useful. I have some questions though: 1) who will / is supposed to generate the hash? (@olinux this would also be a question for you) 2) should this also be a property of fileBundle and fileRepository?

skoehnen commented 4 years ago

I propose this nested structure:

hash
- digest: output of the hashing algorithm
- algorithm: name of the algorithm used to generate the digest

We could also think about implementing this as a list, so that multiple digests from different algorithms are supported.
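
A minimal sketch of the proposed structure, including the list variant, might look as follows. The class name, the helper function, and the choice of algorithms are illustrative assumptions only.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Hash:
    digest: str     # output of the hashing algorithm (hex-encoded here)
    algorithm: str  # name of the algorithm used to generate the digest


def hashes_for(data, algorithms=("sha256", "sha512")):
    """Build one Hash record per requested algorithm for the given bytes."""
    return [Hash(hashlib.new(name, data).hexdigest(), name) for name in algorithms]


for record in hashes_for(b"example file contents"):
    print(record.algorithm, record.digest)
```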

lzehl commented 4 years ago

@skoehnen looks good, but if a list is a possibility we should consider it as an independent schema. That makes it easier to reuse / reference.

Same actually also goes for quantities (number + unit), but this would be a different topic / issue.

apdavison commented 4 years ago

@skoehnen 👍

@lzehl the hash should be generated automatically: by the automatic indexing tool for CSCS, by the planned provenance API, etc.

skoehnen commented 4 years ago

> @skoehnen looks good, but if a list is a possibility we should consider it as a independent schema. Makes it easier to be reused / referenced.

Good point.

What is the consensus, should we support multiple hashes? This would also introduce the need for some kind of check that prevents multiple digests from the same algorithm. Coupled with the need for an additional schema, it may not be worth the trouble.
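
To give an idea of the kind of check meant here, the sketch below flags algorithms that appear more than once in a list of hash records. The record layout (dictionaries with "algorithm" and "digest" keys) and the case-insensitive comparison are assumptions for this example.

```python
from collections import Counter


def duplicate_algorithms(hashes):
    """Return the algorithm names that occur more than once in hashes."""
    # Compare names case-insensitively so "SHA256" and "sha256" count as one.
    counts = Counter(entry["algorithm"].lower() for entry in hashes)
    return sorted(name for name, count in counts.items() if count > 1)
```

A validator would reject the list whenever the returned value is non-empty.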

skoehnen commented 4 years ago

> @lzehl the hash should be generated automatically: by the automatic indexing tool for CSCS, by the planned provenance API, etc.

@apdavison Should we add an attribute that stores the origin of the hash, if it was generated by indexing or by the provenance tracking?

apdavison commented 4 years ago

@skoehnen I think storing the origin of the hash would be overkill, it should anyway be the same independent of which tool generated it.

apdavison commented 4 years ago

re: multiple hashes, I think it is not worth the trouble

lzehl commented 4 years ago

@skoehnen from what I read in your and @apdavison's comments, the plan is to define the hash as

hash
- digest
- algorithm

I would keep it as a separate schema that we link to, because nested schemas will create some difficulties for the KG team. Nonetheless keep the count to 1.

@apdavison & @skoehnen should the hash be a property of all file-related schemas (fileRepository #79, fileBundle #80, and fileInstance)?

apdavison commented 4 years ago

@lzehl: yes, hash should be a property of fileRepository and fileBundle as well.

Defining how to calculate such a hash is more complicated than for individual files, however. A quick search yielded this: https://github.com/andhus/dirhash but I don't know how widely adopted this is.
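
To show why this is more involved, here is one possible scheme, sketched with the standard library only. It is not the algorithm used by dirhash; the function name and the way relative paths and per-file digests are combined are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def directory_digest(root, algorithm="sha256"):
    """Digest a directory from its files' relative paths and contents."""
    root = Path(root)
    # Sort by relative POSIX path so the result does not depend on the
    # order in which the file system happens to list the entries.
    entries = sorted(
        (p.relative_to(root).as_posix(), p) for p in root.rglob("*") if p.is_file()
    )
    outer = hashlib.new(algorithm)
    for relative, path in entries:
        inner = hashlib.new(algorithm, path.read_bytes()).hexdigest()
        outer.update(f"{relative}\0{inner}\n".encode("utf-8"))
    return outer.hexdigest()
```

Even this small sketch has to make choices a single-file digest never faces: empty directories are ignored, file names are part of the result, and symbolic links or ignore rules are not handled at all.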

skoehnen commented 4 years ago

> @lzehl: yes, hash should be a property of fileRepository and fileBundle as well.
>
> Defining how to calculate such a hash is more complicated than for individual files, however. A quick search yielded this: https://github.com/andhus/dirhash but I don't know how widely adopted this is.

Does a file bundle always have an attached archive? @olinux That would be a good workaround.

lzehl commented 4 years ago

okay! I've updated the documentation of the schema above and included the hash.
I'll do the same for the other file-related schemas.

olinux commented 4 years ago

> 1. who will / is supposed to generate the hash? (@olinux this would be also a question for you)
> and the hash should be generated automatically: by the automatic indexing tool for CSCS, by the planned provenance API, etc.

I do agree - it's not practical to let the curators add the hashes manually - this has to happen in the file registration mechanism. Ideally, we could profit from pre-calculated hashes (e.g. the SWIFT object store provides some) so we don't have to run an analysis on top of the file. Supporting different algorithms makes sense - we don't need to have all variants for all files, though, since having a consistent algorithm per file should be sufficient for validating changes.
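
As a sketch of how one recorded algorithm per file could be enough to detect changes, the check below recomputes the digest with whatever algorithm name is stored alongside it. The record layout and the function name are assumptions for this example.

```python
import hashlib


def content_unchanged(data, recorded):
    """Recompute the digest with the recorded algorithm and compare."""
    current = hashlib.new(recorded["algorithm"], data).hexdigest()
    # Hex digests are compared case-insensitively.
    return current == recorded["digest"].lower()
```

A digest taken over from the object store could be checked the same way, as long as the name of the algorithm it was produced with is recorded next to it.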

lzehl commented 4 years ago

@olinux thanks for the feedback. I agree, if you can auto-register the repository with all folders and files, the hash should be automatically generated as well.

Note: all file schemata should, however, also be editable by the user for externally hosted data that you could not auto-register. But in such cases the user would most likely also do this computationally and not manually.

lzehl commented 4 years ago

TODO: change mediaType to contentType

olinux commented 4 years ago

Suggestions:

- To handle the fact that the same fileInstance can be stored in various places (e.g. once in CSCS, once in Jülich...) we could have an array of resourceLocators or even introduce a new, very simple object "FileLocation", of which the FileInstance can have multiple.
- isPartOf should point to FileRepository instead - for the above reason, this might be an array as well. IMHO the fileInstance shouldn't know about the fileBundles it is connected to, but rather the other way around (the file bundles keep a list of their files).
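
The two suggestions could be pictured roughly as follows. All class and field names here are hypothetical and only meant to show the direction of the links, not an agreed schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileLocation:
    resource_locator: str  # URI where one copy of the file is stored


@dataclass
class FileInstance:
    name: str
    locations: List[FileLocation] = field(default_factory=list)


@dataclass
class FileBundle:
    name: str
    # The bundle keeps the list of its files, not the other way around.
    files: List[FileInstance] = field(default_factory=list)


def bundles_containing(file, bundles):
    """Look up bundle membership from the bundle side."""
    return [b.name for b in bundles if any(f is file for f in b.files)]
```

With this layout a file stored at two sites simply carries two FileLocation entries, and adding it to another bundle never requires touching the file record itself.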

olinux commented 4 years ago

Shouldn't "storageSize" be a "QuantitativeValue" instead?

lzehl commented 4 years ago

yes! forgot to update this one... I've corrected the documentation above

openMetadataInitiative / openMINDS_core

v3 schema-revision: fileInstance.schema.json #81