Closed lzehl closed 4 years ago
We need a property named hash or digest, which contains a digest of the file contents to check for corruption or over-writing of the file contents (see https://en.wikipedia.org/wiki/File_verification)
@apdavison agreed that this would be useful. I have some questions though: 1) who will / is supposed to generate the hash? (@olinux this would be also a question for you) 2) should this be also a property of fileBundle and fileRepository?
I propose this nested structure:
hash
    digest: Output of hashing algorithm
    algorithm: Name of the algorithm used to generate the digest
We could also think about implementing this as a list, so that digests from different algorithms are supported.
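For illustration, here is a minimal sketch of how a registration tool might fill in such a digest/algorithm pair for a single file. The function name `compute_hash`, the default of SHA-256, and the dictionary layout are assumptions made for this example; none of them are fixed by the schema discussion.

```python
import hashlib
from pathlib import Path


def compute_hash(path, algorithm="sha256", chunk_size=1 << 20):
    """Return a {"digest", "algorithm"} mapping for the file at `path`.

    Hypothetical helper: the key names mirror the proposed properties,
    the default algorithm is only an assumption.
    """
    hasher = hashlib.new(algorithm)
    with Path(path).open("rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return {"digest": hasher.hexdigest(), "algorithm": algorithm}
```

Storing the algorithm name next to the digest is what lets a later check recompute the digest with the same function and compare the two values.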
@skoehnen looks good, but if a list is a possibility we should consider it as an independent schema. That makes it easier to reuse / reference.
Same actually also goes for quantities (number + unit), but this would be a different topic / issue.
@skoehnen 👍
@lzehl the hash should be generated automatically: by the automatic indexing tool for CSCS, by the planned provenance API, etc.
@skoehnen looks good, but if a list is a possibility we should consider it as an independent schema. That makes it easier to reuse / reference.
Good point.
What is the consensus: should we support multiple hashes? This would also introduce the need for some kind of check that prevents multiple digests from the same hash algorithm. Coupled with the need for an additional schema, it may not be worth the trouble.
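To make the extra check concrete, here is a rough sketch of what it could look like if a list of hashes were allowed. The function name `duplicate_algorithms` and the case-insensitive comparison of algorithm names are assumptions for this example only.

```python
from collections import Counter


def duplicate_algorithms(hashes):
    """Return the algorithm names that occur more than once.

    `hashes` is assumed to be a list of {"digest", "algorithm"} mappings.
    Names are compared case-insensitively (an assumption of this sketch).
    """
    counts = Counter(entry["algorithm"].lower() for entry in hashes)
    return sorted(name for name, count in counts.items() if count > 1)
```

An empty result would mean that the list holds at most one digest per algorithm.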
@lzehl the hash should be generated automatically: by the automatic indexing tool for CSCS, by the planned provenance API, etc.
@apdavison Should we add an attribute that stores the origin of the hash, if it was generated by indexing or by the provenance tracking?
@skoehnen I think storing the origin of the hash would be overkill; the hash should be the same anyway, regardless of which tool generated it.
re: multiple hashes, I think it is not worth the trouble
@skoehnen what I gather from your and @apdavison's comments is that we define the hash as
hash
    digest
    algorithm
I would keep it as a separate schema that we link to, because nested schemas will create some difficulties for the KG team. Nonetheless keep the count to 1.
@apdavison & @skoehnen should the hash be a property of all file related schemas (fileRepository #79 , fileBundle #80 , and fileInstance)?
@lzehl: yes, hash should be a property of fileRepository and fileBundle as well.
Defining how to calculate such a hash is more complicated than for individual files, however. A quick search yielded this: https://github.com/andhus/dirhash but I don't know how widely adopted this is.
@lzehl: yes, hash should be a property of fileRepository and fileBundle as well.
Defining how to calculate such a hash is more complicated than for individual files, however. A quick search yielded this: https://github.com/andhus/dirhash but I don't know how widely adopted this is.
Does a file bundle always have an attached archive? @olinux That would be a good workaround.
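Where no archive is attached, one possible way to get a reproducible digest for a bundle or repository is to combine the per-file digests in a fixed order. The sketch below is not the dirhash algorithm linked above and not an agreed convention; the function name `compute_bundle_hash`, the sort order, and the length-prefixed path encoding are all assumptions chosen for illustration.

```python
import hashlib
from pathlib import Path


def compute_bundle_hash(root, algorithm="sha256", chunk_size=1 << 20):
    """Return a {"digest", "algorithm"} mapping for all files below `root`.

    Hypothetical scheme: hash each file, then hash the sorted sequence of
    (relative path, file digest) pairs. Empty directories are ignored.
    """
    root = Path(root)
    bundle = hashlib.new(algorithm)
    # Sort by relative POSIX path so the result does not depend on listing order.
    files = sorted(
        (p for p in root.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(root).as_posix(),
    )
    for path in files:
        file_hasher = hashlib.new(algorithm)
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                file_hasher.update(chunk)
        rel = path.relative_to(root).as_posix().encode("utf-8")
        # Length-prefix the path so path and digest boundaries are unambiguous.
        bundle.update(len(rel).to_bytes(8, "big"))
        bundle.update(rel)
        bundle.update(file_hasher.digest())
    return {"digest": bundle.hexdigest(), "algorithm": algorithm}
```

Because relative paths are part of the input, renaming or moving a file changes the bundle digest even when the file contents stay the same. Whether that is desirable would need to be decided.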
okay! I've updated the documentation of the schema above and included the hash.
I'll do the same on the other file related schemas.
1. who will / is supposed to generate the hash? (@olinux this would be also a question for you)
and the hash should be generated automatically: by the automatic indexing tool for CSCS, by the planned provenance API, etc.
I do agree - it's not practical to let the curators add the hashes manually - this has to happen in the file registration mechanism. Ideally, we could profit from already pre-calculated hashes (e.g. SWIFT object store provides some) so we don't have to run an analysis on top of the file. Supporting different algorithms makes sense. We don't need to have all variants for all files though, since having one consistent algorithm per file should be sufficient to detect changes.
@olinux thanks for the feedback. I agree, if you can auto-register the repository with all folders and files, the hash should be automatically generated as well.
Note: all file schemata should nevertheless also be editable by the user, for externally hosted data which you cannot auto-register. But in such cases the user would most likely also do this computationally and not manually.
TODO: change mediaType to contentType
Suggestions:
To handle the fact that the same fileInstance can be stored in various places (e.g. once in CSCS, once in Jülich...) we could have an array of resourceLocators or even introduce a new very simple object "FileLocation" of which the FileInstance can have multiple.
isPartOf should be pointing to FileRepository instead - for the above reason, this might be an array as well. IMHO the fileinstance shouldn't know about the filebundles it is connected to but rather the other way around (the file bundles keep a list of their files)
shouldn't "storageSize" be a "QuantitativeValue" instead?
yes! forgot to update this one... I've corrected the docu above
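As a rough illustration of the "FileLocation" suggestion above, the data model could look something like the following. This is only a sketch: the class and field names (`Hash`, `FileLocation`, `resource_locator`, `is_part_of`, `locations`) and the example.org URLs are hypothetical and not taken from any existing schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hash:
    digest: str
    algorithm: str


@dataclass
class FileLocation:
    # Where this copy of the file can be retrieved (hypothetical field name).
    resource_locator: str
    # Identifier of the FileRepository that holds this copy.
    is_part_of: str


@dataclass
class FileInstance:
    name: str
    hash: Hash
    # One entry per place the same file content is stored.
    locations: List[FileLocation] = field(default_factory=list)


# The same file content stored in two repositories (placeholder URLs).
example = FileInstance(
    name="recording.dat",
    hash=Hash(digest="<digest>", algorithm="sha256"),
    locations=[
        FileLocation("https://example.org/site-a/recording.dat", "repository-a"),
        FileLocation("https://example.org/site-b/recording.dat", "repository-b"),
    ],
)
```

In this layout the hash belongs to the file instance rather than to a location, which matches the earlier point that the digest should be the same no matter where or by which tool it is computed.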
The fileInstance.schema.json will be used to identify a single file that is part of a research product version. Note: For some repository hosts (e.g. EBRAINS) the files can be detected semi-automatically within the KG system, as can their grouping into the corresponding hierarchical file bundle structure of the repository.
According to the current documentation, this schema will have the following properties:
@type [expects: constant ("https://openminds.ebrains.eu/core/fileBundle"), count: 1]
@id [expects: free text, count: 1]
Note for controlledTerm.schema.json for... specificRoleFileUsage): expects JSON-LDs for screenshot, icon, preview, + ?