ll4strw opened this issue 5 months ago
Hello @ll4strw. You can already create standards-conformant ISCCs for HDF files (in fact, for any file type) using the ISCC SUM SubType.
Here is an example script to generate ISCCs for any file type:
"""Create an ISCC-CODE for any file"""
from os.path import basename
import iscc_sdk as idk
import iscc_core as ic
def code_iscc_sum(fp, ld_type="Dataset"):
# type: (str, str) -> idk.IsccMeta
"""
Generate a minimal ISCC-CODE (SubType SUM)
The ISCC-CODE SUM is a combination of the Data-Code and Instance-Code UNITS.
As such it can handle any file irrespective of the file format.
:param str fp: Filepath used for ISCC-CODE creation.
:param str ld_type: JSON-LD schema.org type of the identified file
:return: ISCC metadata including ISCC-CODE
:rtype: IsccMeta
"""
# Prepare basic metadata
with open(fp, "rb") as infile:
data = infile.read(4096)
meta = {
"@type": ld_type,
"filename": basename(fp),
"mediatype": idk.mediatype_guess(data, file_name=basename(fp)),
}
# Generate Data-Code and Instance-Code
data = idk.code_data(fp)
instance = idk.code_instance(fp)
iscc_code = ic.gen_iscc_code_v0([data.iscc, instance.iscc])
# Collect metadata from UNIT processors
meta.update(instance.dict())
meta.update(data.dict())
meta.update(iscc_code)
return idk.IsccMeta.construct(**meta)
if __name__ == '__main__':
fp = "/path/to/test.h5"
iscc_meta = code_iscc_sum(fp)
print(iscc_meta.json(indent=2))
The output then looks like this:
```json
{
  "@context": "http://purl.org/iscc/context",
  "@type": "Dataset",
  "$schema": "http://purl.org/iscc/schema",
  "iscc": "ISCC:KUABEKSKSEJGHCQXVBIYEODRUPP5S",
  "filename": "test.h5",
  "filesize": 15072,
  "mediatype": "application/x-hdf",
  "datahash": "1e20a851823871a3dfd92c49834eb03ceba5b182f0e5095e6bf532f8774ff240172f"
}
```
This is the structure of the ISCC SUM (see the INSPECT tab at https://huggingface.co/spaces/iscc/iscc-playground).
The Data-Code component of the ISCC SUM would allow matching HDF files that have minimal changes in the raw bitstream. How well that works in practice will depend on how deterministic HDF file encoding is.
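As a rough illustration of what such matching could look like: decode two Data-Code units and count the bits in which they differ. This is a stdlib-only sketch with hypothetical helper names (not part of iscc-core or iscc-sdk), assuming standard 64-bit units with a 2-byte header:

```python
import base64


def iscc_body_bits(code):
    """Decode a canonical ISCC unit (RFC 4648 base32, no padding) to its body.

    Assumes a standard 64-bit unit with a 2-byte header.
    """
    b32 = code.removeprefix("ISCC:")
    padded = b32 + "=" * (-len(b32) % 8)  # restore base32 padding
    raw = base64.b32decode(padded)
    return raw[2:]  # skip the 2-byte header, keep the 8-byte body


def hamming_distance(a, b):
    """Count differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

A distance of 0 over the 64-bit body means identical Data-Codes; small distances suggest similar raw bitstreams.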
If you want support for higher-level ISCC-UNITs, then it gets more complicated. In the end it is use-case dependent: you would need to think about what HDF file content similarity means and what similarity matching should accomplish. For some guidance see: https://eval.iscc.codes/similarity/
Happy to discuss any ideas around metadata/content extraction from HDF files.
Hi @titusz, thanks for your prompt reply. For my HDF files I was using the NONE (0110) SubType with whatever metadata I had available, without extracting it from the file itself. Indeed, as you said, it would be very interesting if metadata extraction and content comparison happened at the HDF level. While metadata extraction could be trivial with Python, measuring HDF similarity might require some thinking. Luckily https://docs.h5py.org/en/latest/index.html can be of help.
Yes, if you have external metadata then the ISCC NONE SubType is a perfectly valid type for internal use-cases. The problem with using custom/external metadata is interoperability: if other parties only have the HDF file, they likely cannot reproduce the Meta-Code unit. As far as I can see, HDF supports internal metadata (`Attributes`) attached to `Group` and `Dataset` objects. It shouldn't be too hard to create some deterministic metadata extraction from the objects and their attributes.
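To illustrate the "deterministic" part: once attributes have been collected per object path (e.g. with h5py's `visititems`), canonical serialization makes the result byte-stable regardless of traversal or insertion order. A minimal stdlib-only sketch; the function name and input shape are hypothetical:

```python
import json


def canonical_hdf_metadata(attrs_by_path):
    """Serialize extracted HDF5 attributes into a byte-stable JSON string.

    `attrs_by_path` maps object paths (e.g. "/group/dataset") to attribute
    dicts, as one might collect with h5py's visititems(). Sorted keys and
    fixed separators make the output independent of dict insertion order,
    so it could serve as reproducible seed metadata.
    """
    return json.dumps(
        attrs_by_path, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )


# Same attributes, different insertion order -> identical serialization
a = {"/data/raw": {"units": "counts", "detector": "A"}}
b = {"/data/raw": {"detector": "A", "units": "counts"}}
assert canonical_hdf_metadata(a) == canonical_hdf_metadata(b)
```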
I guess the best way forward would be to first create some kind of plugin system with hooks for handling specific file types. We could then create separate Python packages that register themselves. There are already some file types supported by the iscc-sdk which I would love to move into a separate package; otherwise this project will soon suffer from dependency hell :)
Speaking of metadata, how can the attributes of an iscc-schema `IsccMeta` object be filled automatically? In your example above, you create a dictionary containing all ISCC unit codes plus the composite ISCC code to construct an `IsccMeta` object. Most of the attributes in the schema will have a `None` value, though. Do I understand correctly that the idea is to add ISCC metadata to the original digital object file so that a consistent ISCC Meta-Code can be produced? Thanks
Well, the metadata situation is tricky. The SDK tries to support metadata extraction, mapping, and embedding as well as possible for well-known file types. See: https://github.com/iscc/iscc-sdk/blob/main/iscc_sdk/metadata.py. The extraction, embedding, and mapping logic lives in the individual modules per modality, for example: https://github.com/iscc/iscc-sdk/blob/main/iscc_sdk/image.py#L255
We distinguish between different kinds of metadata. Metadata that is used for the calculation of the Meta-Code is called Seed-Metadata; those are only three fields: `name`, `description`, `meta`. There are other fields that are embeddable/extractable, but they are purely informational and not processed algorithmically.
For industry-specific metadata you would usually define or pick an existing schema, serialize it into a Data-URL, and put it into the `meta` field. The long version: https://ieps.iscc.codes/iep-0002/ and also https://ieps.iscc.codes/iep-0012/
Indeed, metadata definitions are research-field dependent. In a data management system that handles both data and metadata such as iRODS, ISCC codes could nonetheless be of great use. I created a POC at https://github.com/ll4strw/python-irodsclient-iscc/tree/main if you are interested.
Good morning, would it be possible to add support for the following data format, please?
https://en.wikipedia.org/wiki/Hierarchical_Data_Format
Thanks