huridocs / uwazi

Uwazi is a web-based, open-source solution for building and sharing document collections
http://www.uwazi.io
MIT License

Automatically Hash Uploaded Documents to Produce an IPFS CID #5629

Open katelynsills opened 1 year ago

katelynsills commented 1 year ago

Note: this is categorized as a feature request, but we would be happy to build this ourselves. I'm mostly looking for guidance on the preferred way to build it as a service, and to see if there is any interest in eventually integrating the feature into UWAZI itself. Thanks!

Is your feature request related to a problem? Please describe.

As part of our use case for Starling Lab, we would like to hash documents uploaded to UWAZI. This serves two purposes: 1) we can store the hash and thus detect any tampering or data corruption, and 2) we can use the hash as an identifier when we record metadata or refer to the document in other contexts.

Specifically, we use an IPFS CIDv1 hash. CIDv1 has a few major benefits over a plain SHA256: it is self-describing so there's less confusion about how the hash was produced, and it is based on chunking, meaning that for large files, there are intermediate hashes that can be used to deduplicate storage. And of course, CIDv1 is used by IPFS to reference documents, meaning that users of UWAZI would hypothetically be able to see whether their local file matches a remote file and can share files by merely sharing the IPFS link if the document is already on IPFS.
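To illustrate what "self-describing" means here, the following is a minimal sketch of how a CIDv1 string is assembled for a single raw block: a version byte, a content-type code, and a multihash that names its own hash function and digest length, all encoded as lowercase base32 with a `b` prefix. The function name `raw_cidv1` is hypothetical. This should only match what IPFS produces for small files that fit in one chunk and are added with raw leaves; larger files are chunked into a DAG whose root CID this sketch does not reproduce.

```python
import base64
import hashlib

CID_VERSION_1 = 0x01   # CID version prefix
RAW_CODEC = 0x55       # multicodec code for raw binary content
SHA2_256 = 0x12        # multihash code for sha2-256
DIGEST_LENGTH = 0x20   # sha2-256 digest length in bytes (32)


def raw_cidv1(data: bytes) -> str:
    """Return the base32 CIDv1 string for `data` treated as one raw block.

    Assumption: single-block content only; chunked files are not handled.
    """
    digest = hashlib.sha256(data).digest()
    # The multihash prefixes the digest with the hash function and its length.
    multihash = bytes([SHA2_256, DIGEST_LENGTH]) + digest
    cid_bytes = bytes([CID_VERSION_1, RAW_CODEC]) + multihash
    # Multibase base32: lowercase, no padding, prefixed with "b".
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded
```

Because the version, codec, and hash function are all part of the identifier, anyone holding the CID can tell how it was produced without out-of-band information.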

Describe the solution you'd like

Ideally, we would like to build an optional service that performs the hashing automatically on upload on the UWAZI backend. For our use case, we would like to produce the same CID that an actual upload to IPFS would produce. However, we do not want to actually upload to IPFS, as that would make the documents public.
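One possible way to get the same CID as a real upload without publishing anything is the `--only-hash` option of `ipfs add`, which chunks and hashes the file but does not write it to the local datastore. The sketch below assumes the `ipfs` (Kubo) command-line tool is installed on the backend host; the helper names `build_only_hash_command` and `compute_cid` are hypothetical and not part of UWAZI. The result will only match a later upload if that upload uses the same CID version and chunking settings.

```python
import subprocess


def build_only_hash_command(path: str) -> list[str]:
    """Build the CLI invocation that hashes `path` without storing it."""
    return [
        "ipfs", "add",
        "--only-hash",       # chunk and hash only, do not write to the datastore
        "--cid-version=1",   # emit a CIDv1 instead of the default CIDv0
        "--quieter",         # print only the final root hash
        path,
    ]


def compute_cid(path: str) -> str:
    """Run the command and return the CID printed by `ipfs`.

    Assumption: an `ipfs` binary is available on PATH.
    """
    result = subprocess.run(
        build_only_hash_command(path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if ipfs exits non-zero
    )
    return result.stdout.strip()
```

A backend hook could call `compute_cid` on the stored file after the upload completes and save the returned string alongside the document's metadata.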

Have you considered an alternative?

We've considered hashing in the browser, or requiring the user to re-upload the file elsewhere to hash it, but some files in our use case may be very large, and it is easiest to perform the hashing where the file is already being uploaded (the UWAZI backend) rather than transmitting the file elsewhere.

Additional context

[Edit] Here is our current solution, a script which is run manually.

We would be happy to build this ourselves! Just wanted to reach out to see if there might be any guidance or interest in what we build. Thanks again!

RafaPolit commented 8 months ago

Revisiting this after almost a year, we believe this could be of use for several cases, particularly our preservation flow using the Preserve app. We will look into this. Thanks.