Automatically Hash Uploaded Documents to Produce an IPFS CID

katelynsills commented 1 year ago

User Story

As a researcher, when I upload an image or .wacz file to UWAZI, I want to be able to view the asset's CID and use the CID to query for additional metadata. This CID should be the exact same CID as if I had run Kubo ipfs add with the settings:

--only-hash=true
--wrap-with-directory=false
--cid-version=1
--hash=sha2-256
--pin=true 
--raw-leaves=true
--chunker=size-262144
--nocopy=false
--fscache=false
--inline=false
--inline-limit=32

Note that many of these settings are the defaults, but we want to specify them explicitly so that the resulting CID stays the same if the defaults ever change.
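As a rough illustration of pinning every flag explicitly, here is a minimal Python sketch that builds the `ipfs add` command line from the list above. The function names are hypothetical, and `--quieter` is an addition of mine (not part of the list above) so that Kubo prints only the final CID. It assumes the `ipfs` binary is on the PATH of the machine running it.

```python
import subprocess
from pathlib import Path

# Flags copied from the list above. They are spelled out in full so that a
# change in Kubo's defaults cannot silently change the resulting CID.
IPFS_ADD_FLAGS = [
    "--only-hash=true",
    "--wrap-with-directory=false",
    "--cid-version=1",
    "--hash=sha2-256",
    "--pin=true",
    "--raw-leaves=true",
    "--chunker=size-262144",
    "--nocopy=false",
    "--fscache=false",
    "--inline=false",
    "--inline-limit=32",
]


def build_ipfs_add_command(path):
    """Return the argv list for hashing `path` (hypothetical helper)."""
    # --quieter is an assumption added here: it limits output to the final CID.
    return ["ipfs", "add", "--quieter", *IPFS_ADD_FLAGS, str(Path(path))]


def compute_cid(path):
    """Run Kubo and return the CID it prints. Requires `ipfs` on PATH."""
    result = subprocess.run(
        build_ipfs_add_command(path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Keeping the flags in one constant means any later script or service can import the same list rather than retyping it.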

Current Method

Currently, @makew0rld has written a script which queries the API (/api/search) for all documents, keeps only the documents that do not have a CID yet, calculates each CID using ipfs add, and then uses the API to write the CID to the entity metadata.

This requires installing Kubo on the server that is hosting the UWAZI backend, and has the benefit of 1) not needing to transmit the files, and 2) making use of the UI that already exists for custom entity metadata. Furthermore, by using the API, we correctly trigger indexing and validation that direct MongoDB access would not. (The use of the API and the metadata field were great finds by @makew0rld!)
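The filtering step could look roughly like the sketch below. This is not @makew0rld's actual script. It assumes each entity returned by the search is a dict with the same `metadata` shape as the entity document shown later in this thread, and that the custom field is named `cid`. The function names are hypothetical.

```python
def has_cid(entity, field="cid"):
    """Return True if the entity already carries a non-empty CID value."""
    # Assumed shape: {"metadata": {"cid": [{"value": "..."}]}}
    values = (entity.get("metadata") or {}).get(field) or []
    return any(item.get("value") for item in values)


def entities_missing_cid(rows, field="cid"):
    """Keep only the entities that still need a CID computed."""
    return [entity for entity in rows if not has_cid(entity, field)]


# Example with made-up entities: only the first one already has a CID.
sample_rows = [
    {"_id": "a", "metadata": {"cid": [{"value": "abc123"}]}},
    {"_id": "b", "metadata": {}},
    {"_id": "c"},
]
todo = entities_missing_cid(sample_rows)
```

Treating a missing `metadata` key, an empty list, and an empty string the same way keeps the script safe to re-run.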

TODOs

When we want to come back to improve this segment of the prototype, we will want to tackle these additional tasks:

[ ] Since currently we are adding CIDs to entities, but there may be multiple files per entity, we either need to switch to support CIDs for multiple files per entity, or we need to officially decide to only support one file per entity.
[ ] Automatically trigger the script when a new file is uploaded, rather than running the script manually
[ ] Limit the API query to files that do not have CIDs yet, rather than returning all files and filtering afterwards

Prior Ideas about Alternative Approaches

### Bypass UWAZI entirely

Note: this was rejected in favor of the current solution because putting data in MongoDB directly wasn't triggering the UWAZI code for indexing, and custom UI would be needed to display file data.

The quickest solution may be to bypass UWAZI entirely. The uploaded files are located in a folder on the UWAZI backend called "uploaded_documents". Information about the uploaded files is in UWAZI's MongoDB database under a collection called "files". Here is the file [model/schema](https://github.com/huridocs/uwazi/blob/development/app/api/files/filesModel.ts). MongoDB has command line access.

So, the rough idea would be to install Kubo on the server that is hosting the UWAZI backend, run `ipfs add` with access to the `uploaded_documents` files, and then use the MongoDB command line access to add a new key called `SHA256CID` with the CID as the value. If this could be triggered whenever a new File document is added to MongoDB, that would be ideal. For the purposes of a prototype, if we need to run it manually, that might be ok.

### Follow the service model

Note: this was rejected because the service model appears to create a separate server, and we did not want to transmit 65GB+ files just to hash them.

It seems like the cleanest way is to follow [the model of already existing UWAZI services](https://github.com/huridocs/uwazi/tree/development/app/api/services). See the [PDF Metadata Extraction docs](https://uwazi.readthedocs.io/en/latest/sysadmin-docs/set-up-pdf-metadata-extraction.html) for an example of expected installation and usage.

However, we don't want to have to transfer the files at all to be able to hash them, as that would mean potentially transferring 65GB+. We would want the CID to be written back to the MongoDB doc (probably through the mongoose model) after the file is hashed.

For the CID code, there are many official implementations. There's [a JS implementation](https://github.com/multiformats/js-multiformats) that would be nice to use since UWAZI is already a Node project. But if we find that is too slow, we might want to look into something like [Rust](https://github.com/multiformats/rust-cid). Just using Kubo directly would also be fine.
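To give a feel for what a CID library has to do, here is a minimal Python sketch (standard library only) for the simplest case: a file that fits in a single 262144-byte chunk. To my understanding, with `--raw-leaves=true` and `--cid-version=1` such a file's CID is a CIDv1 with the `raw` codec wrapping the SHA-256 of the file bytes. Larger files are split into chunks and linked in a dag-pb tree, which this sketch deliberately does not attempt. The function name is hypothetical.

```python
import base64
import hashlib

CHUNK_SIZE = 262144  # matches --chunker=size-262144


def raw_leaf_cid(data: bytes) -> str:
    """Compute the CIDv1 (raw codec, sha2-256) for single-chunk content."""
    if len(data) > CHUNK_SIZE:
        # Multi-chunk files need a dag-pb tree; not handled in this sketch.
        raise ValueError("content does not fit in a single chunk")
    digest = hashlib.sha256(data).digest()
    # 0x01 = CID version 1, 0x55 = raw codec,
    # 0x12 = sha2-256 multihash code, 0x20 = digest length (32 bytes).
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    # Multibase base32: "b" prefix, lowercase alphabet, no padding.
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded


cid = raw_leaf_cid(b"hello world")
```

Since the files here can reach 65GB+, the single-chunk case is not enough for production use, which is one more reason calling Kubo directly, or using a full implementation, remains the safer route.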

Further Links

See https://github.com/huridocs/uwazi/issues/5629 for communication with the UWAZI team about how to best add this feature.

makew0rld commented 1 year ago

UWAZI has custom metadata fields built-in to the UI. I wonder if maybe a better path than modifying the "files" collection would be to add instances of an existing custom metadata attribute.

Here's an example one I created:

The content of these fields goes into the "entities" collection like so:

{
  "_id" : ObjectId("6426fea8789706a3ce183ed3"),
  "language" : "en",
  "mongoLanguage" : "en",
  "sharedId" : "vs9r9aojbo",
  "title" : "Example Document",
  "template" : ObjectId("5bfbb1a0471dd0fc16ada146"),
  "published" : false,
  "creationDate" : 1680277160290,
  "editDate" : 1680277901735,
  "metadata" : { "cid" : [ { "value" : "abc123" } ] },
  "user" : ObjectId("58ad7d240d44252fee4e6212"),
  "permissions" : [ { "refId" : "58ad7d240d44252fee4e6212", "type" : "user", "level" : "write" } ],
  "obsoleteMetadata" : [ ],
  "__v" : 0
}

So I think the code could put the CID values in there, and it would already be integrated with the UI.

makew0rld commented 1 year ago

POST data for changing custom metadata field:

-----------------------------123600231541551552133023611618
Content-Disposition: form-data; name="entity"

{"_id":"642708bad2bf415bc5d4377c","language":"en","metadata":{"cid":[{"value":"abc12345"}]},"sharedId":"mep9lmsm2f9","template":"5bfbb1a0471dd0fc16ada146","title":"Slide backlight template","attachments":[],"documents":[{"_id":"642708bad2bf415bc5d43791","entity":"mep9lmsm2f9","type":"document","filename":"1680279738877uizclgq1spa.pdf","originalname":"slide_backlight_template.pdf","mimetype":"application/pdf","size":24022,"status":"ready","creationDate":1680279738899,"language":"other","toc":[],"totalPages":1}]}
-----------------------------123600231541551552133023611618--

I was able to pare this down into a more minimal request:

{"_id":"642708bad2bf415bc5d4377c","metadata":{"cid":[{"value":"abc123"}]}}

So only the entity's ID needs to be found for the file.

curl command:

curl 'http://localhost:3000/api/entities' -X POST -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0' -H 'Accept: application/json' -H 'Accept-Language: en-CA,en-US;q=0.7,en;q=0.3' -H 'Accept-Encoding: gzip, deflate, br' -H 'X-Requested-With: XMLHttpRequest' -H 'Content-Type: multipart/form-data; boundary=---------------------------6869281615867978331831713992' -H 'Origin: http://localhost:3000' -H 'Connection: keep-alive' -H 'Cookie: locale=en; connect.sid=s%3AhmjCXfpiHy-GY5X3RoBT582DcULMmiKS.W0Nm%2FL%2BcXtwo6%2FxrIGAbsFjzJFXmY6kjjjq38pLOVXc' -H 'Sec-Fetch-Dest: empty' -H 'Sec-Fetch-Mode: cors' -H 'Sec-Fetch-Site: same-origin' --data-binary $'-----------------------------6869281615867978331831713992\r\nContent-Disposition: form-data; name="entity"\r\n\r\n{"_id":"642708bad2bf415bc5d4377c","metadata":{"cid":[{"value":"abc123"}]}}\r\n-----------------------------6869281615867978331831713992--\r\n'
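For scripting, the same request can be built without curl. The sketch below constructs the multipart body by hand with Python's standard library. The function name is hypothetical, the session cookie value is a placeholder you would need to supply, and I have only kept a subset of the browser headers on the assumption that the rest are not required. That assumption is untested.

```python
import json
import urllib.request
import uuid


def build_cid_update_request(base_url, entity_id, cid, session_cookie):
    """Build (but do not send) the POST that sets an entity's cid field."""
    # Minimal payload from above: only the entity _id and the new metadata.
    payload = json.dumps(
        {"_id": entity_id, "metadata": {"cid": [{"value": cid}]}},
        separators=(",", ":"),
    )
    boundary = uuid.uuid4().hex
    # A single form-data part named "entity", as in the captured request.
    body = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="entity"\r\n'
        "\r\n"
        f"{payload}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/entities",
        data=body,
        method="POST",
        headers={
            "Accept": "application/json",
            "X-Requested-With": "XMLHttpRequest",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "Cookie": session_cookie,
        },
    )


request = build_cid_update_request(
    "http://localhost:3000",
    "642708bad2bf415bc5d4377c",
    "abc123",
    "connect.sid=PLACEHOLDER",  # placeholder, not a real session
)
# To actually send it: urllib.request.urlopen(request)
```

Separating building from sending makes it easy to inspect the body before pointing the script at a real instance.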

katelynsills commented 1 year ago

The entity metadata was a great find! Since UWAZI allows for multiple files per entity, we will need to restrict our prototype to one file per entity. I realized I was already making that assumption in how the UI would work, so let's assume that we're only working with one file per entity. If we need to, we can come back and enable multiple files per entity later.

starlinglab / authenticated-attributes