unioslo / tsd-file-api

REST API for upload and download of files and JSON data
BSD 3-Clause "New" or "Revised" License

feat: Add checksum calculation support #188

Open kouylekov-usit opened 1 year ago

kouylekov-usit commented 1 year ago

Proposal: Create an API for checksum calculation.

Description:

The goal is to create an API that will calculate the checksum of imported files. The checksum service will listen to a RabbitMQ message queue, to which the file API will add the files to be checksummed. Which files get checksummed is decided by the import request: if the request (PUT or end PATCH) has the ?checksum URL parameter, the file API will queue a checksum job. Once calculated, the checksum will be stored in a .filename.checksum file, and the result can be displayed in the file info when the user lists the imported files.
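For illustration, a rough sketch of what such a checksum listener could look like. The queue name `checksum-jobs`, the JSON message format with a `path` field, the use of pika, and SHA-256 are assumptions made here for the example; none of them are fixed by the proposal.

```python
import hashlib
import json
import os

import pika


def on_message(channel, method, properties, body):
    """Hash the file named in the message and write .filename.checksum next to it."""
    path = json.loads(body)["path"]
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            sha.update(chunk)
    # Store the result alongside the file, as .filename.checksum
    target = os.path.join(os.path.dirname(path), f".{os.path.basename(path)}.checksum")
    with open(target, "w") as out:
        out.write(sha.hexdigest())
    channel.basic_ack(delivery_tag=method.delivery_tag)


connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="checksum-jobs", durable=True)
channel.basic_consume(queue="checksum-jobs", on_message_callback=on_message)
channel.start_consuming()
```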

leondutoit commented 1 year ago

Where would it be provided?

kouylekov-usit commented 1 year ago

> Where would it be provided?

I did not have time to flesh out the proposal last evening; I started thinking about it during a meeting with a project.

leondutoit commented 1 year ago

The complicating factor is that the response headers are written before the body is streamed, so if you calculate the checksum while reading the data, you can only send it in another response.

The only feasible solution I've been able to come up with is to calculate the hash while streaming the file to the client (if the client requests it), store it in a persistent cache, and let the client fetch the checksum with another request after the download. Having already been calculated and cached, that second request would be efficient.

leondutoit commented 1 year ago

I actually started addressing this when I wrote https://github.com/unioslo/tsd-api-lib a while ago, but never took it further.

egiltane commented 1 year ago

A couple of thoughts:

General:

* Calculating cryptographic hashes (i.e. hashes that satisfy cryptographic properties, rather than merely securing transport) is noticeably expensive, especially when performed on files of multiple GiB.

Real-time calculation:

* For PATCHing, initialisation vectors, if any, will have to be saved as part of the transactional state across requests.

Post-hoc calculation:

* Expensive operations call for asynchronicity (read: background tasks or spooling).
* Asynchronicity calls for transactions (transaction IDs and the management thereof).
* To make concurrent access safe, one will most likely need exclusive locking (write locks).
* Locking on NFS requires at least `fcntl()` (via `LOCK_EX`), as `flock()` won't work reliably across file systems. Preferably one would even go for something more robust and custom, such as globally unique identifiers. (See the sketch below.)
* In essence, hashing will increase the complexity of the code significantly.
* Complexity impedes robustness.
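For illustration, a minimal Python sketch of taking an `fcntl()`-based exclusive lock (via `LOCK_EX`) while hashing a file. The function name, path handling, and SHA-256 are assumptions for the example, not anything decided in this thread.

```python
import fcntl
import hashlib


def checksum_with_lock(path: str) -> str:
    """Hash a file while holding an exclusive POSIX record lock on it."""
    # LOCK_EX needs a descriptor opened for writing, hence "rb+".
    with open(path, "rb+") as f:
        # fcntl.lockf() wraps the fcntl() locking calls; LOCK_EX blocks
        # until the exclusive lock can be acquired.
        fcntl.lockf(f, fcntl.LOCK_EX)
        try:
            sha = hashlib.sha256()
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                sha.update(chunk)
            return sha.hexdigest()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
```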

kouylekov-usit commented 1 year ago

I agree with the sentiment.

> A couple of thoughts:
>
> General:
>
> * Calculating cryptographic hashes (i.e. hashes that satisfy cryptographic properties, rather than merely securing transport) is noticeably expensive, especially when performed on files of multiple GiB.
>
> Real-time calculation:
>
> * For PATCHing, initialisation vectors, if any, will have to be saved as part of the transactional state across requests.
>
> Post-hoc calculation:
>
> * Expensive operations call for asynchronicity (read: background tasks or spooling).
> * Asynchronicity calls for transactions (transaction IDs and the management thereof).
> * To make concurrent access safe, one will most likely need exclusive locking (write locks).
> * Locking on NFS requires at least `fcntl()` (via `LOCK_EX`), as `flock()` won't work reliably across file systems. Preferably one would even go for something more robust and custom, such as globally unique identifiers.
> * In essence, hashing will increase the complexity of the code significantly.
> * Complexity impedes robustness.

The point of my proposal is that the file API simply queues a request for calculation. All the calculation can be done by a separate service that handles all of these issues.
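For illustration, a rough sketch of the publishing side, under the same assumptions as the listener sketch above (pika, a hypothetical `checksum-jobs` queue, JSON messages carrying a `path` field):

```python
import json

import pika


def queue_checksum_job(path: str) -> None:
    """Publish a checksum job; the file API would call this for imports with ?checksum."""
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="checksum-jobs", durable=True)
    channel.basic_publish(
        exchange="",
        routing_key="checksum-jobs",
        body=json.dumps({"path": path}),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )
    connection.close()
```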

leondutoit commented 1 year ago

I spent quite a lot of time thinking about this about a year ago, and ended up abandoning any implementation work because I felt that the added complexity would not be worth it.

That said, calculating checksums for downloads is much simpler than for uploads - files are streamed by a single process, read from disk and written to the network in chunks. It would be trivial to pass each chunk to a hash function before flushing it to the network. That is what I made a proof-of-concept for here: https://github.com/unioslo/tsd-api-lib - the idea was that the final checksum would be stored in a cache (backed by a database), for fast retrieval in a separate request.
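As a rough illustration of that download-side idea (not the tsd-api-lib code itself), a sketch assuming a Tornado-style handler; `CHECKSUM_CACHE` stands in for the database-backed cache and SHA-256 is an arbitrary choice.

```python
import hashlib

import tornado.web

CHECKSUM_CACHE = {}  # filename -> hex digest; stand-in for a persistent, DB-backed cache


class ExportHandler(tornado.web.RequestHandler):
    async def get(self, filename: str) -> None:
        sha = hashlib.sha256()
        with open(filename, "rb") as f:
            while chunk := f.read(1024 * 1024):
                sha.update(chunk)   # hash the chunk...
                self.write(chunk)   # ...before flushing it to the network
                await self.flush()
        CHECKSUM_CACHE[filename] = sha.hexdigest()  # retrievable by a later request
```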

For uploads, @petterreinholdtsen and I explored many possible ideas. Calculating checksums on the fly, while the API is handling the upload, is basically too complex to be worth it. Since multiple processes write different chunks of the same file, and since hashing is stateful, you would need to multiplex the incoming data to an external hashing service. And for that to be correct, you would have to send the data to that hashing service only after reading back what has been written to disk, because hashing what passes through the API is not enough - you have to hash what is on disk.

This means that upload hashing has to be asynchronous, handled by a RabbitMQ listener service. It could still take a very long time to hash, say, a 500+ GB file, which we sometimes see in production. That service could write hashes to the same DB in which the download hashes are kept, so that they could be requested via the file API, e.g. with a HEAD request and a header indicating that one wants the latest hash value of the file. If the hash has not yet been calculated, one could return https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/202 or https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/503, and the client would just have to try again later.
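As a rough illustration of that retrieval flow, a sketch assuming a Tornado-style handler; `lookup_checksum()` and the response header name are hypothetical stand-ins, not existing parts of the file API.

```python
from typing import Optional

import tornado.web


def lookup_checksum(filename: str) -> Optional[str]:
    """Hypothetical lookup in the DB where upload/download hashes are kept."""
    return None  # stubbed out: pretend the hash has not been calculated yet


class ChecksumInfoHandler(tornado.web.RequestHandler):
    async def head(self, filename: str) -> None:
        digest = lookup_checksum(filename)
        if digest is None:
            # Hash not (yet) calculated: tell the client to retry later.
            self.set_status(202)
        else:
            self.set_header("X-Checksum-SHA256", digest)  # illustrative header name
            self.set_status(200)
```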

But all in all, I am not sure this is really worth the effort.