neicnordic / sensitive-data-archive

https://neic-sda.readthedocs.io
GNU Affero General Public License v3.0

[sda-download] Implement random access in encrypted files #696

Closed dbampalikis closed 6 months ago

dbampalikis commented 8 months ago

As an sda-user, I want to be able to download specific parts of an encrypted file, in order to get only the region I am interested in.

The service currently allows downloading specific byte ranges of unencrypted files, but for encrypted files that is only possible for byte ranges that start at the beginning of the file. We need to support arbitrary byte ranges of encrypted files in order to support the htsget use case.

A/C

pontus commented 8 months ago

Assuming we want to avoid reencrypting possibly large amounts of data, this should use the support the crypt4gh file format already provides for this purpose.

In short, each file/data stream is split into 64 KiB (65536-byte) blocks that are encrypted separately. A block is also the smallest unit for decryption, as the MACs are created per block.

This means that to send logical bytes 65535-65536 (zero-based), one would need to send the reencrypted header and the first two data blocks (each 65536 bytes plus the extra bytes crypt4gh adds). As the receiver only wants those two bytes, the header would also need a data edit list instructing it to throw away bytes 0-65534 and 65537-131071.

So the header reencryption service needs to be able to accept a data edit list to put in the header.
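The arithmetic in this example can be sketched as follows. This is only an illustration in Python, not code from sda-download: the function name `blocks_and_edit_list` is hypothetical, and the edit list is assumed to be a list of alternating discard/keep lengths starting with a discard, with everything after the final keep dropped (my reading of the crypt4gh spec).

```python
BLOCK_SIZE = 65536  # plaintext bytes per crypt4gh data block


def blocks_and_edit_list(start: int, end: int) -> tuple[int, int, list[int]]:
    """Return (first_block, last_block, edit_list) for the inclusive,
    zero-based plaintext byte range start..end.

    The edit list is relative to the start of first_block, since that is
    the first data block the receiver will see.
    """
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    first_block = start // BLOCK_SIZE
    last_block = end // BLOCK_SIZE
    skip = start - first_block * BLOCK_SIZE  # bytes to discard up front
    keep = end - start + 1                   # bytes the client asked for
    return first_block, last_block, [skip, keep]


first, last, edits = blocks_and_edit_list(65535, 65536)
print(first, last, edits)  # 0 1 [65535, 2]
```

For the range 65535-65536 this selects blocks 0 and 1, discards the first 65535 bytes (0-65534), keeps 2, and leaves the trailing 65537-131071 to be dropped because the list ends on a keep.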

Currently, I think there's only the chacha20_ietf_poly1305 cipher, so a fixed encrypted block size of 65564 bytes (65536 bytes of data plus a 12-byte nonce and a 16-byte MAC) can be used, but it might make sense to have a function in the crypt4gh library that takes a header and returns the block size (or similar).
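A rough sketch of what such a lookup could look like, in Python for brevity. The function name `encrypted_block_size` and the table are hypothetical and not part of any existing crypt4gh library. Method id 0 for chacha20_ietf_poly1305 is my reading of the spec, and the sketch takes the method id directly rather than parsing a header.

```python
PLAINTEXT_BLOCK = 65536  # bytes of data per block before encryption

# Data encryption method id -> per-block overhead in bytes (nonce + MAC).
# Assumption: 0 is chacha20_ietf_poly1305 with a 12-byte nonce and 16-byte MAC.
OVERHEAD = {0: 12 + 16}


def encrypted_block_size(method: int) -> int:
    """Size in bytes of one full encrypted data block for the given method."""
    try:
        return PLAINTEXT_BLOCK + OVERHEAD[method]
    except KeyError:
        raise ValueError(f"unsupported data encryption method: {method}") from None


print(encrypted_block_size(0))  # 65564
```

Keeping the overhead in a table means a new cipher would only need a new entry, and callers would fail loudly on an unknown method instead of silently using 65564.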

pontus commented 8 months ago

For both the unencrypted and encrypted data out cases, there is also a performance motive: rather than requesting the entire object from the archive and returning only the wanted part, we should request only the range actually needed.

For the encrypted case, this is fairly simple - the S3 download client could pass a Range with the bytes wanted.
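As an illustration, the Range value for a run of whole encrypted blocks could be computed like this (Python sketch). The function name `range_header` and its parameters are hypothetical. `data_offset` is assumed to be 0, i.e. the archived object holds only data blocks with the header stored elsewhere, and would need to be the header length otherwise.

```python
def range_header(first_block: int, last_block: int,
                 segment_size: int = 65564, data_offset: int = 0) -> str:
    """Build an HTTP Range value covering encrypted blocks
    first_block..last_block (inclusive) of the archived object."""
    if first_block < 0 or last_block < first_block:
        raise ValueError("expected 0 <= first_block <= last_block")
    first_byte = data_offset + first_block * segment_size
    last_byte = data_offset + (last_block + 1) * segment_size - 1
    return f"bytes={first_byte}-{last_byte}"


print(range_header(0, 1))  # bytes=0-131127
```

If the last block of the file is shorter than a full block, the computed end lies past the end of the object. HTTP range semantics clamp that to the object size, so the request should still return the short final block.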

The question is whether we would prefer unified handling for the unencrypted and encrypted cases.

For the unencrypted case, it might make sense to have a reader that maps calls to Read to an S3 call that is essentially managed synchronously, or something similar.
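A minimal sketch of such a reader, in Python rather than Go, and with a stand-in `fetch` callable instead of a real S3 client. The class name `RangedReader` and the `fetch(first, last)` signature are assumptions made for the example.

```python
from typing import Callable


class RangedReader:
    """File-like reader that fetches only the bytes each read() asks for."""

    def __init__(self, fetch: Callable[[int, int], bytes], size: int):
        self._fetch = fetch  # fetch(first, last) -> bytes, inclusive offsets
        self._size = size    # total object size in bytes
        self._pos = 0

    def seek(self, pos: int) -> int:
        # Clamp to the object bounds; no request is made until read().
        self._pos = max(0, min(pos, self._size))
        return self._pos

    def read(self, n: int = -1) -> bytes:
        remaining = self._size - self._pos
        if n < 0 or n > remaining:
            n = remaining
        if n <= 0:
            return b""
        # One synchronous ranged request per read() call.
        data = self._fetch(self._pos, self._pos + n - 1)
        self._pos += len(data)
        return data


# Stand-in for the archive: serve ranges from memory and log each request.
obj = bytes(range(256)) * 4
calls = []


def fake_fetch(first: int, last: int) -> bytes:
    calls.append((first, last))
    return obj[first:last + 1]


reader = RangedReader(fake_fetch, len(obj))
reader.seek(100)
chunk = reader.read(10)
print(calls)  # [(100, 109)]
```

A real implementation would probably want some buffering so that many small reads do not each turn into a separate request.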

MalinAhlberg commented 8 months ago

When decrypting a partial file, the resulting file size should be exactly what was originally asked for, not more. I.e., the extra data sent in order to reach the next data block boundary should be removed. Use a data edit list. See https://github.com/neicnordic/sensitive-data-archive/pull/695#discussion_r1512952339
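To illustrate the expected result on the receiving side, here is a small Python sketch that applies an edit list to already-decrypted data. The function `apply_edit_list` is hypothetical. It assumes the lengths alternate discard, keep, discard, and so on, that data after a final keep is dropped, and that data after a final discard is kept, which is my understanding of the crypt4gh spec.

```python
def apply_edit_list(plaintext: bytes, lengths: list[int]) -> bytes:
    """Return only the bytes an edit list says to keep."""
    out = bytearray()
    pos = 0
    for i, length in enumerate(lengths):
        if i % 2 == 1:  # odd positions are "keep" lengths
            out += plaintext[pos:pos + length]
        pos += length
    if len(lengths) % 2 == 1:  # list ended on a discard: keep the remainder
        out += plaintext[pos:]
    return bytes(out)


# Two full decrypted blocks, of which only bytes 65535-65536 were requested.
decrypted = bytes(i % 251 for i in range(2 * 65536))
wanted = apply_edit_list(decrypted, [65535, 2])
print(len(wanted))  # 2
```

The output is 2 bytes long even though two whole blocks (131072 bytes) were decrypted, which is the behaviour asked for here.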