umccr / htsget-rs

A server implementation of the htsget protocol for bioinformatics in Rust
https://samtools.github.io/hts-specs/htsget.html
MIT License
39 stars 9 forks source link

feat: crypt4gh #223

Open mmalenic opened 10 months ago

mmalenic commented 10 months ago

I'm creating this as a draft pull request because I don't think it should be merged into the main branch as it is. A large part of the code is inside the async-crypt4gh crate which doesn't belong in this repository. I wanted to show this as a demonstration of Crypt4GH and htsget-rs using a UrlStorage backend. The main explanation of the logic is inside the docs/crypt4gh folder which contains a diagram.

Recent additions include creating/reading edit lists in async-crypt4gh, as the edit list implementation in the crypt4gh-rust crate was incomplete.

I also want to mention that none of this is set in stone. See the alternative design considerations in the ARCHITECTURE.md docs.

brainstorm commented 9 months ago

Adding this crate to crypt4gh-rust(among other breaking changes) happening over this branch: https://github.com/umccr/crypt4gh-rust/tree/break_public_api_simplify_key_mgmt_and_tests

pontus commented 6 months ago

Hi, I'm involved in a project where we're to a large part using the same software stack as in GDI and Federated EGA (and we're to use the same stack for more), and recently I've done some work adjacent to what's being worked on for htsget support.

I haven't looked at this PR but I believe I have a quite good idea of what's being implemented to support in the archive software stack, and there are parts I don't know how they are intended to work. That may be a failure of imagination on my part or use cases that you don't care about or current limits in the archive implementation that hasn't been addressed yet, but there are at least some things I feel I need to ask about.

What's a good way to do so? Should I just raise them here or over e.g. e-mail or some chat? If needed I can certainly do meetings but suspect time difference might make that tricky to set up.

mmalenic commented 6 months ago

Hi @pontus, more than happy to answer any questions. Feel free to post them here, or open a separate issue for them.

pontus commented 6 months ago

So, to start with - with the design as I understand it (stitching together crypt4gh blocks into a single file), it's not possible to utilise different resources (files on the backend) as there's only one symmetric key per crypt4gh file/stream and picking blocks from different backend files would cause mismatch for the symmetric key.

Similarly, since crypt4gh doesn't have length information for data blocks, if the last block of the file isn't full sized (65536 data bytes), it can't be used with anything following it since the MAC and following data blocks would be misaligned (and it's difficult to fix by padding since one would also need to calculate the block MAC).

So those are some use-case restrictions that I wanted to check if there is anything I'm missing technically and also if those use case restrictions are considered unimportant.

mmalenic commented 6 months ago

Yes, that's right. If you are stitching together different files/resources, then care would need to be taken to edit/remove data for the header and EOF blocks to make sure that the files remain readable. This is somewhat a limitation of the specs, i.e. the htsget spec requires that non-header byte ranges returned compose an entire file, headers and EOF blocks included. This problem would arise when using non-Crypt4GH files too. For example, the BAM spec does allow "insignificant EOF marker" blocks that can be ignored if they are contained in the middle of a file, but I don't think CRAM provides the same option.

However, I'm not quite clear on the context behind this, do you have an example of where it would be necessary to stitch together files, vs just reading files individually?

I'd say that this kind of operation would require the client to properly edit the htsget responses to stitch together files, however it should be possible to do so by individually reading responses from htsget. It's also possible that the htsget server could help make this easier, for example, by labelling which URL tickets correspond to the header bytes/EOF block, or what data to pad/remove. This information is already partially supplied with the class field in the response, which determines header vs body bytes. A different field could be used to extend this and support labelling EOF data.

At this point though, wouldn't it be simpler to just leave the files unmerged? Or is there some issue with performance or storage that I'm not aware of?

pontus commented 6 months ago

Thanks for the response, for starters, while I'm not sure I understand completely what you mean, I wanted to start by checking that my understanding was correct and those use cases get broken by design - compared to (my understanding of) what the htsget protocol supports. (For clarity; this would be on the crypt4gh level, so in addition to whatever issues there might be with constructing the underlying data stream.)

It may be that the protocol design was overly ambitious but then again, I've been in on talks in my current area of interest (imaging) about using htsget for that, which would likely run into the "several resources" issue.

As to actual practical issues; while I don't think the multiple resources would be that common with current usage patterns, it feels lik a range from the final block not at the very end is something that might not be that uncommon.

mmalenic commented 6 months ago

I wanted to start by checking that my understanding was correct and those use cases get broken by design

Yes, that's correct. I don't think this is a problem that could be solved by the htsget server alone, and it would require cooperation with the client. Although, I'd say it's probably doable. It seems possible that the client could retrieve requests from the htsget server one at a time and make the edits/merges that it needs. For example, it could request one resource, decrypt it, then request another resource, decrypt it (potentially with another key), and then combine those resources.

it's not possible to utilise different resources (files on the backend) as there's only one symmetric key per crypt4gh file/stream and picking blocks from different backend files would cause mismatch for the symmetric key.

I just want to clarify with this. It's possible to encrypt a single resource with multiple keys, and this is specifically supported by the Crypt4GH protocol. The htsget server can then return a Crypt4GH header with support for multiple keys. However, the URL tickets may include the last Crypt4GH block, which could be less than 64KiB. It's also possible to encrypt/decrypt different resources with different keys.

pontus commented 6 months ago

I just want to clarify with this. It's possible to encrypt a single resource with multiple keys, and this is specifically supported by the Crypt4GH protocol. The htsget server can then return a Crypt4GH header with support for multiple keys. However, the URL tickets may include the last Crypt4GH block, which could be less than 64KiB. It's also possible to encrypt/decrypt different resources with different keys.

True, thanks, I'd actually forgotten (and while we have this implemented, I'm not sure if it's ever been tested with our implementation, I'll try to look over the automatic tests for that at least).

But this still leaves the problem with having a final block (not 65535 bytes long) in any place other than the last, right?

mmalenic commented 6 months ago

But this still leaves the problem with having a final block (not 65535 bytes long) in any place other than the last, right?

Yes. The Crypt4GH spec seems to strongly imply that only the last block can be less than 64KiB. Although, since the last block also contains the nonce and MAC components, I don't think there is anything preventing decryption of a smaller block that is present in the middle of a stream if it's position is known (although current libraries may error).

pontus commented 6 months ago

Unfortunately not - with crypt4gh not having a package length, the way to get a package is either to read 65536 bytes+extras for the used encryption or until EOF.

I assume it was done this way to make it possible to make guarantee you can read data from anywhere in fairly cheaply (there are of course other ways it could have been done, but this is where we're at).

So, assuming the current crypt4gh standard, a packet in stream can't be shorter, meaning if one wants to use a final packet in stream, one would need to extend it somehow, but as that would calculating a new MAC involve I don't see how that could happen without being able to decode the packet.

It could possibly be worked around by having the htsget being able to access it (so having a private key, which seems non-optimal) or having the archive do special magic to extend short blocks, but that of course adds a layer of complexity (and it's not obvious to me it's better than e.g. a design composed of streams that are decrypted separately).

mmalenic commented 5 months ago

Yes, it would be possible for htsget/the archive to make these edits, but it would require decrypting the last block and having the associated key to do that. I think this would probably add more complexity than necessary for something like this, especially if the client was able to do it on their end.

If considering htsget with access to local data files (e.g. using LocalStorage), then this could be more suitable because the server would probably be able to decrypt those files anyway. However, Crypt4GH for LocalStorage or S3Storage is not implemented in this PR (although there are plans for it eventually).

pontus commented 5 months ago

Sorry for the delayed response here.

Do I understand correctly that you share my concern with the current design?

I think getting data via htsget is an important feature but am not terribly interested in exactly how it's done, but as for serving bits of crypt4gh encrypted data, it makes a lot more sense to me to use coordinates in the resulting decrypted stream than any encrypted stream because it frees the "client side" from dealing with the in archive file's data-edit-list (or assuming there is none). Similarly, if the header is included in the coordinate system, there's the problem that the size of the header may well depend on what bits of the file in archive is requested (because of e.g. different needs for the data-edit-list, different symmetric keys being included and so on). And I do have an interest in the code for the sensitive-data-archive not having to jump through too many hoops to support special cases.

mmalenic commented 5 months ago

Do I understand correctly that you share my concern with the current design?

I don't think I fully understand what the concern is, but I think you would like the client and backend to avoid doing extra work? I agree, and htsget-rs already attempts to do that as much as possible. At some point though, the system needs to interact with Crypt4GH if that's the goal. E.g. the client needs to understand how to decrypt data if it's receiving encrypted bytes.

it makes a lot more sense to me to use coordinates in the resulting decrypted stream than any encrypted stream

I'm not sure I understand what you mean here. Are you saying that the client should receive URL tickets from htsget-rs which represent unencrypted data from the archive backend? If so, this supported with the send_unencrypted_to_client config option.

if the header is included in the coordinate system, there's the problem that the size of the header may well depend on what bits of the file in archive is requested

This can definitely be addressed by htsget-rs by requesting more data if it is not enough. Note, that the archive backend doesn't need to be super smart here, it just needs to respond to HTTP range requests from htsget-rs, no need to convert between encrypted/unencrypted positions.

Is there a particular problem that you have encountered trying to set up htsget-rs with Crypt4GH? I think it would help me understand what kind of issues you are encountering with a concrete example. I'd be happy to help, and I'm pretty flexible with including features if it's possible.