samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

Crypt4GH v1.1 #649

Open AlexanderSenf opened 2 years ago

AlexanderSenf commented 2 years ago

Two ideas to improve Crypt4GH v1

(1) The original Crypt4GH standard validates the content of a file by using MACs (message authentication codes) for each block of the file. This ensures that all data present is the same data that was encrypted, but it doesn't validate that the blocks in a file are in the correct order, that no blocks are missing, or that the end of the file has been reached.

(2) Relating to issue #506: when the edit list is used, nothing prevents unused data from being included in a file, where it would be skipped entirely by the edit list. This would allow a file with an unwanted payload to be generated without the tool reading the file noticing it (the payload would be skipped, but it would still be there). It would be better to explicitly prevent this scenario.
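
To make scenario (2) concrete, here is a minimal sketch in Python. It uses a simplified model of an edit list as alternating discard/keep lengths applied to already-decrypted data. The function name `apply_edit_list`, the sample bytes, and the choice to drop everything after the last entry are assumptions for illustration, not taken from the specification.

```python
def apply_edit_list(data: bytes, lengths: list) -> bytes:
    """Apply alternating discard/keep lengths; lengths[0] is a discard.

    Simplified, hypothetical model: anything after the last entry is
    dropped, which is an assumption of this sketch.
    """
    out = bytearray()
    pos = 0
    for i, n in enumerate(lengths):
        if i % 2 == 1:  # odd positions are "keep" runs
            out += data[pos:pos + n]
        pos += n        # both discard and keep runs advance the cursor
    return bytes(out)


# Decrypted payload that carries extra bytes around the wanted region.
data = b"HEADER" + b"SECRET-EXTRA" + b"wanted-data" + b"TRAILING-JUNK"
lengths = [18, 11]  # discard 18 bytes, then keep 11

visible = apply_edit_list(data, lengths)
hidden = len(data) - len(visible)
print(visible, hidden)
```

A reader that only looks at `visible` has no reason to notice the other bytes, even though they are still stored in the file and still authenticated by their block MACs.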

Both could be added to a new version of the standard with backward compatibility.

silverdaz commented 2 years ago

Regarding (1), we do have MACs at the end of each block (that's the Poly1305), but they are unrelated to each other, so they can't ensure proper ordering of the segments, or that no segments are missing.

An idea was mentioned: we use the associated data from the AEAD mode we already have. We pass a sequence number as the authenticated data, say, the block number shifted by an offset that we write in the header. A bit like TCP sequence numbers.

That way, we do not need the overall checksum to see if blocks of data are missing or re-ordered.
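
A minimal sketch of the idea, using only the Python standard library. It swaps in HMAC-SHA256 as a stand-in for the Poly1305 tag of the real AEAD, so it shows the ordering check but is not the Crypt4GH construction. `KEY`, `SEQ_OFFSET`, and the helpers `tag`, `seal`, and `verify` are hypothetical names.

```python
import hashlib
import hmac
import struct

KEY = b"k" * 32      # hypothetical shared key, for illustration only
SEQ_OFFSET = 1000    # hypothetical offset that would be written in the header


def tag(key: bytes, block: bytes, seq=None) -> bytes:
    """MAC a block; if seq is given, bind it as 8 bytes of associated data."""
    aad = b"" if seq is None else struct.pack("<Q", seq)
    return hmac.new(key, aad + block, hashlib.sha256).digest()


def seal(blocks, with_seq: bool):
    """Return (block, tag) pairs, optionally binding offset + block index."""
    return [
        (b, tag(KEY, b, SEQ_OFFSET + i if with_seq else None))
        for i, b in enumerate(blocks)
    ]


def verify(sealed, with_seq: bool) -> bool:
    """Check each tag against the position at which the block was received."""
    return all(
        hmac.compare_digest(t, tag(KEY, b, SEQ_OFFSET + i if with_seq else None))
        for i, (b, t) in enumerate(sealed)
    )


blocks = [b"segment-0", b"segment-1", b"segment-2"]
plain = seal(blocks, with_seq=False)      # independent per-block MACs
numbered = seal(blocks, with_seq=True)    # MACs bound to a sequence number

swapped_plain = [plain[1], plain[0], plain[2]]
swapped_numbered = [numbered[1], numbered[0], numbered[2]]
dropped_numbered = [numbered[0], numbered[2]]  # middle block removed

print(verify(swapped_plain, False))      # reordering goes unnoticed
print(verify(swapped_numbered, True))    # reordering is rejected
print(verify(dropped_numbered, True))    # a missing middle block is rejected
```

In this sketch, dropping blocks from the end (`numbered[:2]`) still verifies, so sequence numbers alone do not tell the reader that the end of the file was reached. That would need something extra, for example a marker on the final block.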

Regarding (2), I suggest tossing the edit list and putting it where it is relevant, i.e. in some response headers of the HTSGET protocol (which is just piggy-backing on HTTP). The edit list is useful only when downloading just some requested chunks of the file over a given transport mechanism, without needing to re-encrypt the related data segments. Most of the time, we end up cutting out some bytes at the beginning of the first segment and at the end of the last segment.

If we use the block number in the AEAD as in (1), then the edit list must now contain the proper offset; otherwise, we won't be able to decrypt the segments. The matter gets more complicated when we start shipping non-consecutive segments over the wire using (1).

In conclusion, I suggest bumping it to version 2, implementing the sequence numbering as described above, with an offset in the header, using the AEAD for each block, and removing the edit list entirely. If one needs the edit list, one can always use version 1 of Crypt4GH.

What do you say?

silverdaz commented 2 years ago

Note: I can implement that in the Python version as a proof-of-concept. Python is made for quick prototyping after all... (among other things).