justinludwig / jpgpj

Java Pretty Good Privacy Jig
MIT License
75 stars 20 forks source link

Logging, shortcuts and other improvements #34

Closed lgoldstein closed 3 years ago

lgoldstein commented 3 years ago
justinludwig commented 3 years ago

Thanks for all your work on this. I don't understand the change to Encryptor.getLiteralBuffer(), however.

The encryption, compression, and literal buffers were all sized so that bouncy castle wouldn't split up the encryption, compression, or literal packets with additional partial-packet headers for small files; and that for large files, it wouldn't insert a ton of partial-packet headers (just one every 64k).

With this change, it looks like the literal packet will always be split up into a bunch of 512-byte partial packets for files of any size (unless the file name is really large, in which case it could be split into 1024-byte partial packets or larger).

So my questions are:

  1. Why vary the partial-packet size based on the size of the file name?
  2. Why not use the same sizing for encryption and compression partial-packets?
  3. If you're finding it useful to tune the partial-packet size, how about instead change the Encryptor.bestPacketSize() method to allow the max size to be configured? -- if the max size was say a settable property on the Encryptor, it could be set really small for use-cases where that would be optimal, or really big for use-cases where a really big partial-packet size would be better
lgoldstein commented 3 years ago

Why vary the partial-packet size based on the size of the file name? Why not use the same sizing for encryption and compression partial-packets?

The original code was doing

return new byte[bestPacketSize(meta.getLength())];

This is obviously wrong - the literal buffer size for a 1MB file named "X" is the same size as the one for a 1KB file named "X". This is due to the fact that the PGPLiteralDataGenerator encodes the name + format + length + last-modified-time - which means that the required buffer size depends on the file name size and not the file content size. In other words, the old code was extremely wasteful since it would allocate a huge buffer to encode just a few bytes.

I think this 100% aligned with the principle that you mentioned (and with which I agree wholeheartedly):

The encryption, compression, and literal buffers were all sized so that bouncy castle wouldn't split up the encryption, compression, or literal packets with additional partial-packet headers for small files; and that for large files, it wouldn't insert a ton of partial-packet headers (just one every 64k).

As far as

If you're finding it useful to tune the partial-packet size, how about instead change the Encryptor.bestPacketSize() method to allow the max size to be configured? -- if the max size was say a settable property on the Encryptor, it could be set really small for use-cases where that would be optimal, or really big for use-cases where a really big partial-packet size would be better

While it may provide extra optimization I don't think the cost in code complexity warrants it - at least not for the time being.

justinludwig commented 3 years ago

Ah -- yes, that's the first thing that the PGPLiteralDataGenerator writes to the buffer (format + name + last modified etc). But then the full plaintext of the file is also written through the buffer.

The bouncy-castle code is hard to follow, but it's trying to encapsulate the logic from here:

https://tools.ietf.org/html/rfc4880#section-5.9

It's straightforward until you get to the last bulleted item in section 5.9 (the literal data itself -- aka the plaintext). With the literal data, you either need to know its full exact length ahead of time, or you need to buffer the data, write a partial-packet header with the length of the partial data, write the partial data, buffer more data, write the next partial-packet header, write the partial data, and so on, until you get to the end of the data. The way this is output is described earlier in the spec, in section 4 -- with the partial-packet structure in particular described here:

https://tools.ietf.org/html/rfc4880#section-4.2.2.4

The (wrapped) BCPGOutputStream object that the PGPLiteralDataGenerator.open() method returns tries to encapsulate this partial-packet sizing mechanism (and also tries to encapsulate all the other variations of packet sizing from section 4, which is why it's so convoluted). It will hold onto the buffer we pass into PGPLiteralDataGenerator.open(), and use it to buffer and then write each partial chunk of plaintext as we push it through through the encryption pipeline. Not until the BCPGOutputStream's close() method is called, when it finally knows for sure the full length of the literal data, will it write the final length header and remaining buffered data.

So that's why, even though it initially doesn't look like it's doing anything, the buffer for the literal-data generator is actually providing the internal buffering (and corresponding partial-packet sizing) for the entire original plaintext content.

lgoldstein commented 3 years ago

Good to know - I will revert the change and update this PR

lgoldstein commented 3 years ago

Done...