tuupola / base85

Base85 encoder and decoder for arbitrary data
MIT License

ASCII85. Four zero bytes encoded as !!!!! instead of z #21

Open distlibs opened 3 years ago

distlibs commented 3 years ago

Four zero bytes encoded as !!!!! instead of z.

$ascii85 = new Base85([
    "characters" => Base85::ASCII85,
    "compress.spaces" => false,
    "compress.zeroes" => true
]);
print $ascii85->encode("\0\0\0\0"); // !!!!!

I tested with https://cryptii.com/pipes/ascii85-encoding


I tested with Python too. base64.a85encode outputs z for four zero bytes.
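
The Python check mentioned above can be reproduced with the standard library alone. This is a minimal sketch; the variable name `four_zeros` is only illustrative.

```python
import base64

# Four zero bytes form one complete 4-byte group, so Python's encoder
# folds them into the single character "z".
four_zeros = base64.a85encode(b"\x00\x00\x00\x00")
print(four_zeros)  # b'z'
```

Here the input is exactly one full group, so no padding is involved.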

tuupola commented 3 years ago

It is intentional: the z compression does not apply to the final block. The input string is padded with 0x00 bytes to a multiple of 4 bytes, and we need to be able to tell whether the zero bytes in the final block are padding or actual data.

For example, if we have the data: 0xaabbccddee

The padded four-byte blocks would be: 0xaabbccdd 0xee000000

$ascii85->encode(hex2bin("aabbccddee"));
/* Wk6L2mJ */
bin2hex($ascii85->decode("Wk6L2mJ"));
/* aabbccddee */
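
The padding step described above can be sketched in a few lines. It is written in Python because the thread already refers to Python's base64 module. The helper name `pad_to_blocks` is hypothetical and is not part of tuupola/base85.

```python
def pad_to_blocks(data: bytes) -> list[str]:
    """Zero-pad data to a multiple of 4 bytes and return the blocks as hex."""
    padding = (-len(data)) % 4          # number of 0x00 bytes to append
    padded = data + b"\x00" * padding
    return [padded[i:i + 4].hex() for i in range(0, len(padded), 4)]


blocks = pad_to_blocks(bytes.fromhex("aabbccddee"))
print(blocks)  # ['aabbccdd', 'ee000000']
```

With the input aabbccdd00 the same helper returns a second block of 00000000, which is where a real zero byte and padding become indistinguishable.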

If, however, the data was: 0xaabbccdd00

The padded four-byte blocks would be: 0xaabbccdd 0x00000000

With the current behaviour, the z compression is not applied to the last block:

$ascii85->encode(hex2bin("aabbccdd00"));
/* Wk6L2!! */
print bin2hex($ascii85->decode("Wk6L2!!"));
/* aabbccdd00 */
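
The behaviour described in this comment can be modelled with a short Python sketch. It is not the library's actual code: the function name `encode_no_z_on_last` is hypothetical, and the rule "never use z for the final block" is taken from the description above.

```python
def encode_no_z_on_last(data: bytes) -> str:
    """Ascii85-encode data, folding zero blocks to "z" except the final one."""
    padding = (-len(data)) % 4
    padded = data + b"\x00" * padding
    words = [int.from_bytes(padded[i:i + 4], "big")
             for i in range(0, len(padded), 4)]
    out = []
    for index, word in enumerate(words):
        is_last = index == len(words) - 1
        if word == 0 and not is_last:
            out.append("z")             # z shortcut only for non-final blocks
            continue
        digits = []
        for _ in range(5):              # five base-85 digits, least significant first
            word, rem = divmod(word, 85)
            digits.append(chr(rem + 33))
        out.append("".join(reversed(digits)))
    encoded = "".join(out)
    # The final block is always five characters here, so dropping one
    # character per padding byte leaves n + 1 characters for n data bytes.
    return encoded[:len(encoded) - padding] if padding else encoded


print(encode_no_z_on_last(bytes.fromhex("aabbccdd00")))  # Wk6L2!!
print(encode_no_z_on_last(b"\x00\x00\x00\x00"))          # !!!!!
```

Because the final block is never written as z, the output length alone tells a decoder how many padding bytes to discard.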

However, if the z compression were also applied to the last block, the decoder could no longer tell which zero bytes are padding and which are data. You can test this by commenting out these lines.

$ascii85->encode(hex2bin("aabbccdd00"));
/* Wk6L2z */
print bin2hex($ascii85->decode("Wk6L2z"));
/* aabbccdd00000000 */
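
The same ambiguity can be reproduced outside the library with Python's standard decoder. This is only a cross-check using `base64.a85decode`. It says nothing about the internals of tuupola/base85, and the variable names are illustrative.

```python
import base64

# A trailing "z" always expands to four zero bytes, so the decoder
# cannot know that three of them were meant to be padding.
with_z = base64.a85decode(b"Wk6L2z")
# A truncated final group ("!!") carries the length: 2 characters -> 1 byte.
without_z = base64.a85decode(b"Wk6L2!!")

print(with_z.hex())     # aabbccdd00000000
print(without_z.hex())  # aabbccdd00
```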

You can also see that the Cryptii page gives the wrong result for the input aabbccdd00.

distlibs commented 3 years ago

Where did you find that "the z compression is not added to the last block"? I would like to read it.

tuupola commented 3 years ago

It is described at least in the Adobe documents "Document management — Portable document format — Part 1: PDF 1.7" and "PostScript® Language Reference, third edition". The relevant parts are:

"If the length of the data to be encoded is not a multiple of 4 bytes, the last, partial group of 4 shall be used to produce a last, partial group of 5 output characters. Given n (1, 2, or 3) bytes of binary data, the encoder shall first append 4 - n zero bytes to make a complete group of 4. It shall encode this group in the usual way, but shall not apply the special z case. Finally, it shall write only the first n + 1 characters of the resulting group of 5. These characters shall be immediately followed by the ~> EOD marker."

and

"If the ASCII85Encode filter is closed when the number of characters written to it is not a multiple of 4, it uses the characters of the last, partial 4-tuple to produce a last, partial 5-tuple of output. Given n (1, 2, or 3) bytes of binary data, it first appends 4 − n zero bytes to make a complete 4-tuple. Then, it encodes the 4-tuple in the usual way, but without applying the z special case. Finally, it writes the first n + 1 bytes of the resulting 5-tuple. Those bytes are followed immediately by the ~> EOD marker. This information is sufficient to correctly encode the number of final bytes and the values of those bytes. "