Question about padding and the file format

mxmlnkn commented 1 year ago

Hello,

Someone requested support for securetar archives for ratarmount. That got me to look into the file format.

Why was it decided to only call padding.PKCS7 in some circumstances instead of just always?

If it had been always called, then the PKCS method would have ensured that at least 1 padding byte is added. This is important because it makes removing the padding deterministic and easy.

In the current state, it is impossible to remove the padding with certainty because it might be that no padding was added and that the encrypted data just looks like padding, e.g., it ends with a 0x01 byte or two 0x02 0x02 bytes. Therefore, there is at least a 1 in 256*16 chance for a false positive when trying to remove the padding.

For uncompressed bytes, it doesn't even matter because a TAR file consists of 512 B blocks but for compressed TARs it does matter. Note that gzip and bzip2 allow multiple streams to be concatenated to each other. Therefore, gzip and bzip2 decompressors will try to keep reading the padding bytes and should give a warning if they are garbage.

Is it possible to still change this in the format specification? Some magic bytes for file format identification also would have been nice.

mxmlnkn commented 1 year ago

It gets weirder. It seems that there is multiple padding. This means that depending on the number of write calls, there can be arbitrary many padding. I think this is a bug.

I have edited securetar/__init__.py like this:

    def write(self, data: bytes) -> None:
        """Write data."""
        print(f"Should write {len(data)} B of data")
        if len(data) % BLOCK_SIZE != 0:
            padder = padding.PKCS7(BLOCK_SIZE_BITS).padder()
            data = padder.update(data) + padder.finalize()

        print(f"Padded data to {len(data)} B")
        encrypted = self._encrypt.update(data)
        print(f"Encrypted padded data sized {len(encrypted)} B")
        os.write(self._file, encrypted)

I have created a setup like this:

mkdir -p a/b
echo pragzip > a/b/pragzip
python3 -c '
import pathlib; import securetar
with securetar.SecureTarFile("securetar.tar.aes", "w", b"passwordpasswordpasswordpassword") as tarFile:
    securetar.atomic_contents_add(tarFile, pathlib.Path("a/"), excludes=[], arcname=".")
'

The debug output is something like this:

Should write 254 B of data
Padded data to 256 B
Encrypted padded data sized 256 B

Should write 4 B of data
Padded data to 16 B
Encrypted padded data sized 16 B

Should write 4 B of data
Padded data to 16 B
Encrypted padded data sized 16 B

I noticed this because the decrypted file has a broken gzip footer, i.e., the checksum size modulo 2^32 and the CRC32 are both wrong.

The decrypted output looks something like this:

python3 -c '
import pathlib; import securetar
sf = securetar.SecureTarFile("securetar.tar.aes", "r", b"passwordpasswordpasswordpassword")
sf.__enter__()
print(sf.read(1000))
'

Decrypted (32 B per line):

        Gzip Header
<------------------------------>
1F 8B 08 08 94 24 1E 64 02 FF 00|ED D4 CB 4A C4  40 10 85 E1 5E FB 14 79 82 4E 55 75 F5 6D 21 B8
74 E9 2B B4 18 34 8B 81 10 33 30 FA F4 76 33 E2  65 31 B8 9A A8 93 F3 6D 12 92 40 20 E1 FC B6 B7
FD CD 5D 39 DC 0E E5 61 98 CD 59 D0 D1 A9 23 91  D3 CF F3 76 9D 49 58 4C 77 30 2B D8 3F 2F 65 AE
AF 37 DB 24 A9 DB 2D E3 6E B8 E6 10 7D A0 5C 3F  BD 4D 59 34 A8 5E 19 B8 78 B6 3F FF 3B DA A8 E3

71 EF 1C FD B7 E3 C7 E6 59 5D 88 51 A5 DE A9 FB  67 AE 49 E8 FC 9A FB 7F 7A 99 CA 32 96 93 CF FD
74 FF DF FE 7F F4 1F FD FF D2 7F 16 9B 03 71 4A  01 FD DF 80 FB BF D6 7F C7 AD FF C2 E8 3F FA 8F
FE AF DA FF D6 7D F1 C9 FA AC 1A B2 43 FF 37 D1  FF 69 2E 8F AF E3 F4 9B FD E7 D6 7F A5 C8 12 D9
BB BA 7F 09 54 F7 4F 6B EE 7F A3 FD 7F FF FB D8  3A 00 00 00 00 00 00 00 00 00 C0 85 79 03|02 02
                                                                                         / <--->
                                                              detected deflate block end  Padding
19 B2 26 AE 0C 0C 0C 0C 0C 0C 0C 0C 0C 0C 0C 0C  00 28 00 00 0C 0C 0C 0C 0C 0C 0C 0C 0C 0C 0C 0C
<---------> <--------------------------------->  <---------> <--------------------------------->
Gzip CRC32           Second Padding               Gzip Size            Third Padding

As far as I can see, the decryption / read method does not remove either padding.

The size 00 28 00 00 = 10240 B fits the usual TAR archive size. Simply removing the 0C run still does not fix the gzip archive. The 00 00 00 00 00 00 00 00 00 run looks very suspicious. It doesn't look like compressed data. But it is. Deflate can only compress so much at once (runs of 256 zeros) and if the runs are longer, the compressed bits will look repeated.

pragzip --analyze yields:

Gzip header:
    Gzip Stream Count   : 1
    Compressed Offset   : 0 B 0 b
    Uncompressed Offset : 0 B
    File Name           : 
    Modification Time   : 1679697044
    OS                  : unknown
    Flags               : compressor used maximum compression, slowest algorithm

Deflate block:
    Final Block             : True
    Compression Type        : Dynamic Huffman
    File Statistics:
        Total Block Count   : 1
        Compressed Offset   : 11 B 0 b
        Uncompressed Offset : 0 B
    Gzip Stream Statistics:
        Block Count         : 1
        Compressed Offset   : 11 B 0 b
        Uncompressed Offset : 0 B
    Compressed Size         : 242 B 3 b
    Uncompressed Size       : 10240 B
    Compression Ratio       : 42.2486
    Huffman Alphabets:
        Precode  : 12 CLs in [2, 6] out of 18: CL:Count, 0:7, 2:2, 3:1, 4:4, 5:3, 6:2, 
        Distance : 11 CLs in [1, 6] out of 21: CL:Count, 0:10, 1:1, 2:1, 3:1, 6:8, 
        Literals : 54 CLs in [3, 8] out of 286: CL:Count, 0:232, 3:1, 4:4, 5:11, 6:9, 7:7, 8:22, 
    Symbol Types:
        Literal         : 133 (66.8342 %)
        Back-References : 66 (33.1658 %)

Reading gzip footer
Gzip footer:
    Decompressed Size % 2^32  : 202157606
    CRC32                     : 0xb2190202
Caught exception:
Mismatching size (10240 <-> footer: 202157606) for gzip stream!, typeid: St13runtime_error

So, the first Deflate block should end at / the gzip footer should start at offset 11 B + 242 B 3 b + 7b = 254 B.

After removing all three paddings, it works. But I don't see how I would be able to do that programmatically without knowing the size of each write during archive creation?!

pvizeli commented 1 year ago

I do not have time to investigate. But I'm open to suggestions. However, we can't break supporting existing created files with this library.

mxmlnkn commented 1 year ago

Here is an example showing that adding padding in every write call is a bug and leads to corrupted files in the case of encrypted gzip-compressed tar files:

import os
import pathlib
import securetar

os.makedirs("foo", exist_ok=True)

with open("foo/mimi", "wb") as file:
    file.write(os.urandom(12345))

with open("foo/momo", "wb") as file:
    file.write(b"test")

with open("foo/mama", "wb") as file:
    file.write(os.urandom(12345))

with securetar.SecureTarFile("test.tar.gz", "w", b"1234567890abcdef", gzip = True, bufsize = 333) as tar_file:
    securetar.atomic_contents_add(
        tar_file,
        pathlib.Path("foo"),
        excludes=[],
        arcname=".",
    )

with securetar.SecureTarFile("test.tar.gz", "r", b"1234567890abcdef", gzip = True) as tar_file:
    for tarInfo in tar_file:
        print("tarInfo:", tarInfo)

Note the bufsize argument to SecureTar, which is not divisible by 16.

The output is:

tarInfo: <TarInfo '.' at 0x7fd16b4b5a80>
tarInfo: <TarInfo 'mimi' at 0x7fd16b4b5b40>
Traceback (most recent call last):
  File "test-securetar.py", line 26, in <module>
    for tarInfo in tar_file:
  File "/usr/lib/python3.10/tarfile.py", line 2475, in __iter__
    tarinfo = self.next()
  File "/usr/lib/python3.10/tarfile.py", line 2370, in next
    raise ReadError(str(e)) from None
tarfile.ReadError: bad checksum

The debug output to show the problem with the padding in every write call is:

Write data sized 333
  -> padded: 336
Write data sized 333
  -> padded: 336
[... Repeated 46x ...]
Write data sized 333
  -> padded: 336
Write data sized 8826
  -> padded: 8832
Write data sized 4
  -> padded: 16
Write data sized 4
  -> padded: 16

With the default buffer size, the file contents are fortunately not corrupted because tarfile only calls write on the file object with the buffer size or with the remaining data when closing the output stream. That buffer size is 10240 and divisible by 16 and therefore no padding will be added.

In my opinion the Write data sized 8826 line is another bug inside tarfile. It should only write 333 B at maximum when the buffer size is configured to be 333 B.

However, paddings will always be added for the two 4-byte writes for the CRC32 and the file size modulo 2^32. See also this code in tarfile.py:

                if self.comptype == "gz":
                    self.fileobj.write(struct.pack("<L", self.crc))
                    self.fileobj.write(struct.pack("<L", self.pos & 0xffffFFFF))

These two writes correspond to this debug output:

Write data sized 4
  -> padded: 16
Write data sized 4
  -> padded: 16

This corrupts the gzip footer and it only does not lead to an error because tarfile implements gzip parsing itself (see _init_read_gz inside tarfile.py) and tarfile simply does NOT check the gzip CRC32. This is a bug in my opinion and as soon as this bug is fixed, all AES-encrypted gzip-compressed TAR files generated with securetar will be reported as corrupted.

These would be my proposed changes to securetar:

Do not add padding inside SecureTarFile.write
- For encryption to still work, this might require either checking whether self._aes.update returns none or keep an internal buffer for remainders < 16 B, which should be written inside SecureTarFile.close
Always add padding inside SecureTarFile.__exit__ even when there is no remainder < 16 B.

These changes would be backward compatible so far that all older versions would be able to read such files.

The benefit would be that this changed version would write out valid gzip files that will be readable by other tools or when tarfile suddenly starts verifying the gzip footer.

diff --git a/securetar/__init__.py b/securetar/__init__.py
index 43af262..5eea6ab 100644
--- a/securetar/__init__.py
+++ b/securetar/__init__.py
@@ -54,6 +54,8 @@ class SecureTarFile:
         self._decrypt: Optional[CipherContext] = None
         self._encrypt: Optional[CipherContext] = None

+        self._buffer: bytes = bytes()
+
     def __enter__(self) -> tarfile.TarFile:
         """Start context manager tarfile."""
         if not self._key:
@@ -96,20 +98,33 @@ class SecureTarFile:

     def __exit__(self, exc_type, exc_value, traceback) -> None:
         """Close file."""
+
         if self._tar:
             self._tar.close()
             self._tar = None
+
         if self._file:
+            if not self._mode.startswith("r"):
+                padder = padding.PKCS7(BLOCK_SIZE_BITS).padder()
+                padded = padder.update(self._buffer) + padder.finalize()
+                os.write(self._file, self._encrypt.update(padded))
+
             os.close(self._file)
             self._file = None

     def write(self, data: bytes) -> None:
         """Write data."""
-        if len(data) % BLOCK_SIZE != 0:
-            padder = padding.PKCS7(BLOCK_SIZE_BITS).padder()
-            data = padder.update(data) + padder.finalize()
-
-        os.write(self._file, self._encrypt.update(data))
+        self._buffer += data
+        if len(self._buffer) < BLOCK_SIZE:
+            return
+
+        remainder = len(self._buffer) % BLOCK_SIZE
+        if remainder == 0:
+            os.write(self._file, self._encrypt.update(self._buffer))
+            self._buffer = bytes()
+        else:
+            os.write(self._file, self._encrypt.update(self._buffer[:-remainder]))
+            self._buffer = self._buffer[-remainder:]

     def read(self, size: int = 0) -> bytes:
         """Read data."""

I would also recommend adding tests. Then, you would see that this change creates backward-compatible files.

bdraco commented 9 months ago

Since it seems like you have a solution to this issue, it probably makes sense to open a PR at this point.

mxmlnkn commented 8 months ago

@bdraco I'm sorry, but I probably won't get around to doing it. I looked at this almost a year ago and the time to get into it again, fix it in the unknown source code base, add tests, convince the code owner, and other things, takes too much time. I was only looking at this project because there was a single request to support the file format from ratarmount. If there is more demand, I might look into this again, though.

bdraco commented 8 months ago

No worries, we can leave this open and if you have a chance, you can always come back to it later.

pvizeli / securetar

Question about padding and the file format #19