lmmx / range-streams

Streaming range requests in Python
https://range-streams.readthedocs.io/en/latest/
MIT License
8 stars 0 forks source link

codecs.tar.TarStream: `ValueError: invalid literal for int() with base 8: b'00000057115\x00'` #41

Closed generalmimon closed 1 year ago

generalmimon commented 1 year ago

On Windows:

$ python -c 'import platform; print(platform.python_version()); print(platform.uname())'
3.10.10
uname_result(system='Windows', node='DESKTOP-89OPGF3', release='10', version='10.0.19044', machine='AMD64')

$ python -m pip show range-streams
Name: range-streams
Version: 1.2.11
Summary: Streaming via range requests.
Home-page: https://github.com/lmmx/range-streams
Author: Louis Maddox
Author-email: louismmx@gmail.com
License: MIT
Location: c:\users\pp\appdata\local\programs\python\python310\lib\site-packages
Requires: aiostream, httpx, python-ranges, pyzstd, tqdm
Required-by:

$ python -c 'from range_streams.codecs import TarStream; tar_stream = TarStream(url="https://cdn.watchguard.com/SoftwareCenter/Files/WSM/2_2_1/watchguard-dimension_2_2_1.ova")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\pp\AppData\Local\Programs\Python\Python310\lib\site-packages\range_streams\codecs\tar\stream.py", line 114, in __init__
    self.check_header_recs()
  File "C:\Users\pp\AppData\Local\Programs\Python\Python310\lib\site-packages\range_streams\codecs\tar\stream.py", line 136, in check_header_recs
    file_size = self.read_file_size(start_pos_offset=scan_tell)
  File "C:\Users\pp\AppData\Local\Programs\Python\Python310\lib\site-packages\range_streams\codecs\tar\stream.py", line 187, in read_file_size
    file_size = int(file_size_b, 8)  # convert octal number from bitstring
ValueError: invalid literal for int() with base 8: b'00000057115\x00'

Same in WSL:

pp@DESKTOP-89OPGF3:~$ python3 -c 'import platform; print(platform.python_version()); print(platform.uname())'
3.10.6
uname_result(system='Linux', node='DESKTOP-89OPGF3', release='5.15.90.1-microsoft-standard-WSL2', version='#1 SMP Fri Jan 27 02:56:13 UTC 2023', machine='x86_64')
pp@DESKTOP-89OPGF3:~$ python3 -m pip show range-streams
Name: range-streams
Version: 1.2.11
Summary: Streaming via range requests.
Home-page: https://github.com/lmmx/range-streams
Author: Louis Maddox
Author-email: louismmx@gmail.com
License: MIT
Location: /home/pp/.local/lib/python3.10/site-packages
Requires: aiostream, httpx, python-ranges, pyzstd, tqdm
Required-by:
pp@DESKTOP-89OPGF3:~$ python3 -c 'from range_streams.codecs import TarStream; tar_stream = TarStream(url="https://cdn.watchguard.com/SoftwareCenter/Files/WSM/2_2_1/watchguard-dimension_2_2_1.ova")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/pp/.local/lib/python3.10/site-packages/range_streams/codecs/tar/stream.py", line 114, in __init__
    self.check_header_recs()
  File "/home/pp/.local/lib/python3.10/site-packages/range_streams/codecs/tar/stream.py", line 136, in check_header_recs
    file_size = self.read_file_size(start_pos_offset=scan_tell)
  File "/home/pp/.local/lib/python3.10/site-packages/range_streams/codecs/tar/stream.py", line 187, in read_file_size
    file_size = int(file_size_b, 8)  # convert octal number from bitstring
ValueError: invalid literal for int() with base 8: b'00000057115\x00'
lmmx commented 1 year ago

Good catch, yes this reproduces

>>> from range_streams.codecs import TarStream
>>> url = "https://cdn.watchguard.com/SoftwareCenter/Files/WSM/2_2_1/watchguard-dimension_2_2_1.ova"
>>> tar_stream = TarStream(url=url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/louis/dev/range-streams/src/range_streams/codecs/tar/stream.py", line 114, in __init__
    self.check_header_recs()
  File "/home/louis/dev/range-streams/src/range_streams/codecs/tar/stream.py", line 136, in check_header_recs
    file_size = self.read_file_size(start_pos_offset=scan_tell)
  File "/home/louis/dev/range-streams/src/range_streams/codecs/tar/stream.py", line 187, in read_file_size
    file_size = int(file_size_b, 8)  # convert octal number from bitstring
ValueError: invalid literal for int() with base 8: b'00000057115\x00'
>>> 

The test case is:

from range_streams.codecs import TarStream

data_dir_URL = "https://github.com/lmmx/range-streams/raw/master/data/"
EXAMPLE_TAR_URL = f"{data_dir_URL}data.tar"
tar_stream = TarStream(url=EXAMPLE_TAR_URL)

I also found another tarball which reproduces the error:

tar_stream = TarStream(url="https://storage.googleapis.com/kubernetes-release/gci-mounter/mounter.tar")

Breakpointing there in each case shows that in the working example case the file_size_b is:

>>> from range_streams.codecs import TarStream
>>> tar_stream = TarStream(url="https://github.com/lmmx/range-streams/raw/master/data/data.tar")
> /home/louis/dev/range-streams/src/range_streams/codecs/tar/stream.py(193)read_file_size()
-> return file_size
(Pdb) p file_size
5124
(Pdb) p file_size_b
b'00000012004 '
(Pdb) p len(file_size_b)
12

Whereas in both of the failure cases its value is 0

Your tarball:

>>> tar_stream = TarStream(url="https://cdn.watchguard.com/SoftwareCenter/Files/WSM/2_2_1/watchguard-dimension_2_2_1.ova")
> /home/louis/dev/range-streams/src/range_streams/codecs/tar/stream.py(193)read_file_size()
-> return file_size
(Pdb) p file_size_b
b'00000057115\x00'
(Pdb) p int(file_size_b.rstrip(b"\x00"), 8)
24141
(Pdb) p start_pos_offset
0

Kubernetes tarball:

>>> tar_stream = TarStream(url="https://storage.googleapis.com/kubernetes-release/gci-mounter/mounter.tar")
> /home/louis/dev/range-streams/src/range_streams/codecs/tar/stream.py(193)read_file_size()
-> return file_size
(Pdb) p file_size_b
b'00000000000\x00'
(Pdb) p file_size_b.rstrip(b"\x00")
b'00000000000'
(Pdb) p int(file_size_b.rstrip(b"\x00"), 8)
0
(Pdb) p start_pos_offset
0

Now the file size offset is 0, so the range is being taken at position 124 for 12 bytes.

(Pdb) p file_size_rng
Range[124, 136)

as confirmed from Wikipedia: https://en.wikipedia.org/wiki/Tar_(computing)#Header

Field offset    Field size  Field
0   100     File name
100     8   File mode (octal)
108     8   Owner's numeric user ID (octal)
116     8   Group's numeric user ID (octal)
124     12  File size in bytes (octal)
136     12  Last modification time in numeric Unix time format (octal)
148     8   Checksum for header record
156     1   Link indicator (file type)
157     100     Name of linked file 

I'm a bit stumped by this! I began downloading the Kubernetes mounter.tar tarball and confirmed it is indeed not 0 bytes.

lmmx commented 1 year ago

One possible answer (though not sure it is correct) is given in the tar man pages:

The size field is the size of the file in bytes; linked files are archived with this field specified as zero.

So a size field of value 0 may indicate "linked files" in the Kubernetes tarball case. This doesn't fit the description of the case you reported though.

I asked ChatGPT (no luck) and GPT-4 gave a futile suggestion. Then I asked it to try again and it replied that one option was:

Encodings: Ensure that the TarStream class handles different character encodings correctly when reading the file size and other header fields. For example, POSIX tar format uses the ASCII encoding, and the file size field is stored as a null-terminated octal string.

This matches the description, and implies rstripping the file_size_b is indeed the correct approach.

From that, we might expect the fix to be simply adding this rstrip(b"\x00") call to the line

https://github.com/lmmx/range-streams/blob/4d7385e0f71c57486e990af5ae9b94e8ed43c626/src/range_streams/codecs/tar/stream.py#L187

        try:
            file_size = int(file_size_b, 8)  # convert octal number from bitstring
        except ValueError:
            file_size = int(file_size_b.rstrip(b"\x00"), 8)  # may be null-terminated

and with that, it works :tada:

>>> from range_streams.codecs import TarStream
>>> url = "https://cdn.watchguard.com/SoftwareCenter/Files/WSM/2_2_1/watchguard-dimension_2_2_1.ova"
>>> tar_stream = TarStream(url=url)
>>> tar_stream
TarStream ⠶ [0, 24141), [25088, 25438), [26112, 128512), [129024, 960447488), [960448000, 960520704) @@ 'watchguard-dimension_2_2_1.ova' from cdn.watchguard.com

I'll ship the fix now, thanks for reporting.

lmmx commented 1 year ago

:ship: 571433a

lmmx commented 1 year ago

Released in v1.3.0

generalmimon commented 1 year ago

@lmmx:

        try:
            file_size = int(file_size_b, 8)  # convert octal number from bitstring
        except ValueError:
            file_size = int(file_size_b.rstrip(b"\x00"), 8)  # may be null-terminated

and with that, it works 🎉

Yep, should work for the vast majority of cases. You could do file_size = int(file_size_b.rstrip(b"\x00"), 8) directly, because rstrip will just return the string unchanged if it doesn't end with \x00, and int(..., 8) will always fail if it contains unknown digits for the given base (in this case 8), but that's a detail.

The only potential discrepancy I can see is that rstrip doesn't really handle "null-terminated" strings in the sense of the C language semantics (I guess this also applies to filenames, for example), which the format authors most likely had in mind. In C, a character string is represented just as a pointer to the beginning and its length is only determined by scanning it byte by byte from the start until the NUL byte ('\0') is found. But rstrip doesn't do that - it only strips the NUL bytes from the end of the string, but some zero bytes may still remain somewhere in the middle of the string, "protected" by other bytes from being reached when scanning from the end. So for such archives, you'd be probably getting different values than other applications working with the tar format.

lmmx commented 1 year ago

Oh! I didn't think of that. So I suppose file_size_b[:file_size_b.index(b"\x00")] would work then?

generalmimon commented 1 year ago

So I suppose file_size_b[:file_size_b.index(b"\x00")] would work then?

Yes. But bytes.index raises ValueError if the searched sequence is not found, so now it really makes sense to keep the try..except statement and apply the null termination only if int() on the original file_size_b fails.

And as I mentioned, this should be probably used for other fields (like file names) too.