Closed generalmimon closed 1 year ago
Good catch, yes this reproduces
>>> from range_streams.codecs import TarStream
>>> url = "https://cdn.watchguard.com/SoftwareCenter/Files/WSM/2_2_1/watchguard-dimension_2_2_1.ova"
>>> tar_stream = TarStream(url=url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/louis/dev/range-streams/src/range_streams/codecs/tar/stream.py", line 114, in __init__
    self.check_header_recs()
  File "/home/louis/dev/range-streams/src/range_streams/codecs/tar/stream.py", line 136, in check_header_recs
    file_size = self.read_file_size(start_pos_offset=scan_tell)
  File "/home/louis/dev/range-streams/src/range_streams/codecs/tar/stream.py", line 187, in read_file_size
    file_size = int(file_size_b, 8) # convert octal number from bitstring
ValueError: invalid literal for int() with base 8: b'00000057115\x00'
>>>
The test case is:
from range_streams.codecs import TarStream
data_dir_URL = "https://github.com/lmmx/range-streams/raw/master/data/"
EXAMPLE_TAR_URL = f"{data_dir_URL}data.tar"
tar_stream = TarStream(url=EXAMPLE_TAR_URL)
I also found another tarball which reproduces the error:
tar_stream = TarStream(url="https://storage.googleapis.com/kubernetes-release/gci-mounter/mounter.tar")
Breakpointing there in each case shows that in the working example case the file_size_b is:
>>> from range_streams.codecs import TarStream
>>> tar_stream = TarStream(url="https://github.com/lmmx/range-streams/raw/master/data/data.tar")
> /home/louis/dev/range-streams/src/range_streams/codecs/tar/stream.py(193)read_file_size()
-> return file_size
(Pdb) p file_size
5124
(Pdb) p file_size_b
b'00000012004 '
(Pdb) p len(file_size_b)
12
Whereas in both of the failure cases it ends with a \x00 byte instead of a space (and start_pos_offset is 0):
Your tarball:
>>> tar_stream = TarStream(url="https://cdn.watchguard.com/SoftwareCenter/Files/WSM/2_2_1/watchguard-dimension_2_2_1.ova")
> /home/louis/dev/range-streams/src/range_streams/codecs/tar/stream.py(193)read_file_size()
-> return file_size
(Pdb) p file_size_b
b'00000057115\x00'
(Pdb) p int(file_size_b.rstrip(b"\x00"), 8)
24141
(Pdb) p start_pos_offset
0
Kubernetes tarball:
>>> tar_stream = TarStream(url="https://storage.googleapis.com/kubernetes-release/gci-mounter/mounter.tar")
> /home/louis/dev/range-streams/src/range_streams/codecs/tar/stream.py(193)read_file_size()
-> return file_size
(Pdb) p file_size_b
b'00000000000\x00'
(Pdb) p file_size_b.rstrip(b"\x00")
b'00000000000'
(Pdb) p int(file_size_b.rstrip(b"\x00"), 8)
0
(Pdb) p start_pos_offset
0
Now start_pos_offset is 0, so the file size range is being taken at position 124 for 12 bytes:
(Pdb) p file_size_rng
Range[124, 136)
as confirmed from Wikipedia: https://en.wikipedia.org/wiki/Tar_(computing)#Header
Field offset  Field size  Field
0             100         File name
100           8           File mode (octal)
108           8           Owner's numeric user ID (octal)
116           8           Group's numeric user ID (octal)
124           12          File size in bytes (octal)
136           12          Last modification time in numeric Unix time format (octal)
148           8           Checksum for header record
156           1           Link indicator (file type)
157           100         Name of linked file
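As a quick sanity check of the offsets in the table, here is a minimal sketch that builds a small tar archive in memory with the standard-library tarfile module and slices the size field out of the first 512-byte header. The member name and payload are made up for illustration, and none of this is range-streams code.

```python
import io
import tarfile

# Build a one-member archive in memory (name and payload are illustrative)
payload = b"x" * 5124
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
    info = tarfile.TarInfo(name="example.bin")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))

header = buf.getvalue()[:512]   # the first header record
name_field = header[0:100]      # offset 0, length 100
size_field = header[124:136]    # offset 124, length 12, per the table

# Keep only what precedes the first NUL (if any), then parse as octal
file_size = int(size_field.split(b"\x00", 1)[0], 8)
print(file_size)
```

The slice bounds match the Range[124, 136) seen in the debugger above.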
I'm a bit stumped by this! I began downloading the Kubernetes mounter.tar tarball and confirmed it is indeed not 0 bytes.
One possible answer (though not sure it is correct) is given in the tar man pages:
> The size field is the size of the file in bytes; linked files are archived with this field specified as zero.
So a size field of value 0 may indicate "linked files" in the Kubernetes tarball case. This doesn't fit the description of the case you reported though.
I asked ChatGPT (no luck) and GPT-4 gave a futile suggestion. Then I asked it to try again and it replied that one option was:
Encodings: Ensure that the TarStream class handles different character encodings correctly when reading the file size and other header fields. For example, POSIX tar format uses the ASCII encoding, and the file size field is stored as a null-terminated octal string.
This matches the description, and implies rstripping the file_size_b is indeed the correct approach.
From that, we might expect the fix to be simply adding this rstrip(b"\x00") call to the line
try:
    file_size = int(file_size_b, 8) # convert octal number from bitstring
except ValueError:
    file_size = int(file_size_b.rstrip(b"\x00"), 8) # may be null-terminated
and with that, it works :tada:
>>> from range_streams.codecs import TarStream
>>> url = "https://cdn.watchguard.com/SoftwareCenter/Files/WSM/2_2_1/watchguard-dimension_2_2_1.ova"
>>> tar_stream = TarStream(url=url)
>>> tar_stream
TarStream ⠶ [0, 24141), [25088, 25438), [26112, 128512), [129024, 960447488), [960448000, 960520704) @@ 'watchguard-dimension_2_2_1.ova' from cdn.watchguard.com
I'll ship the fix now, thanks for reporting.
:ship: 571433a
@lmmx:
try:
    file_size = int(file_size_b, 8) # convert octal number from bitstring
except ValueError:
    file_size = int(file_size_b.rstrip(b"\x00"), 8) # may be null-terminated
and with that, it works 🎉
Yep, should work for the vast majority of cases. You could do file_size = int(file_size_b.rstrip(b"\x00"), 8) directly, because rstrip will just return the string unchanged if it doesn't end with \x00, and int(..., 8) will always fail if it contains unknown digits for the given base (in this case 8), but that's a detail.
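To illustrate that point, here is a small sketch using the two size fields quoted earlier in this thread. The variable names are just for the example.

```python
space_terminated = b"00000012004 "    # the working data.tar case
nul_terminated = b"00000057115\x00"   # the failing .ova case

# int() ignores surrounding ASCII whitespace, so the first form already parsed
assert int(space_terminated, 8) == 5124

# rstrip returns the bytes unchanged when there is no trailing NUL ...
assert space_terminated.rstrip(b"\x00") == space_terminated
# ... and removes the NUL when there is one
assert int(nul_terminated.rstrip(b"\x00"), 8) == 24141

# A digit outside the base is still rejected
rejected = False
try:
    int(b"00000000008", 8)
except ValueError:
    rejected = True
```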
The only potential discrepancy I can see is that rstrip doesn't really handle "null-terminated" strings in the sense of the C language semantics (I guess this also applies to filenames, for example), which the format authors most likely had in mind. In C, a character string is represented just as a pointer to the beginning and its length is only determined by scanning it byte by byte from the start until the NUL byte ('\0') is found. But rstrip doesn't do that - it only strips the NUL bytes from the end of the string, but some zero bytes may still remain somewhere in the middle of the string, "protected" by other bytes from being reached when scanning from the end. So for such archives, you'd probably be getting different values than other applications working with the tar format.
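A small sketch of that difference, using a made-up name field with leftover bytes after the first NUL (the field contents are hypothetical, not taken from either tarball):

```python
# Hypothetical name field: real name, a NUL, stale bytes, then NUL padding
name_field = b"foo.txt\x00junk\x00\x00\x00"

# rstrip only removes the trailing NULs; the embedded NUL and "junk" survive
stripped = name_field.rstrip(b"\x00")

# C-style reading stops at the first NUL, scanning from the start
c_style = name_field.split(b"\x00", 1)[0]

print(stripped)  # b'foo.txt\x00junk'
print(c_style)   # b'foo.txt'
```

For a numeric field, an embedded NUL left behind by rstrip would make int() raise ValueError rather than silently return a different number.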
Oh! I didn't think of that. So I suppose file_size_b[:file_size_b.index(b"\x00")] would work then?
> So I suppose file_size_b[:file_size_b.index(b"\x00")] would work then?
Yes. But bytes.index raises ValueError if the searched sequence is not found, so now it really makes sense to keep the try..except statement and apply the null termination only if int() on the original file_size_b fails.
And as I mentioned, this should probably be used for other fields (like file names) too.
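Putting the suggestions from this thread together, a minimal sketch of such a parser might look like the following. The helper name parse_octal_field is hypothetical and this is not necessarily what was shipped in range-streams.

```python
def parse_octal_field(raw: bytes) -> int:
    """Parse a numeric tar header field (hypothetical helper, not library code)."""
    try:
        # Space-terminated or bare octal digits parse directly
        return int(raw, 8)
    except ValueError:
        # Treat the field as NUL-terminated: keep what precedes the first NUL.
        # bytes.index itself raises ValueError when there is no NUL at all,
        # so a genuinely malformed field still surfaces as a ValueError.
        return int(raw[: raw.index(b"\x00")], 8)


sizes = [
    parse_octal_field(b"00000012004 "),     # working data.tar case
    parse_octal_field(b"00000057115\x00"),  # .ova case
    parse_octal_field(b"00000000000\x00"),  # Kubernetes mounter.tar case
]
print(sizes)
```

Note that a field consisting only of NUL bytes would still raise ValueError here, since there are no digits before the first NUL; whether that should map to 0 is a separate design decision.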
On Windows:
Same in WSL: