Open c86c5cee-fab9-4e48-a7b7-5396d6b1d4e5 opened 3 years ago
Hello,
The lzma module does not work well with XZ stream padding. Depending on the case, decoding may work; it may stop the stream prematurely without an error; it may raise an error; or it may fail to raise an error when it should.
In the XZ file format, stream padding is a run of null bytes (its length a multiple of 4) that may appear between streams and after the last stream.
From the specification (section 2.2):
Only the decoders that support decoding of concatenated Streams MUST support Stream Padding.
Since the lzma module supports decoding of concatenated streams, it must support stream padding as well.
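To make the definition concrete, here is a minimal sketch that builds XZ data with stream padding in memory. The helper name `make_padded_xz` and its parameters are hypothetical, and the bytes it produces are not claimed to match the files attached to this issue.

```python
import lzma


def make_padded_xz(payloads, between=4, trailing=0):
    """Concatenate one XZ stream per payload, separated by null-byte padding.

    Hypothetical helper for building test inputs. `between` is the padding
    length between streams, `trailing` the padding after the last stream.
    """
    if between % 4 or trailing % 4:
        # Stream padding must be a multiple of 4 bytes to be valid.
        raise ValueError("stream padding must be a multiple of 4 bytes")
    streams = [lzma.compress(p, format=lzma.FORMAT_XZ) for p in payloads]
    return (b"\x00" * between).join(streams) + b"\x00" * trailing


data = make_padded_xz([b"Hi!\n", b"Second stream\n"], between=4, trailing=8)
```

Writing `data` to a file and opening it with `lzma.open` is one way to probe the behavior described in this report; the exact outcome depends on the padding layout and on the `format` argument.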
#### Examples to reproduce the issue:
>>> with lzma.open('/example1.xz', format=lzma.FORMAT_AUTO) as f:
...     f.read()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.9/lzma.py", line 200, in read
    return self._buffer.read(size)
  File "/usr/lib/python3.9/_compression.py", line 99, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
>>> with lzma.open('/example1.xz', format=lzma.FORMAT_XZ) as f:
...     f.read()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.9/lzma.py", line 200, in read
    return self._buffer.read(size)
  File "/usr/lib/python3.9/_compression.py", line 99, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
>>> with lzma.open('/tmp/example2.xz', format=lzma.FORMAT_AUTO) as f:
...     f.read()
...
b'Hi!\nSecond stream\n'
>>> with lzma.open('/tmp/example2.xz', format=lzma.FORMAT_XZ) as f:
...     f.read()
...
b'Hi!\n'
#### Analysis
This issue comes from the interaction between _lzma and _compression. In _lzma, the C library is called without the LZMA_CONCATENATED flag, which means that multiple streams and stream padding must be handled on the Python side.
In _compression, when an LZMADecompressor is done (.eof is True), another one is created to decompress from that point. If the new one fails to decompress the remaining data, the LZMAError is ignored and we assume we have reached the end.
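The logic described above can be approximated with the following simplified, in-memory sketch. It is not the actual `_compression` code (which reads from a file object in blocks); the function name `decompress_like_reader` is hypothetical and only mirrors the control flow described in this analysis.

```python
import lzma


def decompress_like_reader(data, fmt=lzma.FORMAT_AUTO):
    """Simplified model of the multi-stream handling described above."""
    results = []
    first = True
    while data:
        # A fresh decompressor is created for every stream.
        decomp = lzma.LZMADecompressor(format=fmt)
        try:
            chunk = decomp.decompress(data)
        except lzma.LZMAError:
            if first:
                raise  # the very first stream is invalid: report it
            break  # leftover data is not a valid stream: silently ignored
        results.append(chunk)
        if not decomp.eof:
            # Input ran out before the decoder saw the end of a stream.
            raise EOFError("Compressed file ended before the "
                           "end-of-stream marker was reached")
        data = decomp.unused_data  # whatever follows the finished stream
        first = False
    return b"".join(results)


two_streams = lzma.compress(b"Hi!\n") + lzma.compress(b"Second stream\n")
print(decompress_like_reader(two_streams))
```

In this model, null bytes that follow a stream are handed to a brand-new decompressor as if they were the start of another stream, so what happens next depends on how the decoder for the selected `format` reacts to them.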
So the behavior seen above can be explained as follows:
#### Possible solution
A possible solution would be to add a finish method to the decompressor interface, and to call it from _compression when we reach EOF on the input. Then, in the LZMADecompressor implementation, use the LZMA_CONCATENATED flag, and implement the finish method so that it calls lzma_code with LZMA_FINISH as the action.
I think this would be preferable to trying to solve the issue in Python, because if the format is FORMAT_AUTO we don't know whether the format is XZ (in which case we should support stream padding) or not.
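For comparison, here is a rough sketch of what handling stream padding on the Python side could look like when the format is known to be XZ. The function name `decompress_xz_with_padding` is hypothetical, and this is not the fix proposed above; it only illustrates why the approach needs to know the format up front.

```python
import lzma


def decompress_xz_with_padding(data):
    """Decode concatenated XZ streams, skipping valid stream padding.

    Hypothetical helper: assumes the input is XZ, so it cannot serve
    FORMAT_AUTO, where the container format is not known in advance.
    """
    out = []
    while data:
        decomp = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
        out.append(decomp.decompress(data))
        if not decomp.eof:
            raise EOFError("XZ stream ended before the end-of-stream marker")
        rest = decomp.unused_data
        stripped = rest.lstrip(b"\x00")  # drop null bytes after the stream
        if (len(rest) - len(stripped)) % 4:
            raise lzma.LZMAError("stream padding is not a multiple of 4 bytes")
        data = stripped
    return b"".join(out)


s1 = lzma.compress(b"Hi!\n", format=lzma.FORMAT_XZ)
s2 = lzma.compress(b"Second stream\n", format=lzma.FORMAT_XZ)
print(decompress_xz_with_padding(s1 + b"\x00" * 4 + s2 + b"\x00" * 8))
```

Unlike the current behavior, this sketch is strict: trailing data that is neither padding nor a valid XZ stream raises an error instead of being ignored.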
It must be decided what to do in the following cases, which are not valid per the XZ file specification, but supported by the lzma module (and tested):
The answer may be different depending on the format arg (e.g. FORMAT_AUTO vs FORMAT_XZ).
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-bug', '3.8', '3.9', '3.10', '3.11', '3.7', 'library']
title = 'lzma: stream padding in xz files'
updated_at =
user = 'https://github.com/rogdham'
```
bugs.python.org fields:
```python
activity =
actor = 'rogdham'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'rogdham'
dependencies = []
files = ['50044', '50045']
hgrepos = []
issue_num = 44134
keywords = []
message_count = 2.0
messages = ['393681', '393738']
nosy_count = 3.0
nosy_names = ['nadeem.vawda', 'malin', 'rogdham']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue44134'
versions = ['Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9', 'Python 3.10', 'Python 3.11']
```