peterjc / backports.lzma

Backport of Python 3.3's standard library module lzma for LZMA/XZ compressed files
BSD 3-Clause "New" or "Revised" License

lzma sometimes fails to decompress a file #6

Closed vnummela closed 5 years ago

vnummela commented 10 years ago

backports.lzma fails to decompress some files, even though a direct call to XZ Utils will process the same files without complaint.

System details: OS X 10.9.3, Python 2.7.7 via Anaconda 2.0.0, backports.lzma 0.0.2. (I just noticed there is a newer version, but couldn't be bothered with the installation if it only adds unicode support.)

Unfortunately I don't have Python 3 installed, so I cannot tell whether this is an issue with the backport or with the original module. I'm filing the issue here to get started somewhere.

Example data files:
This one fails: https://dl.dropboxusercontent.com/u/90169773/lzma_issue/22h_ticks_bad.bi5
This one is processed without errors: https://dl.dropboxusercontent.com/u/90169773/lzma_issue/23h_ticks_good.bi5

Attempting to decompress the 'bad' file raises the following error: EOFError: Compressed file ended before the end-of-stream marker was reached.

The example files contain tick data and have been downloaded from the Dukascopy(.com) bank's historical economic data service. This error is relatively rare: typically there are 1-5 failures per 1000 data files. So far, all of them have been recoverable with XZ Utils. I have not detected any obvious pattern in which files fail and which don't.

Here's a snippet of my program, including the xz workaround:

from __future__ import print_function, division, absolute_import, unicode_literals
import os.path as op
import subprocess
import backports.lzma as lz


class whatever():

    @staticmethod
    def decompress(path):
        # Got data at all?
        if not op.isfile(path):
            raise Exception('Cannot find the .bi5 file!')
        try:
            # Regular decompression, ok most of the time
            with lz.open(path.encode('utf-8')) as f:
                filecontent = f.read()
        except EOFError:
            # Back-up decompression via calling XZ Utils
            xz = '/usr/local/bin/xz -d --format=lzma --keep --suffix=.bi5 ' + '"' + path + '"'
            print('LZMA failed, falling back to XZ Utils for', path)
            subprocess.call(xz, shell=True)
            # The file should now be decompressed on disk. Strip the .bi5 suffix and read it.
            with open(path[:-4], mode='rb') as f:
                filecontent = f.read()
        return filecontent
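
For reference, the same fallback could also be written without shell=True by passing the arguments as a list, which avoids the manual quoting of the path. A minimal sketch, assuming the same xz location:

import subprocess

def xz_fallback(path):
    # Ask XZ Utils to decompress the legacy-LZMA .bi5 file on disk,
    # keeping the original; the output file has the .bi5 suffix stripped.
    subprocess.check_call(['/usr/local/bin/xz', '-d', '--format=lzma',
                           '--keep', '--suffix=.bi5', path])
    with open(path[:-4], mode='rb') as f:
        return f.read()
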
peterjc commented 10 years ago

Thanks for the sample files. The good news is I can recreate the error:

$ python
Python 2.7.5 (default, Mar  9 2014, 22:15:05) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import backports.lzma as lz
>>> path = "22h_ticks_bad.bi5"
>>> with lz.open(path.encode('utf-8')) as f: filecontent = f.read()
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/backports/lzma/__init__.py", line 287, in read
    return self._read_all()
  File "/Library/Python/2.7/site-packages/backports/lzma/__init__.py", line 236, in _read_all
    while self._fill_buffer():
  File "/Library/Python/2.7/site-packages/backports/lzma/__init__.py", line 223, in _fill_buffer
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

And from my point of view, more good news - the same happens with the standard library shipped with Python:

$ python3.4
Python 3.4.0a4 (default, Nov  4 2013, 14:58:04) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lzma as lz
>>> path = "22h_ticks_bad.bi5"
>>> with lz.open(path.encode('utf-8')) as f: filecontent = f.read()
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pjcock/lib/python3.4/lzma.py", line 303, in read
    return self._read_all()
  File "/Users/pjcock/lib/python3.4/lzma.py", line 244, in _read_all
    while self._fill_buffer():
  File "/Users/pjcock/lib/python3.4/lzma.py", line 225, in _fill_buffer
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

i.e. this isn't a problem specific to the backport.

While this could be a bug in Python's lzma code, from the error message I suspect the problem files are incomplete and have been truncated (partial downloads). I would guess the Python code is simply being stricter than XZ Utils, which may just decompress all that it can. If I am right, then you should find re-downloading the files fixes things. However, it could also be a problem at the file-creation end.
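
If that guess is right, one way to mimic xz's more forgiving behaviour from Python is to push the raw bytes through an LZMADecompressor and keep whatever comes out before the data runs out. A rough sketch (not part of the backport's API, and untested on the sample files):

import backports.lzma as lz  # or simply "import lzma as lz" on Python 3

def decompress_partial(path, chunk_size=8192):
    # Decompress as much as possible, even if the end-of-stream marker is missing.
    decomp = lz.LZMADecompressor()  # FORMAT_AUTO handles both .xz and legacy .lzma
    pieces = []
    with open(path, 'rb') as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            pieces.append(decomp.decompress(chunk))
            if decomp.eof:
                break
    return b''.join(pieces)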

If you want to file a bug report with Python.org, could you post a link to it here please?

vnummela commented 10 years ago

Hi Peter,

Thanks for confirming the problem. You raise a good point about the integrity of the original files. I did some further testing, but could not find anything to suggest that the data files have been corrupted.

I will submit a bug report with Python.org and cross-link. Thanks for your help!

peterjc commented 10 years ago

Hmm. Using md5sum or even diff would be a quick way to check the re-downloaded files are identical.
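
For example, a quick comparison could also be done from Python; a small sketch (the second file name is just a placeholder for the re-downloaded copy):

import hashlib

def md5sum(path):
    # Hash the whole file in one go; these .bi5 files are only a few KB.
    with open(path, 'rb') as handle:
        return hashlib.md5(handle.read()).hexdigest()

print(md5sum('22h_ticks_bad.bi5') == md5sum('22h_ticks_bad_redownload.bi5'))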

This suggests that if there is a truncation problem, it isn't during the download but on the server side. Perhaps there is something more subtle happening here? The details of the XZ format are not so fresh in my mind as when I wrote http://blastedbio.blogspot.co.uk/2013/04/random-access-to-blocked-xz-format-bxzf.html - and I don't have time right now to dig into this :(

vnummela commented 10 years ago

I tried md5 and got identical hashes.

Nice blog post, but unfortunately the compression format is the older LZMA, not XZ.
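
For what it's worth, the legacy container can also be requested explicitly when opening, rather than relying on auto-detection; a small sketch using one of the sample files:

import backports.lzma as lz

# The .bi5 files use the legacy .lzma ("alone") container, not .xz,
# so the format can be stated explicitly.
with lz.open('23h_ticks_good.bi5', format=lz.FORMAT_ALONE) as f:
    data = f.read()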

In general, Dukascopy is universally hailed for the quality of their tick data. Loads of people use it every day, applications have been built around it, many analysis platforms at least support it. I doubt a file integrity issue would have gone unnoticed (if it originates from them). If it is something subtle, it would need to be something REALLY subtle.

vnummela commented 10 years ago

http://bugs.python.org/issue21872

kenorb commented 9 years ago

Same here. See the following Travis build, which fails when I'm trying to merge FX31337/FX-BT-Scripts/pull/15.

This happens on this file each time.

It fails with Python 3.5 (with small buffers like 128, 255, 1023, etc.), but it seems to work in Python 3.4 with lzma._BUFFER_SIZE = 1023.
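
A minimal sketch of that workaround, assuming Python 3.3/3.4 where the lzma module still exposes the private _BUFFER_SIZE constant (not a public API, and it may not exist in later versions):

import lzma

# Private read-buffer size used by LZMAFile in 3.3/3.4; shrinking it is
# reported above to avoid the EOFError on some of these files.
lzma._BUFFER_SIZE = 1023

with lzma.open('22h_ticks_bad.bi5') as f:
    data = f.read()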

giuse88 commented 8 years ago

Check this out:

http://stackoverflow.com/questions/37400583/python-lzma-compressed-data-ended-before-the-end-of-stream-marker-was-reached/37400585#37400585

@kenorb

peterjc commented 6 years ago

Sadly there does not seem to be any active work happening on the upstream issue http://bugs.python.org/issue21872

Assuming that does get fixed, we can apply the fix to the backport here.

peterjc commented 5 years ago

This looks to have been fixed now in the standard library: https://github.com/python/cpython/pull/14048

peterjc commented 5 years ago

#40 merged, and released as v0.0.14 - thank you @animalize 👍