Closed: vnummela closed this issue 5 years ago
Thanks for the sample files. The good news is I can recreate the error:
$ python
Python 2.7.5 (default, Mar 9 2014, 22:15:05)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import backports.lzma as lz
>>> path = "22h_ticks_bad.bi5"
>>> with lz.open(path.encode('utf-8')) as f: filecontent = f.read()
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/backports/lzma/__init__.py", line 287, in read
return self._read_all()
File "/Library/Python/2.7/site-packages/backports/lzma/__init__.py", line 236, in _read_all
while self._fill_buffer():
File "/Library/Python/2.7/site-packages/backports/lzma/__init__.py", line 223, in _fill_buffer
raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
And from my point of view, more good news - the same happens with the standard library shipped with Python:
$ python3.4
Python 3.4.0a4 (default, Nov 4 2013, 14:58:04)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lzma as lz
>>> path = "22h_ticks_bad.bi5"
>>> with lz.open(path.encode('utf-8')) as f: filecontent = f.read()
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/pjcock/lib/python3.4/lzma.py", line 303, in read
return self._read_all()
File "/Users/pjcock/lib/python3.4/lzma.py", line 244, in _read_all
while self._fill_buffer():
File "/Users/pjcock/lib/python3.4/lzma.py", line 225, in _fill_buffer
raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
i.e. this isn't a problem specific to the backport.
While this could be a bug in Python's lzma code, from the error message I suspect the problem files are incomplete and have been truncated (partial downloads). I would guess the Python code is simply being stricter than XZ Utils, which may just decompress all that it can. If I'm right, then re-downloading the files should fix things. However, it could also be a problem with how the files are created.
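That "decompress all that it can" behaviour can be approximated from Python. The EOFError is raised when the file runs out before the decoder has seen the end-of-stream marker, so reading in chunks and catching it keeps everything decoded so far. A sketch (function name and chunk size are mine; it uses the stdlib lzma for illustration, but backports.lzma exposes the same API):

```python
import lzma

def read_all_lenient(path, chunk_size=65536):
    """Read a possibly truncated LZMA/XZ file, keeping all decodable data."""
    chunks = []
    with lzma.open(path) as f:
        try:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break  # clean end of stream
                chunks.append(chunk)
        except EOFError:
            pass  # stream ended before the end-of-stream marker; keep what we got
    return b"".join(chunks)
```

On an intact file this returns the full payload; on a truncated one it returns a prefix instead of raising.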
If you want to file a bug report with Python.org could you post a link to it here please?
Hi Peter,
Thanks for confirming the problem. You raise a good point about the integrity of the original files. I did some further testing, but could not find anything to suggest that the data files would have been compromised:
I will submit a bug report with Python.org and cross-link. Thanks for your help!
Hmm. Using md5sum or even diff would be a quick way to check the re-downloaded files are identical.
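The same check can be scripted. A small sketch using Python's stdlib hashlib rather than the md5sum command (the helper name is mine):

```python
import hashlib

def md5sum(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Stream the file so large downloads need not fit in memory.
        for block in iter(lambda: f.read(1 << 16), b""):
            digest.update(block)
    return digest.hexdigest()
```

Identical digests for the original and the re-downloaded file would rule out a transfer problem.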
This suggests that if there is a truncation problem, it isn't happening during the download but on the server side. Perhaps there is something more subtle happening here? The details of the XZ format are not as fresh in my mind as when I wrote http://blastedbio.blogspot.co.uk/2013/04/random-access-to-blocked-xz-format-bxzf.html - and I don't have time right now to dig into this :(
I tried md5 and got identical hashes.
Nice blog post, but unfortunately the compression format is the older LZMA, not XZ.
In general, Dukascopy is universally hailed for the quality of their tick data. Loads of people use it every day, applications have been built around it, many analysis platforms at least support it. I doubt a file integrity issue would have gone unnoticed (if it originates from them). If it is something subtle, it would need to be something REALLY subtle.
Sadly there does not seem to be any active work happening on the upstream issue http://bugs.python.org/issue21872
Assuming that does get fixed, we can apply the fix to the backport here.
This looks to have been fixed now in the standard library: https://github.com/python/cpython/pull/14048
backports.lzma fails to decompress some files, even though a direct call to XZ Utils processes the same files without complaint.
System details: OS X 10.9.3, Python 2.7.7 via Anaconda 2.0.0, backports.lzma 0.0.2. (Just noticed there is a newer version, but couldn't be bothered with the installation if it is only for the unicode support.)
Unfortunately I don't have Py3 installed, so I cannot tell whether this is an issue with the backport or with the original module. I'm filing the issue here in order to get started somewhere.
Example data files:
This one fails: https://dl.dropboxusercontent.com/u/90169773/lzma_issue/22h_ticks_bad.bi5
This one is processed without errors: https://dl.dropboxusercontent.com/u/90169773/lzma_issue/23h_ticks_good.bi5
Attempting to decompress the 'bad' file raises the following error: EOFError: Compressed file ended before the end-of-stream marker was reached.
The example files contain tick data and were downloaded from the Dukascopy(.com) bank's historical economic data service. The error is relatively rare: typically there are 1-5 failures per 1000 data files. So far, all of them have been recoverable with XZ Utils. I have not detected any obvious pattern in which files fail and which don't.
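For what it's worth, a possible Python-only fallback (instead of shelling out to xz) is to bypass LZMAFile entirely: the EOFError in the traceback comes from the file wrapper's end-of-stream check, while a bare LZMADecompressor simply returns whatever it manages to decode. A sketch, using the stdlib lzma for illustration (backports.lzma exposes the same class; the function name is mine):

```python
import lzma

def decompress_bytes(path):
    """Decompress a whole file with a raw LZMADecompressor.

    Unlike lzma.open(path).read(), this does not insist on seeing an
    end-of-stream marker, so it returns whatever could be decoded.
    """
    with open(path, "rb") as f:
        data = f.read()
    return lzma.LZMADecompressor(format=lzma.FORMAT_AUTO).decompress(data)
```

Whether this recovers the full payload of the problem .bi5 files would need checking against the xz output.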
Here's a snippet of my program, including the xz workaround:
from __future__ import print_function, division, absolute_import, unicode_literals
import os.path as op
import backports.lzma as lz
import subprocess
class whatever():