Closed ZeroDesigner closed 3 months ago
Hi @ZeroDesigner, can you provide more details on the versions you have for ord-data and ord-schema? (This is working fine for me.) You should also make sure you're running python 3.10 or later.
OK, I tested on several versions.
1. python=3.7.16, ord-schema=0.3.25
ord-data$ git log
commit 51b1581a9ac9003576db04a4f1ecc2d9fb3e9097 (HEAD -> main, origin/main, origin/HEAD)
Author: Steven Kearnes <skearnes@relaytx.com>
Date: Fri Dec 16 09:45:50 2022 -0500
Bump ord-schema (#157)
* bump ord-schema version
* Bump ord-schema version
* bump ord-schema
* more splits
Co-authored-by: EC2 Default User <ec2-user@ip-172-30-4-170.ec2.internal>
Error:
pb_path_2 = 'ord-data/data/00/ord_dataset-00005539a1e04c809a9a78647bea649c.pb.gz'
data = message_helpers.load_message(pb_path_2, dataset_pb2.Dataset)
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-3-d4377973e480> in <module>
----> 1 data = message_helpers.load_message(pb_path_2, dataset_pb2.Dataset)
python3.7/site-packages/ord_schema/message_helpers.py in load_message(filename, message_type)
737 return text_format.Parse(f.read(), message_type())
738 if input_format == MessageFormat.BINARY:
--> 739 return message_type.FromString(f.read())
740 except (
741 json_format.ParseError,
python3.7/gzip.py in read(self, size)
285 import errno
286 raise OSError(errno.EBADF, "read() on write-only GzipFile object")
--> 287 return self._buffer.read(size)
288
289 def read1(self, size=-1):
python3.7/gzip.py in read(self, size)
472 # jump to the next member, if there is one.
473 self._init_read()
--> 474 if not self._read_gzip_header():
475 self._size = self._pos
476 return b""
python3.7/gzip.py in _read_gzip_header(self)
420
421 if magic != b'\037\213':
--> 422 raise OSError('Not a gzipped file (%r)' % magic)
423
424 (method, flag,
OSError: Not a gzipped file (b've')
2. python=3.10.11, ord-schema=0.3.53
ord-data$ git log
commit 51b1581a9ac9003576db04a4f1ecc2d9fb3e9097 (HEAD -> main, origin/main, origin/HEAD)
Author: Steven Kearnes <skearnes@relaytx.com>
Date: Fri Dec 16 09:45:50 2022 -0500
Bump ord-schema (#157)
* bump ord-schema version
* Bump ord-schema version
* bump ord-schema
* more splits
Co-authored-by: EC2 Default User <ec2-user@ip-172-30-4-170.ec2.internal>
Error:
In [2]: pb_path_2 = 'ord-data/data/00/ord_dataset-00005539a1e04c809a9a78647bea649c.pb.gz'
...:
In [3]: data = message_helpers.load_message(pb_path_2, dataset_pb2.Dataset)
...:
---------------------------------------------------------------------------
BadGzipFile Traceback (most recent call last)
Cell In[3], line 1
----> 1 data = message_helpers.load_message(pb_path_2, dataset_pb2.Dataset)
File python3.10/site-packages/ord_schema/message_helpers.py:763, in load_message(filename, message_type)
761 return text_format.Parse(f.read(), message_type())
762 if input_format == MessageFormat.BINARY:
--> 763 return message_type.FromString(f.read())
764 except (
765 json_format.ParseError,
766 protobuf.message.DecodeError,
767 text_format.ParseError,
768 ) as error:
769 raise ValueError(f"error parsing {filename}: {error}") from error
File python3.10/gzip.py:301, in GzipFile.read(self, size)
299 import errno
300 raise OSError(errno.EBADF, "read() on write-only GzipFile object")
--> 301 return self._buffer.read(size)
File /python3.10/_compression.py:118, in DecompressReader.readall(self)
114 chunks = []
115 # sys.maxsize means the max length of output buffer is unlimited,
116 # so that the whole input buffer can be decompressed within one
117 # .decompress() call.
--> 118 while data := self.read(sys.maxsize):
119 chunks.append(data)
121 return b"".join(chunks)
File python3.10/gzip.py:488, in _GzipReader.read(self, size)
484 if self._new_member:
485 # If the _new_member flag is set, we have to
486 # jump to the next member, if there is one.
487 self._init_read()
--> 488 if not self._read_gzip_header():
489 self._size = self._pos
490 return b""
File python3.10/gzip.py:436, in _GzipReader._read_gzip_header(self)
433 return False
435 if magic != b'\037\213':
--> 436 raise BadGzipFile('Not a gzipped file (%r)' % magic)
438 (method, flag,
439 self._last_mtime) = struct.unpack("<BBIxx", self._read_exact(8))
440 if method != 8:
BadGzipFile: Not a gzipped file (b've')
I'm still not able to reproduce this. What happens if you gunzip the file and then try to read it in?
Closing due to inactivity.
Hey, I had the same problem, and it seems to be a Linux problem. I ran the code below on Linux Mint and on Windows, and it only runs on Windows. WSL also works fine for me. Running gzip directly in the terminal also gives an error on Linux.
import gzip
from pathlib import Path
data_path = (
Path("ord-data")
/ "data"
/ "00"
/ "ord_dataset-00005539a1e04c809a9a78647bea649c.pb.gz"
)
with gzip.open(data_path, mode="rb") as f:
    for line in f:
        print(line)
        break
If I run
file ord_dataset-00005539a1e04c809a9a78647bea649c.pb.gz
I see that the file is not identified as gzip but as raw text.
I will post here if I find a workaround.
Interesting part: on WSL it does work. I downloaded everything on Windows and am now opening it in WSL.
file ord-data/data/00/ord_dataset-00005539a1e04c809a9a78647bea649c.pb.gz
ord-data/data/00/ord_dataset-00005539a1e04c809a9a78647bea649c.pb.gz: gzip compressed data, was "ord_dataset-00005539a1e04c809a9a78647bea649c.pb", last modified: Thu Jan 1 00:00:01 1970, max compression, original size modulo 2^32 1481047
So it seems that Windows fixes the data types based on the file suffix? Or something?
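One way to check what a file really contains, independent of its name or the operating system, is to look at its first two bytes, which is roughly what the gzip module does before raising "Not a gzipped file". The sketch below is only an illustration; the helper name looks_like_gzip is hypothetical and not part of ord-schema.

```python
GZIP_MAGIC = b"\x1f\x8b"  # the two bytes every gzip stream starts with


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes.

    Hypothetical helper for illustration; not part of ord-schema.
    """
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC
```

If this returns False for a .pb.gz file, the problem is the file content itself rather than the Python or ord-schema version.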
I found out what went wrong for me: git-lfs did not work, so the whole file contained only
version https://git-lfs.github.com/spec/v1
oid sha256:491531891f31ad2bbfdfccf347e265fa82b7e2bd7fda5813d6473dd309c29d4a
size 261136
For those getting here in the future: try looking at the size of the .pb.gz files. If they are only about 1 kB, you might need to look at your git-lfs setup.
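A pointer file like the one shown above starts with the text "version https://git-lfs.github.com/spec/v1", which would also explain the b've' in the error message: those are the first two bytes of "version". As a rough sketch, assuming the datasets live under a local ord-data checkout, the following scans for .pb.gz files that are still pointers. The function names are hypothetical and not part of ord-schema.

```python
from pathlib import Path

# First line of every git-lfs pointer file (see the pointer contents above).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file starts with the git-lfs pointer header."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_lfs_pointers(root):
    """Return all *.pb.gz files under `root` that are still git-lfs pointers."""
    return sorted(p for p in Path(root).rglob("*.pb.gz") if is_lfs_pointer(p))
```

For example, find_lfs_pointers("ord-data/data") should return an empty list once git lfs pull has fetched the real files.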
@wagenrace thanks for digging; I wish there was a way to require git-lfs before cloning.
The only thing I can think of is to add a check in the ord_schema package, so that if this happens the error is clearer to the user.
Something like: try opening the gzip file, and if that fails, return the message "The gzip file could not be opened. Did git lfs pull run correctly?"
It is not a solution, but it might be nice.
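The check suggested above could look roughly like the sketch below. It is not how ord_schema behaves today; read_gzip_with_hint is a hypothetical wrapper, and the wording of the message is only a suggestion.

```python
import gzip

# First line of every git-lfs pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def read_gzip_with_hint(path):
    """Read a gzip file, raising a clearer error for git-lfs pointer files.

    Hypothetical wrapper for illustration; not part of ord_schema.
    """
    try:
        with gzip.open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        raise  # a missing file is a different problem; leave it unchanged
    except OSError as error:
        # gzip.BadGzipFile (Python 3.8+) is a subclass of OSError,
        # and Python 3.7 raises a plain OSError here.
        with open(path, "rb") as raw:
            head = raw.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            raise ValueError(
                f"{path} is a git-lfs pointer, not the dataset itself. "
                "Did `git lfs pull` run correctly?"
            ) from error
        raise  # some other gzip problem: re-raise the original error
```

Checking for the pointer header only after the read fails keeps the normal path unchanged and only adds work when something has already gone wrong.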
Describe the bug
I downloaded the datasets but can't use them. It seems the format is wrong.
To Reproduce
Steps to reproduce the behavior:
~/Software/miniconda3/envs/sanqi/lib/python3.7/site-packages/ord_schema/message_helpers.py in load_message(filename, message_type)
737 return text_format.Parse(f.read(), message_type())
738 if input_format == MessageFormat.BINARY:
--> 739 return message_type.FromString(f.read())
740 except (
741 json_format.ParseError,
~/Software/miniconda3/envs/sanqi/lib/python3.7/gzip.py in read(self, size)
285 import errno
286 raise OSError(errno.EBADF, "read() on write-only GzipFile object")
--> 287 return self._buffer.read(size)
288
289 def read1(self, size=-1):
~/Software/miniconda3/envs/sanqi/lib/python3.7/gzip.py in read(self, size)
472 # jump to the next member, if there is one.
473 self._init_read()
--> 474 if not self._read_gzip_header():
475 self._size = self._pos
476 return b""
~/Software/miniconda3/envs/sanqi/lib/python3.7/gzip.py in _read_gzip_header(self)
420
421 if magic != b'\037\213':
--> 422 raise OSError('Not a gzipped file (%r)' % magic)
423
424 (method, flag,
OSError: Not a gzipped file (b've')
In [26]: !gzip -d ord_dataset-00005539a1e04c809a9a78647bea649c.pb.gz
gzip: ord_dataset-00005539a1e04c809a9a78647bea649c.pb.gz: not in gzip format