open-reaction-database / ord-data

Official data repository for the Open Reaction Database
https://open-reaction-database.org
Creative Commons Attribution Share Alike 4.0 International

Problem about the `gzip` file #162

Closed ZeroDesigner closed 3 months ago

ZeroDesigner commented 1 year ago

**Describe the bug**
I downloaded the datasets but can't use them; the file format seems to be wrong.

**To Reproduce**
Steps to reproduce the behavior:

  1. git clone the datasets
  2. input code
    import ord_schema
    from ord_schema.proto import dataset_pb2
    from ord_schema import message_helpers, validations
    data = message_helpers.load_message('./ord_dataset-00005539a1e04c809a9a78647bea649c.pb.gz', dataset_pb2.Dataset)
  3. See error
    
    ---------------------------------------------------------------------------
    OSError                                   Traceback (most recent call last)
    <ipython-input-7-f2bc4d01351d> in <module>
    ----> 1 data = message_helpers.load_message('ord_dataset-00005539a1e04c809a9a78647bea649c.pb.gz', dataset_pb2.Dataset)

    ~/Software/miniconda3/envs/sanqi/lib/python3.7/site-packages/ord_schema/message_helpers.py in load_message(filename, message_type)
        737                 return text_format.Parse(f.read(), message_type())
        738             if input_format == MessageFormat.BINARY:
    --> 739                 return message_type.FromString(f.read())
        740         except (
        741             json_format.ParseError,

    ~/Software/miniconda3/envs/sanqi/lib/python3.7/gzip.py in read(self, size)
        285             import errno
        286             raise OSError(errno.EBADF, "read() on write-only GzipFile object")
    --> 287         return self._buffer.read(size)
        288
        289     def read1(self, size=-1):

    ~/Software/miniconda3/envs/sanqi/lib/python3.7/gzip.py in read(self, size)
        472                 # jump to the next member, if there is one.
        473                 self._init_read()
    --> 474                 if not self._read_gzip_header():
        475                     self._size = self._pos
        476                     return b""

    ~/Software/miniconda3/envs/sanqi/lib/python3.7/gzip.py in _read_gzip_header(self)
        420
        421         if magic != b'\037\213':
    --> 422             raise OSError('Not a gzipped file (%r)' % magic)
        423
        424         (method, flag,

    OSError: Not a gzipped file (b've')


**Additional context**
1. I also tried decompressing the file directly with `gzip`:

In [26]: !gzip -d ord_dataset-00005539a1e04c809a9a78647bea649c.pb.gz

gzip: ord_dataset-00005539a1e04c809a9a78647bea649c.pb.gz: not in gzip format


2. The older version works:

`8b83754b865c8a9f30667fbea4dfdc892d4dad60`

skearnes commented 1 year ago

Hi @ZeroDesigner, can you provide more details on the versions you have for ord-data and ord-schema? (This is working fine for me.) You should also make sure you're running python 3.10 or later.

ZeroDesigner commented 1 year ago

OK, I tested several versions.

1. python=3.7.16, ord-schema=0.3.25

ord-data$ git log
commit 51b1581a9ac9003576db04a4f1ecc2d9fb3e9097 (HEAD -> main, origin/main, origin/HEAD)
Author: Steven Kearnes <skearnes@relaytx.com>
Date:   Fri Dec 16 09:45:50 2022 -0500

    Bump ord-schema (#157)

    * bump ord-schema version

    * Bump ord-schema version

    * bump ord-schema

    * more splits

    Co-authored-by: EC2 Default User <ec2-user@ip-172-30-4-170.ec2.internal>

Error:

pb_path_2 = 'ord-data/data/00/ord_dataset-00005539a1e04c809a9a78647bea649c.pb.gz'
data = message_helpers.load_message(pb_path_2, dataset_pb2.Dataset)
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-3-d4377973e480> in <module>
----> 1 data = message_helpers.load_message(pb_path_2, dataset_pb2.Dataset)

python3.7/site-packages/ord_schema/message_helpers.py in load_message(filename, message_type)
    737                 return text_format.Parse(f.read(), message_type())
    738             if input_format == MessageFormat.BINARY:
--> 739                 return message_type.FromString(f.read())
    740         except (
    741             json_format.ParseError,

python3.7/gzip.py in read(self, size)
    285             import errno
    286             raise OSError(errno.EBADF, "read() on write-only GzipFile object")
--> 287         return self._buffer.read(size)
    288 
    289     def read1(self, size=-1):

python3.7/gzip.py in read(self, size)
    472                 # jump to the next member, if there is one.
    473                 self._init_read()
--> 474                 if not self._read_gzip_header():
    475                     self._size = self._pos
    476                     return b""

python3.7/gzip.py in _read_gzip_header(self)
    420 
    421         if magic != b'\037\213':
--> 422             raise OSError('Not a gzipped file (%r)' % magic)
    423 
    424         (method, flag,

OSError: Not a gzipped file (b've')

2. python=3.10.11, ord-schema=0.3.53

    ord-data$ git log
    commit 51b1581a9ac9003576db04a4f1ecc2d9fb3e9097 (HEAD -> main, origin/main, origin/HEAD)
    Author: Steven Kearnes <skearnes@relaytx.com>
    Date:   Fri Dec 16 09:45:50 2022 -0500
    
    Bump ord-schema (#157)
    
    * bump ord-schema version
    
    * Bump ord-schema version
    
    * bump ord-schema
    
    * more splits
    
    Co-authored-by: EC2 Default User <ec2-user@ip-172-30-4-170.ec2.internal>

    error:

In [2]:     pb_path_2 = 'ord-data/data/00/ord_dataset-00005539a1e04c809a9a78647bea649c.pb.gz'
   ...: 

In [3]:     data = message_helpers.load_message(pb_path_2, dataset_pb2.Dataset)
   ...: 
---------------------------------------------------------------------------
BadGzipFile                               Traceback (most recent call last)
Cell In[3], line 1
----> 1 data = message_helpers.load_message(pb_path_2, dataset_pb2.Dataset)

File python3.10/site-packages/ord_schema/message_helpers.py:763, in load_message(filename, message_type)
    761         return text_format.Parse(f.read(), message_type())
    762     if input_format == MessageFormat.BINARY:
--> 763         return message_type.FromString(f.read())
    764 except (
    765     json_format.ParseError,
    766     protobuf.message.DecodeError,
    767     text_format.ParseError,
    768 ) as error:
    769     raise ValueError(f"error parsing {filename}: {error}") from error

File python3.10/gzip.py:301, in GzipFile.read(self, size)
    299     import errno
    300     raise OSError(errno.EBADF, "read() on write-only GzipFile object")
--> 301 return self._buffer.read(size)

File /python3.10/_compression.py:118, in DecompressReader.readall(self)
    114 chunks = []
    115 # sys.maxsize means the max length of output buffer is unlimited,
    116 # so that the whole input buffer can be decompressed within one
    117 # .decompress() call.
--> 118 while data := self.read(sys.maxsize):
    119     chunks.append(data)
    121 return b"".join(chunks)

File python3.10/gzip.py:488, in _GzipReader.read(self, size)
    484 if self._new_member:
    485     # If the _new_member flag is set, we have to
    486     # jump to the next member, if there is one.
    487     self._init_read()
--> 488     if not self._read_gzip_header():
    489         self._size = self._pos
    490         return b""

File python3.10/gzip.py:436, in _GzipReader._read_gzip_header(self)
    433     return False
    435 if magic != b'\037\213':
--> 436     raise BadGzipFile('Not a gzipped file (%r)' % magic)
    438 (method, flag,
    439  self._last_mtime) = struct.unpack("<BBIxx", self._read_exact(8))
    440 if method != 8:

BadGzipFile: Not a gzipped file (b've')

skearnes commented 1 year ago

I'm still not able to reproduce this. What happens if you gunzip the file and then try to read it in?

bdeadman commented 3 months ago

Closing due to inactivity.

wagenrace commented 1 month ago

Hey, I had the same problem, and it seems to be Linux-specific. I ran the code below on Linux Mint and on Windows, and it only works on Windows. WSL also works fine for me. Running `gzip` directly in a terminal also gives an error on Linux.

import gzip
from pathlib import Path

data_path = (
    Path("ord-data")
    / "data"
    / "00"
    / "ord_dataset-00005539a1e04c809a9a78647bea649c.pb.gz"
)

with gzip.open(data_path, mode="rb") as f:
    for line in f:
        print(line)
        break

If I run `file ord_dataset-00005539a1e04c809a9a78647bea649c.pb.gz`, the file is not identified as gzip data but as plain text. I will post a workaround if I find one.

wagenrace commented 1 month ago

Interestingly, it does work on WSL. I downloaded everything on Windows and am now opening it in WSL:

file ord-data/data/00/ord_dataset-00005539a1e04c809a9a78647bea649c.pb.gz 
ord-data/data/00/ord_dataset-00005539a1e04c809a9a78647bea649c.pb.gz: gzip compressed data, was "ord_dataset-00005539a1e04c809a9a78647bea649c.pb", last modified: Thu Jan  1 00:00:01 1970, max compression, original size modulo 2^32 1481047

So it seems that Windows fixes the data type based on the file extension? Or something?

wagenrace commented 1 month ago

I found out what went wrong in my case: git-lfs did not work, so the whole file contained only this pointer text:

version https://git-lfs.github.com/spec/v1
oid sha256:491531891f31ad2bbfdfccf347e265fa82b7e2bd7fda5813d6473dd309c29d4a
size 261136

For those getting here in the future: try looking at the size of the .pb.gz files. If they are around 1 kB, you might need to check your git-lfs setup.
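A minimal sketch of how that check could be automated, assuming the clone lives in a local `ord-data` directory. The pointer text above begins with `version`, which also explains the `b've'` in the tracebacks: those are the first two bytes that `gzip` read where it expected the magic bytes `b'\037\213'`. The helper names `is_lfs_pointer` and `find_lfs_pointers` are hypothetical and not part of ord-schema or git-lfs.

```python
from pathlib import Path

# First line of a git-lfs pointer file, as shown in the comment above.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file starts with the git-lfs pointer header."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_lfs_pointers(root):
    """Return sorted paths of *.pb.gz files under root that are still pointers."""
    return sorted(p for p in Path(root).rglob("*.pb.gz") if is_lfs_pointer(p))


# Example use (the directory name is an assumption):
#     for path in find_lfs_pointers("ord-data/data"):
#         print("still an LFS pointer:", path)
```

If any paths are reported, running `git lfs pull` inside the clone (with git-lfs installed) should replace the pointers with the real files.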

skearnes commented 1 month ago

@wagenrace thanks for digging; I wish there was a way to require git-lfs before cloning.

wagenrace commented 1 month ago

The only thing I can think of is to add a check in the ord_schema package, so that the error is clearer to the user when this happens. Something like: try opening the gzip file, and if that fails, return the message "The gzip file could not be opened. Did git lfs pull complete correctly?"
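A rough sketch of what such a check might look like on the user side, written as a standalone helper rather than as a change to ord_schema itself. The function name `read_gzip_or_explain` is hypothetical; it only uses the standard-library `gzip` module and inspects the start of the file before decompressing.

```python
import gzip

# Start of the first line of a git-lfs pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs"


def read_gzip_or_explain(path):
    """Return the decompressed bytes of a .gz file.

    Raises ValueError with a git-lfs hint if the file is an LFS pointer
    instead of real gzip data.
    """
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    if head == LFS_POINTER_PREFIX:
        raise ValueError(
            f"{path} is a git-lfs pointer, not gzip data. "
            "Did `git lfs pull` complete correctly?"
        )
    with gzip.open(path, "rb") as f:
        return f.read()
```

The returned bytes could then be parsed with `dataset_pb2.Dataset.FromString(...)`, the same call that `load_message` makes in the tracebacks above.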

It is not a solution, but it might be nice.