janosh / matbench-discovery

An evaluation framework for machine learning models simulating high-throughput materials discovery.
https://matbench-discovery.materialsproject.org
MIT License

Fetching `2023-02-07-ppd-mp.pkl.gz` still fails with UnicodeDecodeError #22

Closed: pbenner closed this issue 1 year ago

pbenner commented 1 year ago

`fetch_process_wbm_dataset.py` now fails here:

Warning: '/home/pbenner/.cache/matbench-discovery/1.0.0/mp/2023-02-07-ppd-mp.pkl.gz' associated with key='mp_patched_phase_diagram' does not exist. Would you like to download it now using matbench_discovery.data.load_train_test('mp_patched_phase_diagram'). This will cache the file for future use. [y/n] y
Downloading 'mp_patched_phase_diagram' from https://figshare.com/ndownloader/files/40344451

variable dump:
file='mp/2023-02-07-ppd-mp.pkl.gz',
url='https://figshare.com/ndownloader/files/40344451',
reader=<function read_json at 0x7f31b4681120>,
kwargs={'compression': 'gzip'}
Traceback (most recent call last):
  File "/home/pbenner/Source/tmp/matbench-discovery/data/wbm/fetch_process_wbm_dataset.py", line 538, in <module>
    with gzip.open(DATA_FILES.mp_patched_phase_diagram, "rb") as zip_file:
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/matbench_discovery/data.py", line 217, in __getattribute__
    self._on_not_found(key, msg)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/matbench_discovery/data.py", line 239, in _on_not_found
    load_train_test(key)  # download and cache data file
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/matbench_discovery/data.py", line 111, in load_train_test
    df = reader(url, **kwargs)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 733, in read_json
    json_reader = JsonReader(
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 819, in __init__
    self.data = self._preprocess_data(data)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 831, in _preprocess_data
    data = data.read()
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

However, manual download seems to work.
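
For reference, loading the manually downloaded file like this works fine (just a sketch; the cache path is the one from the warning message above, adjust if yours differs):

```python
import gzip
import pickle
from pathlib import Path

# cache location from the warning message above
ppd_path = Path(
    "~/.cache/matbench-discovery/1.0.0/mp/2023-02-07-ppd-mp.pkl.gz"
).expanduser()

# the file is a gzipped pickle, so it needs pickle.load, not pd.read_json
with gzip.open(ppd_path, "rb") as zip_file:
    ppd_mp = pickle.load(zip_file)
```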

janosh commented 1 year ago

My bad for not handling pickle files separately in `load_train_test()`. Maybe we should rename the function to `load()`, since it's clearly morphing into more than just a training and test set loader. @pbenner Curious to hear your opinion!
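
Something along these lines should do it, i.e. pick the reader from the file extension instead of always calling `pd.read_json` (just a sketch with simplified arguments, not the final implementation):

```python
import pandas as pd


def load(url: str, file: str):
    """Sketch: dispatch to a reader based on the cached file's extension."""
    compression = "gzip" if file.endswith(".gz") else None
    if ".pkl" in file:
        # read_pickle unpickles arbitrary objects, e.g. the patched phase diagram
        return pd.read_pickle(url, compression=compression)
    if ".json" in file:
        return pd.read_json(url, compression=compression)
    return pd.read_csv(url, compression=compression)
```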

pbenner commented 1 year ago

Yes, sounds good! Also, `fetch_process_wbm_dataset.py` could be fully integrated and called when first running `load()`.

pbenner commented 1 year ago

This is the error I get using the new branch:

Traceback (most recent call last):
  File "/home/pbenner/Source/tmp/matbench-discovery/data/wbm/fetch_process_wbm_dataset.py", line 322, in <module>
    assert sum(no_id_mask := df_summary.index.isna()) == 6, f"{sum(no_id_mask)=}"
AssertionError: sum(no_id_mask)=0

janosh commented 1 year ago

Are you using pandas v1.x? I just changed the code from v1 to v2 compatibility. I'll add a lower pin on pandas in pyproject.toml to avoid this in the future.
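
Concretely, that would mean something like this in pyproject.toml (exact version spec still to be decided):

```toml
[project]
dependencies = [
    "pandas>=2.0.0",
]
```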

pbenner commented 1 year ago

Indeed, I had pandas 1.5; I'm now checking with pandas 2.0. Meanwhile, I think `2023-02-07-mp-elemental-reference-entries.json.gz` was modified:

python data/wbm/fetch_process_wbm_dataset.py
Loading 'wbm_summary' from cached file at '/home/pbenner/.cache/matbench-discovery/1.0.0/wbm/2022-10-19-wbm-summary.csv'
Warning: '/home/pbenner/.cache/matbench-discovery/1.0.0/mp/2023-02-07-mp-elemental-reference-entries.json.gz' associated with key='mp_elemental_ref_entries' does not exist. Would you like to download it now using matbench_discovery.data.load_train_test('mp_elemental_ref_entries'). This will cache the file for future use. [y/n] y
Downloading 'mp_elemental_ref_entries' from https://figshare.com/ndownloader/files/40344445

variable dump:
file='mp/2023-02-07-mp-elemental-reference-entries.json.gz',
url='https://figshare.com/ndownloader/files/40344445',
reader=<function read_json at 0x7f9898a875b0>,
kwargs={'compression': 'gzip'}
Traceback (most recent call last):
  File "/home/pbenner/Source/tmp/matbench-discovery/data/wbm/fetch_process_wbm_dataset.py", line 24, in <module>
    from matbench_discovery.energy import get_e_form_per_atom
  File "/home/pbenner/Source/tmp/matbench-discovery/matbench_discovery/energy.py", line 66, in <module>
    pd.read_json(DATA_FILES.mp_elemental_ref_entries, typ="series")
  File "/home/pbenner/Source/tmp/matbench-discovery/matbench_discovery/data.py", line 217, in __getattribute__
    self._on_not_found(key, msg)
  File "/home/pbenner/Source/tmp/matbench-discovery/matbench_discovery/data.py", line 239, in _on_not_found
    load_train_test(key)  # download and cache data file
  File "/home/pbenner/Source/tmp/matbench-discovery/matbench_discovery/data.py", line 111, in load_train_test
    df = reader(url, **kwargs)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 760, in read_json
    json_reader = JsonReader(
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 862, in __init__
    self.data = self._preprocess_data(data)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 874, in _preprocess_data
    data = data.read()
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/gzip.py", line 301, in read
    return self._buffer.read(size)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/_compression.py", line 118, in readall
    while data := self.read(sys.maxsize):
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/gzip.py", line 488, in read
    if not self._read_gzip_header():
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/gzip.py", line 436, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'{\n')
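
Judging from the b'{\n' in the last line, the file currently served by Figshare is plain JSON rather than gzipped. A quick way to confirm (sketch):

```python
import urllib.request

url = "https://figshare.com/ndownloader/files/40344445"  # mp_elemental_ref_entries

# gzip files start with the magic bytes 0x1f 0x8b; b'{' means plain JSON
with urllib.request.urlopen(url) as response:
    magic = response.read(2)

print("gzipped" if magic == b"\x1f\x8b" else f"not gzipped, starts with {magic!r}")
```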

janosh commented 1 year ago

Yeah, I was in the process of updating the Figshare files but then got carried away. That error will be fixed before I merge #26.