janosh / matbench-discovery

An evaluation framework for machine learning models simulating high-throughput materials discovery.
https://matbench-discovery.materialsproject.org
MIT License

df_summary.index contains nan values #27

Closed by pbenner 1 year ago

pbenner commented 1 year ago
> python data/wbm/fetch_process_wbm_dataset.py
[...]
Traceback (most recent call last):
  File "/home/pbenner/Source/tmp/matbench-discovery/data/wbm/fetch_process_wbm_dataset.py", line 331, in <module>
    df_summary.index = df_summary.index.map(increment_wbm_material_id)  # format IDs
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6158, in map
    new_values = self._map_values(mapper, na_action=na_action)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/core/base.py", line 924, in _map_values
    new_values = map_f(values, mapper)
  File "pandas/_libs/lib.pyx", line 2834, in pandas._libs.lib.map_infer
  File "/home/pbenner/Source/tmp/matbench-discovery/data/wbm/fetch_process_wbm_dataset.py", line 147, in increment_wbm_material_id
    prefix, step_num, material_num = wbm_id.split("_")
AttributeError: 'float' object has no attribute 'split'

This is caused by NaN values in df_summary.index at positions 185450, 185451, 185473, 185474, 185476, 185477.
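For anyone hitting the same error, here is a minimal sketch of how missing index labels like these could be located and skipped before calling `.map`. It uses a toy DataFrame rather than the real `df_summary`. The `format_id` helper and the `e_form` column are hypothetical stand-ins: only the `split("_")` line is taken from the traceback, the rest is assumed. This is not necessarily how the script itself was fixed.

```python
import pandas as pd

# Toy stand-in for df_summary: string IDs as index, two of them missing (NaN).
df = pd.DataFrame(
    {"e_form": [0.1, 0.2, 0.3, 0.4]},  # hypothetical column
    index=["step_1_0", float("nan"), "step_1_2", float("nan")],
)

# Boolean mask and integer positions of the missing index labels.
nan_mask = df.index.isna()
nan_positions = nan_mask.nonzero()[0].tolist()
print(f"NaN index labels at positions {nan_positions}")


def format_id(wbm_id: str) -> str:
    """Hypothetical formatter; only the split() line mirrors the traceback."""
    prefix, step_num, material_num = wbm_id.split("_")
    return f"{prefix}-{step_num}-{int(material_num) + 1}"


# Drop rows without an ID, then map over the remaining (string-only) index.
df_clean = df[~nan_mask].copy()
df_clean.index = df_clean.index.map(format_id)
```

Whether dropping those rows is the right call depends on why the IDs are missing in the first place. `Index.map` also accepts `na_action="ignore"`, which leaves NaN labels in place instead of passing them to the mapper.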

janosh commented 1 year ago

Good catch... again.

Sadly this script takes too many resources to run in GH CI. I tried in

https://github.com/janosh/matbench-discovery/blob/1db982a01a8518c5e16038d1893bc0b2840fccac/.github/workflows/test-scripts.yml#L17-L18

but the jobs were killed, probably due to high memory use. So thanks for continually testing this. Hopefully it works now! 🤞