snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Pandas 2.0.0 Compatibility #419

Closed tempoxylophone closed 1 year ago

tempoxylophone commented 1 year ago

Pandas 2.0.0 was released today. A new argument, dtype_backend, was added to the read_csv() function, and the default behavior when reading null values appears to have changed.

When the respective master.csv files are read with Pandas 2.0.0, the string value "None" now appears to be parsed to NaN by default. This is problematic wherever the meta_info dictionary's 'additional edge files' and 'additional node files' properties are assumed to be strings, e.g.:

if self.meta_info['additional node files'] == 'None':
    additional_node_files = []
else:
    additional_node_files = self.meta_info['additional node files'].split(',')

The .split(",") function called on these will throw AttributeError: 'float' object has no attribute 'split'.
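A minimal reproduction of the failure mode (the dataset name and csv fragment are made up; the pandas-2.0 default is simulated by passing na_values=["None"] explicitly, so this fails the same way on older pandas versions too):

```python
import io

import pandas as pd

# A csv fragment shaped like master.csv, with a field literally "None".
csv_text = "dataset,additional node files\nogbn-example,None\n"

# Simulate the pandas 2.0 default, which treats "None" as a missing value,
# by listing it explicitly in na_values (reproducible on any version).
meta = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values=["None"])
value = meta.loc["ogbn-example", "additional node files"]

print(type(value))  # the string "None" was parsed to a float NaN
try:
    value.split(",")
except AttributeError as e:
    print(e)  # no attribute 'split' on a float
```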

There are more places beyond the two properties above where this exception can occur.

I considered two options: either edit the script and the corresponding master.csv files to contain the empty string instead of 'None', which I expected to be parsed as "" instead of NaN, or add the keyword argument keep_default_na=False to every instance of pd.read_csv where this could be an issue. That keyword argument prevents the "None" strings from being parsed as NaN.

Since the latter option touches more call sites and would require a larger diff, I opted for the former approach, which involved editing the make_master_file.py scripts in their respective directories. Along the way I may have discovered a small inconsistency in ogb/linkproppred/make_master_file.py: the has_edge_attr property for the ogbl-vessel dataset is set to False in that script, but the committed csv file in the latest release has it set to True.

tempoxylophone commented 1 year ago

Closing this, because editing the csv files in this way reproduces the exact problem in earlier versions of pandas (empty fields are also parsed as NaN by default). A better solution would be to add the keep_default_na=False keyword argument everywhere the csv files are read.
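A minimal sketch of that fix, using a made-up csv fragment shaped like master.csv (the dataset name is hypothetical):

```python
import io

import pandas as pd

csv_text = "dataset,additional node files\nogbn-example,None\n"

# keep_default_na=False stops read_csv from converting "None" (and the
# other default NA strings) into NaN, so the field survives as a string
# regardless of the pandas version.
meta = pd.read_csv(io.StringIO(csv_text), index_col=0, keep_default_na=False)
value = meta.loc["ogbn-example", "additional node files"]

# The existing 'None' sentinel check now works as intended.
if value == "None":
    additional_node_files = []
else:
    additional_node_files = value.split(",")

print(additional_node_files)  # → []
```

One caveat: keep_default_na=False also disables NA parsing for genuinely empty fields and strings like "NA", which is acceptable here only because master.csv encodes missing values explicitly as "None".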