snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Pandas 2.0.0 Compatibility #419

Closed tempoxylophone closed 1 year ago

tempoxylophone commented 1 year ago

Pandas 2.0.0 was released today. A new argument, dtype_backend, was added to the read_csv() function, and the default behavior when reading null values appears to have changed.

When the respective master.csv files are read with Pandas 2.0.0, the string value "None" now appears to be parsed to NaN by default. This is problematic wherever the meta_info dictionary's 'additional edge files' and 'additional node files' properties are assumed to be strings, e.g.:

if self.meta_info['additional node files'] == 'None':
    additional_node_files = []
else:
    additional_node_files = self.meta_info['additional node files'].split(',')

The .split(",") function called on these will throw AttributeError: 'float' object has no attribute 'split'.
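A minimal reproduction of the failure mode (the dataset name and csv fragment are made up; the pandas-2.0 default is simulated by passing na_values=["None"] explicitly, so this fails the same way on older pandas versions too):

```python
import io

import pandas as pd

# A csv fragment shaped like master.csv, with a field literally "None".
csv_text = "dataset,additional node files\nogbn-example,None\n"

# Simulate the pandas 2.0 default, which treats "None" as a missing value,
# by listing it explicitly in na_values (reproducible on any version).
meta = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values=["None"])
value = meta.loc["ogbn-example", "additional node files"]

print(type(value))  # the string "None" was parsed to a float NaN
try:
    value.split(",")
except AttributeError as e:
    print(e)  # no attribute 'split' on a float
```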

There are more places beyond the two properties above where this exception can occur.

I considered two options: either edit the script and the corresponding master.csv files to contain the empty string instead of 'None', which I expected to be parsed as "" instead of NaN, or add the keyword argument keep_default_na=False to every instance of pd.read_csv where this could be an issue. That keyword argument prevents the "None" strings from being parsed as NaN.

Since the latter option touches more call sites and would require a larger diff, I opted for the former approach, which involved editing the make_master_file.py scripts in their respective directories. Along the way I may have discovered a small inconsistency in ogb/linkproppred/make_master_file.py: the has_edge_attr property for the ogbl-vessel dataset is set to False in that script, but the committed csv file in the latest release has it set to True.

tempoxylophone commented 1 year ago

Closing this, because editing the csv files in this way reproduces the exact problem in earlier versions of pandas (empty fields are also parsed as NaN by default). A better solution would be to add the keep_default_na=False keyword argument everywhere the csv files are read.
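A minimal sketch of that fix, using a made-up csv fragment shaped like master.csv (the dataset name is hypothetical):

```python
import io

import pandas as pd

csv_text = "dataset,additional node files\nogbn-example,None\n"

# keep_default_na=False stops read_csv from converting "None" (and the
# other default NA strings) into NaN, so the field survives as a string
# regardless of the pandas version.
meta = pd.read_csv(io.StringIO(csv_text), index_col=0, keep_default_na=False)
value = meta.loc["ogbn-example", "additional node files"]

# The existing 'None' sentinel check now works as intended.
if value == "None":
    additional_node_files = []
else:
    additional_node_files = value.split(",")

print(additional_node_files)  # → []
```

One caveat: keep_default_na=False also disables NA parsing for genuinely empty fields and strings like "NA", which is acceptable here only because master.csv encodes missing values explicitly as "None".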