snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Some problem about loading ogbn-papers100M #404

Closed zhuangbility111 closed 1 year ago

zhuangbility111 commented 1 year ago

Hi! I'm trying to load ogbn-papers100M with the following code:

from ogb.nodeproppred import NodePropPredDataset
dataset = NodePropPredDataset(name='ogbn-papers100M')

but it fails with an error like this (partial traceback):

Traceback (most recent call last):
  File "node_prop_pred_data.py", line 24, in <module>
    dataset = NodePropPredDataset(name=dataset_name, root = dataset_save_location)
  File "/home/min/a/user/data/.venv/lib/python3.6/site-packages/ogb/nodeproppred/dataset.py", line 63, in __init__
    self.pre_process()
  File "/home/min/a/user/data/.venv/lib/python3.6/site-packages/ogb/nodeproppred/dataset.py", line 139, in pre_process
    torch.save({'graph': self.graph, 'labels': self.labels}, pre_processed_file_path, pickle_protocol=4)
  File "/home/min/a/user/data/.venv/lib/python3.6/site-packages/torch/serialization.py", line 372, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
MemoryError -- std::bad_alloc

It looks like an OOM... My machine has 256 GB of memory.

BTW, I also tried to load the raw graph myself. But when I load node-label.npz:

>>> import numpy as np
>>> label = np.load('node-label.npz')
>>> label
<numpy.lib.npyio.NpzFile object at 0x2b15ac5232b0>
>>> label.__dict__
{'_files': ['node_label.npy'], 'files': ['node_label'], 'allow_pickle': False, 'pickle_kwargs': {'encoding': 'ASCII', 'fix_imports': True}, 'zip': <zipfile.ZipFile file=<_io.BufferedReader name='node-label.npz'> mode='r'>, 'f': <numpy.lib.npyio.BagObj object at 0x2b15df759640>, 'fid': <_io.BufferedReader name='node-label.npz'>}
>>> node_label = label['node_label']
>>> node_label
array([[ nan],
       [ nan],
       [ nan],
       ...,
       [157.],
       [ nan],
       [ nan]], dtype=float32)
>>>

Why are there so many nan values in the label file?
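For reference, nan entries like these can be masked out with plain NumPy. A minimal sketch on a toy array with the same `(num_nodes, 1)` float32 layout as `node_label` (the array values here are made up for illustration):

```python
import numpy as np

# Toy stand-in for node_label: unlabeled nodes stored as nan,
# labeled nodes stored as float class ids.
node_label = np.array([[np.nan], [np.nan], [157.0], [3.0], [np.nan]],
                      dtype=np.float32)

# Boolean mask over nodes; nan marks an unlabeled node.
labeled_mask = ~np.isnan(node_label).reshape(-1)

labeled_idx = np.nonzero(labeled_mask)[0]           # indices of labeled nodes
labels = node_label[labeled_mask].astype(np.int64)  # integer class ids

print(labeled_idx)         # -> [2 3]
print(labels.reshape(-1))  # -> [157   3]
```

The same masking works on the full ogbn-papers100M label array, where only a small fraction of the ~111M nodes carry labels.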

weihua916 commented 1 year ago

I think the error is due to a corrupted file. You can delete the old file and download it again. The nan values mean the node is unlabeled.
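The suggested fix can be sketched as follows. This assumes the default `dataset/` root used by `NodePropPredDataset` when no `root` argument is given; adjust the path to wherever the data was actually saved:

```python
import shutil
from pathlib import Path

# Assumed location of the (possibly corrupted) download and processed cache;
# change 'dataset' to your own root directory if you passed one.
root = Path('dataset') / 'ogbn_papers100M'

if root.exists():
    shutil.rmtree(root)  # drop raw download and pre-processed files

# Re-loading will then trigger a fresh download and pre-processing:
# from ogb.nodeproppred import NodePropPredDataset
# dataset = NodePropPredDataset(name='ogbn-papers100M')
```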