snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

ogbg-molchembl dataset fails due to default pickle_protocol #48

Closed sooheon closed 4 years ago

sooheon commented 4 years ago

How to repro:

from ogb.graphproppred import GraphPropPredDataset

ds = GraphPropPredDataset('ogbg-molchembl', root='/tmp/ogb_datasets')

This fails at the torch.save step of the pre_process method, because pickle "cannot serialize a string larger than 4 gb".

What I've done:

I've tried setting torch.serialization.DEFAULT_PROTOCOL = 4 (which according to this adds support for large objects) before calling the above, but this did not help -- I think the protocol should be passed as an argument to torch.save.
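
One plausible explanation for why rebinding the module-level constant has no effect: Python evaluates default argument values once, when the function is defined. The sketch below uses only the standard library. Both `save` and the local `DEFAULT_PROTOCOL` are hypothetical stand-ins, and it assumes (not confirmed in this thread) that torch.save declares its default the same way.

```python
import pickle

DEFAULT_PROTOCOL = 2  # hypothetical stand-in for a module-level default


def save(obj, pickle_protocol=DEFAULT_PROTOCOL):
    """Hypothetical stand-in for a save function; returns the pickled bytes."""
    # The default value of pickle_protocol was captured when `def` ran.
    return pickle.dumps(obj, protocol=pickle_protocol)


def protocol_of(data):
    """Read the protocol number from a pickle written with protocol >= 2."""
    # Such pickles start with the PROTO opcode (0x80) followed by the version byte.
    if data[0] != 0x80:
        raise ValueError("pickle was written with protocol < 2")
    return data[1]


# Rebinding the global afterwards does not change the already-bound default.
DEFAULT_PROTOCOL = 4
after_rebind = protocol_of(save({"a": 1}))

# Passing the protocol explicitly does take effect.
explicit = protocol_of(save({"a": 1}, pickle_protocol=4))

print(after_rebind, explicit)
```

If the assumption holds, the only reliable way to change the protocol is to pass it at the call site, which matches the suggestion above.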

sooheon commented 4 years ago

This does not occur with the dgl variant.

weihua916 commented 4 years ago

Interesting. I tested the following locally, and it worked fine. Could you clarify what you mean by "This does not occur with the dgl variant."?

from ogb.graphproppred import DglGraphPropPredDataset

d_name = 'ogbg-molchembl'
dataset = DglGraphPropPredDataset(name=d_name)

sooheon commented 4 years ago

Ah, I mean the Dgl variant works, and it's the pure Python dataset which fails to save the pickle file. I think the further processing into the Dgl data structure reduces the size of the pickle enough.

Edited root comment to reflect it's GraphPropPredDataset that fails.

weihua916 commented 4 years ago

You are right, thanks for noticing this. I have resolved the issue in the master branch by passing pickle_protocol=4 to torch.save().
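
As background on why protocol 4 lifts the size limit, the standard-library `pickletools` module records which protocol introduced each pickle opcode. The sketch below is an illustration only, not the code from the fix. It looks up the opcodes that carry an 8-byte length prefix for bytes and str payloads and compares them with their older 4-byte counterparts.

```python
import pickletools

# Map each pickle opcode name to the protocol version that introduced it.
introduced_in = {op.name: op.proto for op in pickletools.opcodes}

# Opcodes that store the payload length in 8 bytes, allowing objects over 4 GiB.
for name in ("BINBYTES8", "BINUNICODE8"):
    print(name, "introduced in protocol", introduced_in[name])

# Older counterparts that store the payload length in only 4 bytes.
for name in ("BINBYTES", "BINUNICODE"):
    print(name, "introduced in protocol", introduced_in[name])
```

With PyTorch, the corresponding call takes the form torch.save(obj, path, pickle_protocol=4), where obj and path are placeholders for the processed dataset and its output file.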