snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Custom datasets #57

Closed: rickyloynd-microsoft closed this issue 4 years ago

rickyloynd-microsoft commented 4 years ago

When developing a new model, it's usually necessary to create some small, temporary, synthetic datasets for debugging, before experimenting with real datasets. What is the recommended way of connecting up our own custom datasets to the rest of OGB? For instance, should we create our own version of GraphPropPredDataset?

weihua916 commented 4 years ago

Great question. I think the quickest way is to create your own version: modify the dataset and split as you like, save them with torch.save(), and load them with torch.load(). Our OGB dataset objects need to handle downloading, preprocessing, splitting, etc., so it is not trivial to quickly add a custom dataset.

FYI: We are now working hard to create a pipeline so that externally contributed datasets can be incorporated into the OGB package.
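The save-and-load approach suggested above could look roughly like the following. This is only a minimal sketch, not an OGB API: the helper `make_synthetic_graphs`, the file name, and the ring-graph construction are made up for illustration, and the dictionary keys (`edge_index`, `node_feat`, `num_nodes`) are assumed to follow the layout of OGB graph objects.

```python
import os
import tempfile

import torch


def make_synthetic_graphs(num_graphs, num_nodes=5, feat_dim=3, seed=0):
    """Build tiny ring graphs with random features and binary labels (hypothetical helper)."""
    gen = torch.Generator().manual_seed(seed)
    graphs = []
    for _ in range(num_graphs):
        src = torch.arange(num_nodes)
        dst = (src + 1) % num_nodes
        # Store each ring edge in both directions; shape is [2, 2 * num_nodes].
        edge_index = torch.stack([torch.cat([src, dst]), torch.cat([dst, src])])
        graphs.append({
            "edge_index": edge_index,
            "node_feat": torch.randn(num_nodes, feat_dim, generator=gen),
            "num_nodes": num_nodes,
        })
    labels = torch.randint(0, 2, (num_graphs, 1), generator=gen)
    return graphs, labels


graphs, labels = make_synthetic_graphs(num_graphs=10)

# A random 60/20/20 split, stored as index tensors.
perm = torch.randperm(10, generator=torch.Generator().manual_seed(0))
split_idx = {"train": perm[:6], "valid": perm[6:8], "test": perm[8:]}

# Save everything in one file and load it back.
path = os.path.join(tempfile.mkdtemp(), "synthetic_dataset.pt")
torch.save({"graphs": graphs, "labels": labels, "split_idx": split_idx}, path)
loaded = torch.load(path)
```

The loaded dictionary can then be wrapped in whatever dataset class the training script expects; that wrapping step depends on the model code and is not shown here.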

rickyloynd-microsoft commented 4 years ago

Thank you for the quick response. I modified the ogbg-molhiv example to create a synthetic dataset, and it seems to work fine with the gin-virtual model from PyG. I hope that your new pipeline will make it easier to create small, synthetic datasets for debugging.

vthost commented 3 years ago

A relatively easy solution, which has worked for me on several datasets so far, is to cut the nodes in ogb.io.read_graph_raw (e.g., take the first 1k) and adapt the edge index, additional information, and train indices accordingly (i.e., filter out anything that refers to a node with index >= 1k).
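The filtering step described above might be sketched as follows. This is an illustration under assumptions, not the code in ogb.io.read_graph_raw: the function `truncate_graph` and its return keys are hypothetical, and the edge index is assumed to be a NumPy array of shape [2, num_edges] with zero-based node indices.

```python
import numpy as np


def truncate_graph(edge_index, node_feat, train_idx, max_nodes):
    """Keep nodes 0..max_nodes-1 and drop everything that refers to other nodes."""
    # An edge survives only if both of its endpoints survive the cut.
    keep_edge = (edge_index[0] < max_nodes) & (edge_index[1] < max_nodes)
    return {
        "edge_index": edge_index[:, keep_edge],
        "node_feat": node_feat[:max_nodes],
        "num_nodes": min(max_nodes, node_feat.shape[0]),
        "train_idx": train_idx[train_idx < max_nodes],
        # The same mask can be applied to per-edge features, if there are any.
        "edge_mask": keep_edge,
    }


# Toy graph with 6 nodes and 5 directed edges (column i is edge i).
edge_index = np.array([[0, 1, 2, 4, 5],
                       [1, 2, 3, 0, 4]])
node_feat = np.arange(12, dtype=np.float32).reshape(6, 2)
train_idx = np.array([0, 2, 5])

small = truncate_graph(edge_index, node_feat, train_idx, max_nodes=4)
print(small["edge_index"])  # only edges among nodes 0..3 remain
print(small["train_idx"])   # training node 5 has been filtered out
```

Because the first `max_nodes` nodes keep their original indices, no relabeling of the edge index is needed; cutting an arbitrary node subset would require an additional index remapping.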

sophiakrix commented 3 years ago

I would also be interested in creating my own datasets in ogb. Is there any update on the pipeline to integrate them, @weihua916? That would be awesome!

weihua916 commented 3 years ago

You can use the DatasetSaver class.

sophiakrix commented 2 years ago

Thanks, that's what I've been looking for @weihua916 !

I just tried to use the DatasetSaver class with a heterogeneous graph, setting the parameter is_hetero=True. This causes an error, saying that this is not implemented yet:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-223-873c0be554a3> in <module>
      1 # constructor
      2 dataset_name = 'ogbg-cgkg'
----> 3 saver = DatasetSaver(dataset_name = dataset_name, is_hetero = True, version = 1)

~/git/ogb/ogb/io/save_dataset.py in __init__(self, dataset_name, is_hetero, version, root)
     38 
     39         if self.dataset_prefix == 'ogbg' and self.is_hetero:
---> 40             raise NotImplementedError('Heterogeneous graph dataset object has not been implemented for graph property prediction yet.')
     41 
     42         if osp.exists(self.dataset_dir):

NotImplementedError: Heterogeneous graph dataset object has not been implemented for graph property prediction yet.

Is there any way to work with heterogeneous graphs then?