microsoft / tf2-gnn

TensorFlow 2 library implementing Graph Neural Networks
MIT License

Loading custom dataset and building the graph in batches #52

Closed imayachita closed 3 years ago

imayachita commented 3 years ago

Hello!

I have my own dataset and want to use this library to train GNNs. The dataset is in TFRecords format and is read through tf.data.Dataset's as_numpy_iterator(), where each element yielded by the iterator represents one graph. How can I feed this data into the library? I want to construct one large training graph (made up of the multiple graphs from the numpy iterator) and one large test graph, both following the PPI dataset format. I assume this means the train and test graph objects would have to be generated in batches. Is that possible? Or, if the resulting graph object or matrices are too large, how can I train in minibatches? Thanks

megstanley commented 3 years ago

The GraphDataset object assembles batches, each of which is a single large graph (which may be disconnected, i.e. composed of many smaller graphs). The batching builds the largest graph it can from the component graphs, subject to the dataset parameter "max_nodes_per_batch", which can be set according to need and memory limitations. To read from a numpy iterator, one solution may be to write a graph sample iterator that consumes your data and reassembles it into the graph sample form (as seen in the JsonLGraphDataset example). The graph batching method would then use that iterator, and the result would be supplied through the get_tensorflow_dataset method of the parent GraphDataset, with the correct batching implemented. Is this what you intended?
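
To make the batching idea concrete, here is a minimal sketch in plain NumPy of merging small graphs into one disconnected graph under a node budget. It is not the library's implementation: the function names, the dictionary keys ("node_features", "edges", "node_to_graph_map") and the greedy policy are assumptions made for illustration only.

```python
from typing import Dict, Iterable, Iterator, List

import numpy as np


def _merge(features: List[np.ndarray], edges: List[np.ndarray],
           graph_ids: List[np.ndarray]) -> Dict[str, np.ndarray]:
    """Concatenate the collected pieces into one (disconnected) batch graph."""
    return {
        "node_features": np.concatenate(features, axis=0),
        "edges": np.concatenate(edges, axis=0),
        "node_to_graph_map": np.concatenate(graph_ids, axis=0),
    }


def batch_by_node_budget(
    graphs: Iterable[Dict[str, np.ndarray]], max_nodes_per_batch: int
) -> Iterator[Dict[str, np.ndarray]]:
    """Greedily pack graphs into batches of at most max_nodes_per_batch nodes.

    Each input graph is a dict (hypothetical layout) with
    "node_features" of shape [V, D] and "edges" of shape [E, 2].
    A single graph larger than the budget is emitted alone.
    """
    features: List[np.ndarray] = []
    edges: List[np.ndarray] = []
    graph_ids: List[np.ndarray] = []
    num_nodes = 0

    for graph in graphs:
        n = graph["node_features"].shape[0]
        # Close the current batch if this graph would exceed the budget.
        if features and num_nodes + n > max_nodes_per_batch:
            yield _merge(features, edges, graph_ids)
            features, edges, graph_ids, num_nodes = [], [], [], 0
        # Record which graph in the batch each node belongs to.
        graph_ids.append(np.full(n, len(features), dtype=np.int32))
        features.append(graph["node_features"])
        # Shift node indices so they stay valid in the merged graph.
        edges.append(graph["edges"] + num_nodes)
        num_nodes += n

    if features:
        yield _merge(features, edges, graph_ids)
```

Because the generator only holds one batch in memory at a time, it can consume a streaming source such as a numpy iterator without materialising the whole dataset.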

imayachita commented 3 years ago

Could you point me to the JsonLGraphDataset example you referred to? I'm not sure I found it. Thanks

megstanley commented 3 years ago

https://github.com/microsoft/tf2-gnn/blob/master/tf2_gnn/data/jsonl_graph_dataset.py

imayachita commented 3 years ago

Thanks! Any example/tutorial on how to use it (esp with the batching)?

megstanley commented 3 years ago

Following how a dataset and model are instantiated from this point may show how to use it: https://github.com/microsoft/tf2-gnn/blob/master/tf2_gnn/cli/train.py

The JsonLGraphDataset is one specific example of use; using it with a different input format requires custom load_data and _graph_iterator methods (see the base class GraphDataset). Another example is https://github.com/microsoft/tf2-gnn/blob/master/tf2_gnn/data/jsonl_graph_property_dataset.py
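
As a rough illustration of what a custom graph iterator might do, the sketch below turns numpy records into per-graph samples with one adjacency list per edge type. It deliberately does not import tf2_gnn: SimpleGraphSample is a hypothetical stand-in for the library's graph sample type, and the record keys ("node_features", "edges", "edge_types") are assumed names for whatever the TFRecords parsing yields. The real field names and method signatures should be taken from the GraphDataset base class.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, List

import numpy as np


@dataclass
class SimpleGraphSample:
    """Hypothetical stand-in for the library's graph sample type."""
    adjacency_lists: List[np.ndarray]        # one [E_t, 2] array per edge type
    type_to_node_to_num_inedges: np.ndarray  # shape [num_edge_types, V]
    node_features: np.ndarray                # shape [V, D]


def record_to_sample(record: Dict[str, Any], num_edge_types: int) -> SimpleGraphSample:
    """Convert one record (assumed keys) into a per-graph sample."""
    node_features = np.asarray(record["node_features"], dtype=np.float32)
    edges = np.asarray(record["edges"], dtype=np.int32).reshape(-1, 2)
    edge_types = np.asarray(record["edge_types"], dtype=np.int32).reshape(-1)
    num_nodes = node_features.shape[0]

    adjacency_lists = []
    in_edge_counts = []
    for edge_type in range(num_edge_types):
        typed_edges = edges[edge_types == edge_type]
        adjacency_lists.append(typed_edges)
        # Count incoming edges of this type for every node.
        in_edge_counts.append(np.bincount(typed_edges[:, 1], minlength=num_nodes))

    return SimpleGraphSample(adjacency_lists, np.stack(in_edge_counts), node_features)


def graph_iterator(records: Iterable[Dict[str, Any]],
                   num_edge_types: int) -> Iterator[SimpleGraphSample]:
    """Lazily yield one sample per record, e.g. from as_numpy_iterator()."""
    for record in records:
        yield record_to_sample(record, num_edge_types)
```

In an actual subclass, logic like this would presumably live in load_data (to parse and store the records for each fold) and _graph_iterator (to yield the stored samples), leaving the batching to the parent class.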

As an alternative to making a custom dataloader, one could place the data in the JsonLines format specified in the repo README and directly use JsonLGraphDataset or JsonLGraphPropertyDataset. Both loaders implement batching, with the maximum number of graph nodes per batch chosen by the user.
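
For the JsonLines route, a conversion script could look roughly like the following. The key names ("graph", "node_features", "adjacency_lists", "Property") and the gzip-compressed output are assumptions about the expected layout and should be checked against the README before use; make_record and write_jsonl_gz are hypothetical helper names.

```python
import gzip
import json
from typing import Any, Dict, Iterable, List, Optional


def make_record(node_features: List[List[float]],
                adjacency_lists: List[List[List[int]]],
                target: Optional[float] = None) -> Dict[str, Any]:
    """Build one JSON-serialisable graph record (key names are assumed)."""
    record: Dict[str, Any] = {
        "graph": {
            "node_features": node_features,
            # One list of [source, target] pairs per edge type.
            "adjacency_lists": adjacency_lists,
        }
    }
    if target is not None:
        # Only needed for graph-level property prediction.
        record["Property"] = target
    return record


def write_jsonl_gz(path: str, records: Iterable[Dict[str, Any]]) -> None:
    """Write one JSON object per line to a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
```

Since write_jsonl_gz accepts any iterable, the records can be generated lazily from the numpy iterator, so the full dataset never has to sit in memory during conversion.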