google-deepmind / graph_nets

Build Graph Nets in Tensorflow
https://arxiv.org/abs/1806.01261
Apache License 2.0

Distributed training #121

Closed krzysztofrusek closed 3 years ago

krzysztofrusek commented 4 years ago

As far as I understand, graph_nets represents a batch of graphs as one large disconnected graph. This is very efficient for single-device training, but I think this approach does not work in a distributed training environment. My understanding is that a Strategy shards the batch along the first dimension, so all observations along that dimension must be independent. This is obviously not the case for a disconnected graph: parts of the same graph could end up being processed by different devices, and message passing would then be incorrect. Are there any plans to use e.g. RaggedTensors in graph_nets?
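To illustrate the concern: a minimal plain-Python sketch (not the actual GraphsTuple implementation; graphs here are just node lists plus edge index pairs) of how a batch is merged into one disconnected graph, and why slicing the concatenated node array along the first axis can separate nodes from the edges that reference them:

```python
def merge_graphs(graphs):
    # Each graph: {"nodes": [...], "edges": [(sender, receiver), ...]}.
    # Batching concatenates all nodes and offsets edge indices so the
    # result is one big disconnected graph.
    nodes, edges, offset = [], [], 0
    for g in graphs:
        nodes.extend(g["nodes"])
        edges.extend((s + offset, r + offset) for s, r in g["edges"])
        offset += len(g["nodes"])
    return {"nodes": nodes, "edges": edges}

g1 = {"nodes": [0.1, 0.2], "edges": [(0, 1)]}
g2 = {"nodes": [0.3, 0.4, 0.5], "edges": [(0, 2), (1, 2)]}
batch = merge_graphs([g1, g2])
print(batch["edges"])  # -> [(0, 1), (2, 4), (3, 4)]
# Sharding batch["nodes"] naively along the first axis (e.g. nodes 0-2
# to one device, nodes 3-4 to another) splits g2, yet its edges
# (2, 4) and (3, 4) reference nodes on both devices, so message
# passing across that edge can no longer be computed locally.
```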

alvarosg commented 4 years ago

Hi, thanks for your message!

We currently have no plans to move towards RaggedTensors, due to limitations in GPU and TPU support for ragged tensors. However, there are two ways in which we often run distributed training successfully that do not require any changes to the library:
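One pattern compatible with this idea (a hypothetical plain-Python sketch, not necessarily one of the approaches referred to above) is to shard at the level of whole graphs before merging, so that each replica builds its own disconnected-graph batch and no single graph is ever split across devices:

```python
def shard_graphs(graphs, num_replicas):
    # Round-robin whole graphs across replicas. Each replica later
    # merges only its own shard into one disconnected graph, so all
    # message passing for a given graph happens on a single device.
    shards = [[] for _ in range(num_replicas)]
    for i, g in enumerate(graphs):
        shards[i % num_replicas].append(g)
    return shards

graphs = [{"nodes": [float(n)], "edges": []} for n in range(8)]
shards = shard_graphs(graphs, num_replicas=2)
print([len(s) for s in shards])  # -> [4, 4]
```

Per-replica gradients can then be averaged as usual, since each replica's loss depends only on its own complete graphs.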

Hope this helps!

charlinergr commented 4 years ago

Hi!

Could you provide an example of distributed training?

alvarosg commented 4 years ago

We do not currently have any examples in the library, however some users seem to be doing distributed training with our library: