rzambre opened this issue 5 years ago
In an earlier version I indeed used MPI_THREAD_MULTIPLE
to have multiple computation threads perform their own communication, thereby reducing the load on the communication thread. It turned out to be too unstable at that point in time, as the various MPI distributions would give random errors and deadlocks. It would be worthwhile to explore this again in a future version once the code has been converted to support the TF 2.0 C API.
I see. Do you remember which MPI libraries you experimented with?
With multiple threads participating in communication, there exists a design space that could explore the use of separate communicators, tags, etc. to expose parallel communication to the MPI library. Is there a communication kernel mini-application or microbenchmark that captures the communication pattern of TensorFlow? That would serve well to explore the performance of the different strategies in the design space of parallel MPI communication.
If a mini-app isn't available, I would be happy to help with writing a mini-app that captures the communication pattern of TensorFlow.
https://github.com/tensorflow/networking/blob/master/tensorflow_networking/mpi/mpi_utils.cc#L56
I see the use of MPI_THREAD_MULTIPLE has been commented out. From my understanding of the current design of exchanging data with MPI, we do not require MPI_THREAD_MULTIPLE since a dedicated thread is responsible for communication. Are there future plans of having multiple threads perform communication simultaneously (once MPI implementations better support MPI_THREAD_MULTIPLE, of course)? If so, is it more likely that we have dedicated communication threads, or is it possible that the computation threads also perform communication?