AFAIK there are no "distributed tensors" or anything like that, but PyTorch does support sending and receiving tensors via MPI. This means we have to set up the ghost layers in a distributed simulation by hand, which only affects the streaming step and the grid generation.
This requires a version of PyTorch that is built with MPI (I don't think they have these on conda, so we need to build it by hand).
Here is a list of things that would be required on the way to a usable multinode implementation:
A grid class. Instead of being defined as an np.array inside the flow class, the grid needs to be an instance of a class RegularGrid.
When an MPI process accesses the grid, it should only see the rectangular part of the grid that is associated with its domain, plus the ghost layers. This way, all the distribution functions etc. have the dimensions of the local domain.
The grid class also stores information about the ghost layers and the ranks associated with the neighboring domains.
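A minimal sketch of what such a grid class could store, assuming a slab decomposition along the first axis of a periodic domain (all names here — RegularGrid, local_shape, etc. — are assumptions, not existing lettuce API):

```python
import numpy as np

class RegularGrid:
    """Sketch of a distributed regular grid (hypothetical names).

    The global domain is split into rectangular slabs along axis 0;
    each rank sees only its own slab plus one ghost layer per side.
    """
    def __init__(self, global_shape, rank, num_ranks, ghost=1):
        nx = global_shape[0]
        assert nx % num_ranks == 0, "assume an even split for simplicity"
        self.local_nx = nx // num_ranks
        self.ghost = ghost
        self.rank = rank
        # ranks that own the neighboring slabs (periodic domain assumed)
        self.left = (rank - 1) % num_ranks
        self.right = (rank + 1) % num_ranks
        # shape of the local arrays, including the ghost layers
        self.local_shape = (self.local_nx + 2 * ghost,) + tuple(global_shape[1:])
        # first global index owned by this rank (ghosts excluded)
        self.offset = rank * self.local_nx

grid = RegularGrid((64, 32), rank=1, num_ranks=4)
print(grid.local_shape)       # (18, 32): 16 owned rows + 2 ghost layers
print(grid.left, grid.right)  # 0 2
```

All distribution functions on a rank would then be allocated with `grid.local_shape`, so the collision step needs no changes at all.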
The streaming operator has to do the roll operation plus the communication of information in the ghost layer via torch.distributed.send() and torch.distributed.recv().
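To check that the roll-plus-ghost-layer idea is consistent, here is a single-process numpy sketch that emulates the exchange for two ranks on a periodic 1D domain. In the real implementation the neighbor boundary values would arrive via torch.distributed.send()/recv() instead of being read directly from the other slab:

```python
import numpy as np

# Global periodic 1D field, split across two "ranks" along axis 0.
f_global = np.arange(8.0)
halves = [f_global[:4], f_global[4:]]

def exchange_and_stream(local, left_neighbor, right_neighbor):
    """Stream one lattice unit to the right on a local slab.

    left_neighbor/right_neighbor stand in for the boundary values that
    torch.distributed.send()/recv() would communicate between ranks.
    """
    # pad with ghost cells "received" from the neighbors
    padded = np.concatenate(([left_neighbor[-1]], local, [right_neighbor[0]]))
    # the usual roll, then drop the ghosts: the interior is now correct
    return np.roll(padded, 1)[1:-1]

out0 = exchange_and_stream(halves[0], halves[1], halves[1])
out1 = exchange_and_stream(halves[1], halves[0], halves[0])
# the stitched-together result equals a global roll
assert np.array_equal(np.concatenate([out0, out1]), np.roll(f_global, 1))
```

The same pattern generalizes to each lattice velocity direction; only the populations crossing a subdomain boundary actually need to be communicated.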
When computing global quantities (energy, ...), we need to do a torch.distributed.all_reduce().
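For example, each rank would sum its local contribution while excluding the ghost layers, and the reduction would then combine the per-rank values. Here the reduction is emulated in one process; in the distributed code, torch.distributed.all_reduce with ReduceOp.SUM would leave the global value on every rank:

```python
import numpy as np

def local_energy(u_with_ghosts, ghost=1):
    """Kinetic-energy contribution of the local domain, ghosts excluded.

    Each rank would call this and then run
    torch.distributed.all_reduce(energy, op=dist.ReduceOp.SUM).
    """
    u = u_with_ghosts[ghost:-ghost]
    return 0.5 * float(np.sum(u ** 2))

# emulate two ranks whose padded arrays overlap in the ghost cells
u_global = np.arange(6.0)
rank0 = np.concatenate(([u_global[-1]], u_global[:3], [u_global[3]]))
rank1 = np.concatenate(([u_global[2]], u_global[3:], [u_global[0]]))
total = local_energy(rank0) + local_energy(rank1)  # the "all_reduce"
assert total == 0.5 * float(np.sum(u_global ** 2))
```

Excluding the ghost cells before reducing matters: otherwise boundary values are counted twice.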
Distributed I/O of vtk files is possible (see NATriuM), but I don't know if it is supported by pyevtk yet. If not, it should be quite easy to add.
At this point, I am not sure how efficient and feasible such an MPI implementation would be. We should try this outside lettuce first for a mini example before proceeding.
Can we do Multi-GPU over MPI?
Here is a summary of PyTorch's distributed computing capabilities: https://pytorch.org/docs/stable/distributed.html
Here is an example of distributed computing in PyTorch: https://www.glue.umd.edu/hpcc/help/software/pytorch.html