microsoft / InnerEye-DeepLearning

Medical Imaging Deep Learning library to train and deploy 3D segmentation models on Azure Machine Learning
https://aka.ms/innereyeoss
MIT License
557 stars · 142 forks

Add file synchronization support for multiple nodes #551

Open ant0nsc opened 3 years ago

ant0nsc commented 3 years ago

How can we synchronize files that are written during multi-node training?

A helper function for synchronizing tensors across GPUs:

import torch

from pl_bolts.models.self_supervised.simclr.simclr_module import SyncFunction

def synchronize_across_gpus(tensor: torch.Tensor) -> torch.Tensor:
    """
    Synchronizes a tensor across all GPUs, if distributed computation is enabled. The tensors from all GPUs
    are stacked along the batch dimension (dim=0) using torch.cat. If no distributed setup is available, the
    argument is returned unchanged.

    :param tensor: The tensor that should be synchronized, of size [B, ...]
    :return: If torch.distributed is enabled, a tensor of size [B * num_GPUs, ...]. If not distributed,
        the argument of size [B, ...] unchanged.
    """
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        return SyncFunction.apply(tensor)
    return tensor

(from https://github.com/microsoft/InnerEye-DeepLearning/blob/antonsc/diceloss/InnerEye/ML/models/losses/soft_dice.py)

AB#4357

aryasoni98 commented 3 years ago

Could I work on this?

Shruthi42 commented 3 years ago

Hi @aryasoni98, thanks for your interest in picking up this task! @ant0nsc has done a great job summarizing the requirements, but I'm happy to clarify further if needed.

For some context, so far we've dealt with the issue of multiple nodes writing to the same file by creating unique files per node (see here for example). The files are created within the lightning modules, so we retrieve the global rank from the trainer to create unique files. We don't yet have any code that syncs these files across nodes to create a single file.