pytorch / ignite

High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.
https://pytorch-ignite.ai
BSD 3-Clause "New" or "Revised" License

Improve idist for gather using nccl and reduce for gloo + gpu #2260

Open sdesrozis opened 2 years ago

sdesrozis commented 2 years ago

🚀 Feature

Consider the following piece of code:

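import torch
import ignite.distributed as idist

def write_preds_to_file(predictions, filename):
    prediction_tensor = torch.tensor(predictions)
    # Every process receives the full gathered tensor, although only rank 0 uses it
    prediction_tensor = idist.all_gather(prediction_tensor)

    # Only rank 0 writes the gathered predictions to disk
    if idist.get_rank() == 0:
        torch.save(prediction_tensor, filename)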
Here idist.all_gather() collects the tensor on every process, even though only rank 0 needs it. A gather() call would be the natural fit, but the nccl backend does not support it. See here.

The idea is to implement a gather() method in idist that falls back to all_gather() for nccl and uses the native gather() for other backends. Note that reduce() for gloo on GPU could be implemented with all_reduce() in a similar way.
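A minimal sketch of the dispatch described above, not the idist implementation. The function name gather_with_fallback, its parameters, and the two injected callables (native_gather, all_gather) are hypothetical and only stand in for the backend collectives, so the logic can be read and run without a distributed setup.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def gather_with_fallback(
    value: T,
    rank: int,
    dst: int,
    backend: str,
    native_gather: Callable[[T, int], Optional[List[T]]],
    all_gather: Callable[[T], List[T]],
) -> Optional[List[T]]:
    """Return the values of all ranks on `dst` and None on every other rank.

    Hypothetical helper: `native_gather` and `all_gather` are placeholders
    for the collectives of the active backend.
    """
    if backend != "nccl":
        # Other backends: use the real gather, so only `dst` receives data.
        return native_gather(value, dst)

    # nccl fallback (per the issue): every rank takes part in all_gather,
    # then all ranks except `dst` discard the gathered result.
    gathered = all_gather(value)
    return gathered if rank == dst else None
```

In practice the two callables would wrap the backend's collectives (for instance torch.distributed.gather and torch.distributed.all_gather); that wiring is left out here. The same pattern would apply to reduce() on gloo with GPU tensors, with all_reduce() as the fallback. Note that the fallback still moves the data to every rank, so it fixes the API, not the communication cost.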

This needs tests and docs.

fco-dv commented 2 years ago

cool feature @sdesrozis !