Open ant0nsc opened 3 years ago
Could I work on this?
Hi @aryasoni98, thanks for your interest in picking up this task! @ant0nsc has done a great job summarizing the requirements, but I'm happy to clarify further if needed.
For some context: so far we've avoided having multiple nodes write to the same file by creating a unique file per node (see here for example). The files are created within the Lightning modules, so we retrieve the global rank from the trainer to build the unique file names. We don't yet have any code that syncs these files across nodes to produce a single file.
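To illustrate the per-node naming described above, here is a minimal sketch. The helper name `rank_specific_path` and the `_rankNNNN` naming scheme are hypothetical and not taken from the repository; inside a Lightning module the rank would presumably come from `self.trainer.global_rank`.

```python
from pathlib import Path


def rank_specific_path(base: Path, global_rank: int) -> Path:
    """Hypothetical helper: derive a file name that is unique per rank.

    For example, 'out/metrics.csv' with rank 3 becomes 'out/metrics_rank0003.csv'.
    """
    # Insert the zero-padded rank between the file stem and its suffix.
    return base.with_name(f"{base.stem}_rank{global_rank:04d}{base.suffix}")
```

Inside a module this might be called as `rank_specific_path(Path("metrics.csv"), self.trainer.global_rank)`, so that no two processes ever open the same file.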
How can we synchronize files that are written during multi-node training?
Helper function for sync:
(from https://github.com/microsoft/InnerEye-DeepLearning/blob/antonsc/diceloss/InnerEye/ML/models/losses/soft_dice.py)
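One way a file sync helper could look, as a rough sketch and not the code from the linked file: each rank reads its own file, the contents are gathered with `torch.distributed.all_gather_object`, and only rank 0 writes the merged result. The function names `gather_text_across_ranks` and `sync_file` are hypothetical, and the sketch assumes the files are small enough to pass around as strings. Outside a distributed run (or if PyTorch is not installed) it falls back to the local content only.

```python
from pathlib import Path
from typing import List


def gather_text_across_ranks(local_text: str) -> List[str]:
    """Hypothetical helper: collect one string per rank, ordered by rank."""
    try:
        import torch.distributed as dist
    except ImportError:
        # No PyTorch available: behave like a single-process run.
        return [local_text]
    if not (dist.is_available() and dist.is_initialized()):
        # Not running in distributed mode: nothing to gather.
        return [local_text]
    gathered: List[str] = [""] * dist.get_world_size()
    # all_gather_object pickles the object on every rank and fills the list.
    dist.all_gather_object(gathered, local_text)
    return gathered


def sync_file(local_file: Path, merged_file: Path, is_rank_zero: bool = True) -> None:
    """Hypothetical helper: merge per-rank files into one file on rank 0."""
    # Every rank must take part in the gather, so read and gather first.
    parts = gather_text_across_ranks(local_file.read_text())
    if is_rank_zero:
        merged_file.write_text("".join(parts))
```

In a Lightning module, `is_rank_zero` could be passed as `self.trainer.global_rank == 0`. Gathering the contents rather than copying files avoids any assumption that the nodes share a file system, at the cost of holding every rank's file in memory.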
AB#4357