nnaisense / bayesian-flow-networks

This is the official code release for Bayesian Flow Networks.
Apache License 2.0

Dataloader workers on different GPUs may get the same randomness in multi-process training #3

Closed: chrisway613 closed this issue 8 months ago

chrisway613 commented 8 months ago

Hi, it's me again! I think there may be a problem with how the dataloader reseeds workers in multi-GPU training: workers with the same worker_id on different GPUs will get the same randomness if we seed them the way the repo currently does, as shown below:

https://github.com/nnaisense/bayesian-flow-networks/blob/896ea205debb4896b27a61e79e378b720a926309/utils_train.py#L60

def worker_init_function(worker_id: int) -> None:
    """https://pytorch.org/docs/stable/notes/randomness.html#dataloader"""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

https://github.com/nnaisense/bayesian-flow-networks/blob/896ea205debb4896b27a61e79e378b720a926309/utils_train.py#L67

def get_generator(seed: int):
    g = torch.Generator()
    g.manual_seed(seed)
    return g
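For context, here is a minimal sketch of why the collision happens. To my understanding, the DataLoader draws its _base_seed from the generator passed to it, and worker w then sees torch.initial_seed() == _base_seed + w, which worker_init_function feeds to numpy and random. Since every rank calls get_generator with the same seed, every rank draws the same _base_seed, so workers with equal worker_id on different GPUs end up identical (the base-seed line below mirrors how I believe PyTorch derives it and is an assumption about its internals):

    import torch

    def get_generator(seed: int):
        g = torch.Generator()
        g.manual_seed(seed)
        return g

    seed = 42
    for rank in range(2):
        g = get_generator(seed)  # same seed on every rank
        # Mimics how the DataLoader draws its _base_seed from the generator.
        base_seed = torch.empty((), dtype=torch.int64).random_(generator=g).item()
        # Worker w sees torch.initial_seed() == base_seed + w, so these are the
        # values worker_init_function would feed to numpy/random on this rank.
        worker_seeds = [(base_seed + w) % 2**32 for w in range(4)]
        print(f"rank {rank}: {worker_seeds}")  # identical lists on both "ranks"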

One way to avoid this problem is to seed the generator with both the specified seed and the rank, which might look like this:

def get_generator(seed: int):
    import torch.distributed as dist

    # Offset the base seed by the process rank so each rank's dataloader
    # generator (and therefore each worker's base seed) is different.
    rank = dist.get_rank()
    seed += rank

    g = torch.Generator()
    g.manual_seed(seed)

    return g

Following this approach, we don't even have to set worker_init_fn in the dataloader: different GPUs will have different _base_seed values in their dataloaders, so each worker on each GPU ends up with its own unique randomness.
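To make the suggestion concrete, here is a minimal wiring sketch. make_loader is a hypothetical helper, not the repo's API, and a real DDP setup would typically use a DistributedSampler rather than shuffle=True:

    import torch
    import torch.distributed as dist
    from torch.utils.data import DataLoader

    def make_loader(dataset, batch_size: int, seed: int, num_workers: int = 4) -> DataLoader:
        # Offset the seed by the process rank so each rank's loader draws a
        # different _base_seed, giving its workers different randomness.
        rank = dist.get_rank() if dist.is_initialized() else 0
        g = torch.Generator()
        g.manual_seed(seed + rank)
        return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                          num_workers=num_workers, generator=g)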

flukeskywalker commented 8 months ago

I don't think it affects the current configs, but I've made things cleaner and also made the seed required (instead of unsuccessfully trying to autogenerate one when not provided) here. Have a look.

chrisway613 commented 8 months ago

But this makes each rank have a different seed (not only the worker seeds inside the dataloader), which leads to the network on each rank being initialized with different parameter values (we use random initialization by default in most cases). Even so, DDP can ensure that the initial network parameters are consistent across ranks. Personally, I would prefer to make only the worker seeds inside the dataloader differ between ranks, not all the seeds. But it's all up to you, not a serious problem.
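For illustration, a sketch of that preference (seed_everything and get_dataloader_generator are hypothetical names, not the repo's functions): keep the global seed identical on every rank and offset only the dataloader generator.

    import random
    import numpy as np
    import torch
    import torch.distributed as dist

    def seed_everything(seed: int) -> None:
        # Same value on every rank, so model init and other global randomness match.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)

    def get_dataloader_generator(seed: int) -> torch.Generator:
        # Only the dataloader generator is offset by the rank.
        rank = dist.get_rank() if dist.is_initialized() else 0
        g = torch.Generator()
        g.manual_seed(seed + rank)
        return g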

flukeskywalker commented 8 months ago

Yeah initialization is not an issue since DDP syncs the params and buffers after init. Thanks for flagging.