mlfoundations / open_clip

An open source implementation of CLIP.

On the bug of sampling the same shards between different nodes in `ResampledShards2` when using `resample` #743

Closed · SCZwangxiao closed this issue 10 months ago

SCZwangxiao commented 10 months ago

When using `resample`, the code uses `ResampledShards2` with deterministic behavior: https://github.com/mlfoundations/open_clip/blob/91923dfc376afb9d44577a0c9bd0930389349438/src/training/data.py#L313-L319

where the `pytorch_worker_seed()` function relies on `get_worker_info()` from `torch.utils.data`: https://github.com/mlfoundations/open_clip/blob/91923dfc376afb9d44577a0c9bd0930389349438/src/training/data.py#L222-L233

However, `get_worker_info()` only returns information about the workers of the current process's dataloader, so workers with the same worker id on different nodes end up with the same seed and therefore sample the same shards.
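To illustrate the concern, here is a minimal hypothetical sketch (not the open_clip code) of a deterministic shard sampler seeded from a per-worker seed: if worker 0 on node A and worker 0 on node B report the same seed, they draw identical shard sequences.

```python
# Hypothetical sketch, not the open_clip implementation: a deterministic
# sampler seeded per worker. If worker 0 on node A and worker 0 on node B
# end up with the same seed, they draw identical shard sequences.
import random

def sample_shards(shard_urls, n, seed):
    rng = random.Random(seed)  # deterministic given the seed
    return [rng.choice(shard_urls) for _ in range(n)]

shards = [f"shard-{i:05d}.tar" for i in range(100)]
base_seed, worker_id = 42, 0
print(sample_shards(shards, 4, base_seed + worker_id))  # node A, worker 0
print(sample_shards(shards, 4, base_seed + worker_id))  # node B, worker 0: identical
```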

gabrielilharco commented 10 months ago

Hey @SCZwangxiao, thanks for opening the issue. `get_worker_info().seed` combines each process's base seed with the worker id to create a per-worker seed (`base_seed + worker_id`). This `base_seed` does take rank into account, because we set torch's seed in each process based on its rank (see https://github.com/mlfoundations/open_clip/blob/91923dfc376afb9d44577a0c9bd0930389349438/src/training/main.py#L44C1-L47C29).

So there shouldn't be any overlap in seeds: the base seed is a different 64-bit integer for each process, generated from that process's own RNG (and base seeds are not consecutive across processes). Are you observing something different?
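For reference, rank-aware per-process seeding in the spirit of the linked main.py lines looks roughly like the sketch below (names are illustrative, not copied from open_clip).

```python
# Sketch of rank-aware per-process seeding (illustrative names, not the
# actual open_clip helper).
import random
import numpy as np
import torch

def seed_everything(seed: int = 42, rank: int = 0) -> None:
    # Offsetting by rank gives each process its own torch RNG state. The
    # DataLoader draws its base seed from that state, and each worker's
    # get_worker_info().seed is base_seed + worker_id, so the same worker id
    # on different nodes ends up with a different seed.
    torch.manual_seed(seed + rank)
    np.random.seed(seed + rank)
    random.seed(seed + rank)

# e.g. on rank 3 of a multi-node job:
seed_everything(seed=42, rank=3)
```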

SCZwangxiao commented 10 months ago

https://github.com/mlfoundations/open_clip/blob/91923dfc376afb9d44577a0c9bd0930389349438/src/training/main.py#L44C1-L47C29

Oh, I see! Thank you for the response. I was getting the same seed on every node because I set `torch.manual_seed` from the seed alone, without offsetting it by the rank.

I should have been more careful when migrating between training frameworks.
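As a quick sanity check one could run on each node (illustrative only; the `RANK` environment variable is just one way the global rank might be exposed), logging every dataloader worker's seed makes it easy to confirm that the same worker id no longer collides across ranks once the base seed is offset by rank.

```python
# Illustrative check: print each dataloader worker's seed so that outputs from
# different ranks can be compared. With rank-aware seeding, worker 0 on rank 0
# and worker 0 on rank 1 should report different seeds.
import os
import torch
from torch.utils.data import DataLoader, Dataset, get_worker_info

rank = int(os.environ.get("RANK", "0"))  # however the launcher exposes the rank
torch.manual_seed(42 + rank)             # rank-aware, as discussed above

class TinyDataset(Dataset):
    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return idx

def log_worker_seed(worker_id):
    info = get_worker_info()
    print(f"rank {rank} worker {worker_id}: seed={info.seed}")

if __name__ == "__main__":
    loader = DataLoader(TinyDataset(), num_workers=2, worker_init_fn=log_worker_seed)
    for _ in loader:
        pass
```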