Hey @SCZwangxiao, thanks for opening the issue. `get_worker_info().seed` uses the base seed of each process and the worker id to create a seed for each worker (by computing `base_seed + worker_id`). This `base_seed` does take rank into account, because we set torch's seed for each process based on the rank (see https://github.com/mlfoundations/open_clip/blob/91923dfc376afb9d44577a0c9bd0930389349438/src/training/main.py#L44C1-L47C29). So there shouldn't be any overlap in seeds: the base seed is a different long for each process, generated from that process's own RNG, and base seeds are not consecutive across processes. Are you observing something different?
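To make the chain concrete, here is a minimal sketch (the `random_seed` shape mirrors the linked `main.py` lines; the base-seed draw mimics what `torch.utils.data.DataLoader` does internally when it starts workers, so this is illustrative rather than the exact library code):

```python
import torch

def random_seed(seed=42, rank=0):
    # Seed each process's torch RNG with seed + rank (as in the linked
    # main.py lines), so every rank starts from a different RNG state.
    torch.manual_seed(seed + rank)

for rank in range(2):
    random_seed(seed=42, rank=rank)
    # When a DataLoader starts its workers, it draws one int64 base seed
    # from this (now rank-dependent) RNG, roughly like this:
    base_seed = torch.empty((), dtype=torch.int64).random_().item()
    # Each worker w then sees get_worker_info().seed == base_seed + w.
    print(rank, [base_seed + w for w in range(4)])  # disjoint ranges per rank
```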
Oh, I see! Thank you for the response. I got the same seed among nodes because I set `torch.manual_seed` based only on the `seed`. I should have been more careful when migrating between training frameworks.
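For anyone else hitting this, a small sketch of the failure mode (illustrative only, not the actual training code):

```python
import torch

# Simulate two "nodes" that both seed torch with the bare seed:
for node in range(2):
    torch.manual_seed(42)  # same RNG state on every node -- the bug
    # Each node's DataLoader then draws the same base seed ...
    base_seed = torch.empty((), dtype=torch.int64).random_().item()
    # ... so worker w on node 0 and worker w on node 1 collide.
    print(node, [base_seed + w for w in range(4)])  # identical on both nodes
```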
When using resampling, the code uses `ResampledShards2` with deterministic behavior: https://github.com/mlfoundations/open_clip/blob/91923dfc376afb9d44577a0c9bd0930389349438/src/training/data.py#L313-L319

There, the `pytorch_worker_seed()` function relies on `get_worker_info()` from `torch.utils.data`: https://github.com/mlfoundations/open_clip/blob/91923dfc376afb9d44577a0c9bd0930389349438/src/training/data.py#L222-L233

However, this function only returns information about workers of the current dataloader, so workers with the same worker id on different nodes end up with the same `seed`.
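A simplified sketch of that helper's core logic, to show why the worker id alone can't disambiguate nodes (the real function in the linked lines does more, e.g. handling the no-worker case differently):

```python
from torch.utils.data import get_worker_info

def pytorch_worker_seed():
    # Inside a DataLoader worker, get_worker_info().seed is
    # base_seed + worker_id. worker_id is local to this dataloader,
    # so if two nodes share a base seed, their same-id workers
    # return identical values here.
    worker_info = get_worker_info()
    if worker_info is not None:
        return worker_info.seed
    # Outside a worker process there is no worker info (placeholder
    # fallback; the linked code handles this case its own way).
    return 0
```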