Closed: LuisBlanche closed this issue 5 months ago
@LuisBlanche if you use horovod you can only use webdataset; otherwise you have to write your own DistributedSampler. We never found much benefit to using horovod over distributed torch training...
hmm, if code is added to check `args.horovod` in the csv dataset create fn, you could get the world details (rank & world_size) from hvd and then explicitly pass them to DistributedSampler so that it doesn't try to look them up from torch.dist
class DistributedSampler(Sampler[T_co]):
    r"""Sampler that restricts data loading to a subset of the dataset.

    It is especially useful in conjunction with
    :class:`torch.nn.parallel.DistributedDataParallel`. In such a case, each
    process can pass a :class:`~torch.utils.data.DistributedSampler` instance as a
    :class:`~torch.utils.data.DataLoader` sampler, and load a subset of the
    original dataset that is exclusive to it.

    .. note::
        Dataset is assumed to be of constant size and that any instance of it always
        returns the same elements in the same order.

    Args:
        dataset: Dataset used for sampling.
        num_replicas (int, optional): Number of processes participating in
            distributed training. By default, :attr:`world_size` is retrieved from the
            current distributed group.
        rank (int, optional): Rank of the current process within :attr:`num_replicas`.
            By default, :attr:`rank` is retrieved from the current distributed
            group.
        shuffle (bool, optional): If ``True`` (default), sampler will shuffle the
            indices.
        seed (int, optional): random seed used to shuffle the sampler if
            :attr:`shuffle=True`. This number should be identical across all
            processes in the distributed group. Default: ``0``.
        drop_last (bool, optional): if ``True``, then the sampler will drop the
            tail of the data to make it evenly divisible across the number of
            replicas. If ``False``, the sampler will add extra indices to make
            the data evenly divisible across the replicas. Default: ``False``.

    .. warning::
        In distributed mode, calling the :meth:`set_epoch` method at
        the beginning of each epoch **before** creating the :class:`DataLoader` iterator
        is necessary to make shuffling work properly across multiple epochs. Otherwise,
        the same ordering will be always used.

    Example::

        >>> # xdoctest: +SKIP
        >>> sampler = DistributedSampler(dataset) if is_distributed else None
        >>> loader = DataLoader(dataset, shuffle=(sampler is None),
        ...                     sampler=sampler)
        >>> for epoch in range(start_epoch, n_epochs):
        ...     if is_distributed:
        ...         sampler.set_epoch(epoch)
        ...     train(loader)
    """

    def __init__(self, dataset: Dataset, num_replicas: Optional[int] = None,
                 rank: Optional[int] = None, shuffle: bool = True,
                 seed: int = 0, drop_last: bool = False) -> None:
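As the signature above shows, `num_replicas` and `rank` accept explicit values, so under horovod they could be supplied from `hvd.size()` / `hvd.rank()` instead of being looked up from torch.distributed. To make the effect concrete, here is a minimal pure-Python sketch of the index partitioning DistributedSampler performs (an approximation: the real class shuffles with `torch.randperm` seeded by `seed + epoch`; `num_replicas` and `rank` here stand in for the values hvd would provide):

```python
import random

def partition_indices(dataset_len, num_replicas, rank,
                      seed=0, epoch=0, shuffle=True, drop_last=False):
    """Approximate DistributedSampler's partitioning with an explicit
    rank/num_replicas, so nothing is queried from torch.distributed."""
    indices = list(range(dataset_len))
    if shuffle:
        # every rank uses the same seed + epoch, so the shuffles agree
        random.Random(seed + epoch).shuffle(indices)
    if drop_last:
        # trim the tail so the data divides evenly across replicas
        per_rank = dataset_len // num_replicas
        indices = indices[: per_rank * num_replicas]
    else:
        # pad by wrapping around so the data divides evenly
        per_rank = -(-dataset_len // num_replicas)  # ceil division
        total = per_rank * num_replicas
        indices += indices[: total - len(indices)]
    # each rank takes a strided slice, as DistributedSampler does
    return indices[rank::num_replicas]
```

With explicit values in hand, the csv-dataset path could then build something like `DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())` (hvd being `horovod.torch`; untested here, just the shape of the suggested fix).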
Thanks for the quick reply! If I work with a multi-GPU machine (only one node) for instance, will the code automatically use torch distributed training? If I understand correctly, on a non-SLURM cluster for instance, we would need to have a "WORLD_SIZE" environment variable preset?
https://github.com/mlfoundations/open_clip?tab=readme-ov-file#single-node if you use either the single-node or multi-node (both are multi-GPU) torchrun commands in the README, all those vars are set and the torch.dist functions work
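For reference, torchrun exports the rendezvous variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, plus `MASTER_ADDR`/`MASTER_PORT`) for every worker process. A simplified sketch of reading them with single-process fallbacks (open_clip's distributed.py has a helper in this spirit; this version is an illustration, not the actual code):

```python
import os

def world_info_from_env(environ=os.environ):
    """Read the variables torchrun sets for each worker, falling back
    to single-process defaults when they are absent."""
    rank = int(environ.get("RANK", 0))
    local_rank = int(environ.get("LOCAL_RANK", 0))
    world_size = int(environ.get("WORLD_SIZE", 1))
    return rank, local_rank, world_size
```

Launched without torchrun, this yields `(0, 0, 1)`, which is why a plain `python train.py` run behaves as single-process.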
Context
Trying to finetune pretrained CLIP models on Databricks using horovod.
Using the following parameters:
PROBLEM
I get the following runtime error when the code tries to instantiate the DistributedSampler:
I can see that indeed, when using horovod, the init_process_group method is never called, cf. https://github.com/mlfoundations/open_clip/blob/3ff1faf10b60be27252be7f6c84ce7c8c5e14ec8/src/training/distributed.py#L63-L114
We can see that it is only called when horovod is not used. I don't know horovod, nor distributed training, well enough to know if this is the expected behavior!