Open: segalinc opened this issue 1 year ago
Hi, thanks for this contribution. As a small exercise I am training SD2 on the pokemon dataset. I precomputed the latents and it starts training on one GPU; however, at evaluation time I get an error. I think this is related to the FID metric, since everything works if I remove it.

When I try to train on a multi-GPU machine (setting fsdp back to true, uncommenting the last two lines of the config, and adjusting the batch size accordingly) I get this error:

ValueError: The world_size(2) > 1 but dataloader does not use DistributedSampler. This will cause all ranks to train on the same data, removing any benefit from multi-GPU training. To resolve this, create a Dataloader with DistributedSampler. For example, DataLoader(..., sampler=composer.utils.dist.get_sampler(...)). Alternatively, the process group can be instantiated with composer.utils.dist.instantiate_dist(...) and DistributedSampler can directly be created with DataLoader(..., sampler=DistributedSampler(...)). For more information, see https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler.

I don't see a DistributedSampler for the LAION or COCO dataloader functions.
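Based on the error message, I would guess the eval dataloaders need something like the sketch below, but I'm not sure where this belongs in the code. The `build_eval_dataloader` and `coco_dataset` names are just placeholders, not the repo's actual builder functions:

```python
# Rough sketch of what the error message seems to suggest -- placeholder names,
# not the repo's actual COCO/LAION builder code.
from torch.utils.data import DataLoader
from composer.utils import dist

def build_eval_dataloader(coco_dataset, batch_size):
    # Give each rank its own shard of the eval set instead of a full copy
    sampler = dist.get_sampler(coco_dataset)
    return DataLoader(coco_dataset, batch_size=batch_size, sampler=sampler)
```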
This is my configuration:
```