Closed: Jason3900 closed this 2 months ago
@Jason3900 yes you are right, thank you
You're welcome. Thanks for your excellent project!
@Jason3900 no problem, go train something amazing with it
Is this also relevant for the all_reduce calls in vector_quantize_pytorch.py, such as this one?
@hummat I don't think so, because those are only used during the EMA or k-means updates, neither of which needs gradients
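For illustration, a minimal sketch of that kind of no-grad usage (the function name and arguments here are made up, not the repo's actual code):

```python
import torch
import torch.distributed as dist

@torch.no_grad()
def ema_update_counts(ema_counts: torch.Tensor, batch_counts: torch.Tensor, decay: float = 0.99):
    # EMA / k-means statistics are plain buffers updated outside autograd,
    # so the non-differentiable dist.all_reduce is fine for these call sites
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(batch_counts)  # in-place sum across ranks
    ema_counts.mul_(decay).add_(batch_counts, alpha=1.0 - decay)
```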
https://github.com/lucidrains/vector-quantize-pytorch/blob/e99128dd6d780cc7b97cc5c4f37a99b05a834c57/vector_quantize_pytorch/lookup_free_quantization.py#L13
maybe_distributed_mean may produce incorrect results: torch.distributed.all_reduce does not flow gradients back to each GPU when using the c10d communication backend (related discussion). The proposed fix is to import the autograd-aware ops via from torch.distributed import nn as dist_nn and use the dist_nn.all_reduce operator, which retains the gradient. Otherwise a warning is shown and the result is incorrect (in my experiments with LFQ, the codebook entropy loss becomes abnormal).