Hi, authors.
Sorry to bother you. I tried your code and noticed that the one-hot vector from Gumbel-Softmax is generated with `some-linear --> softmax --> F.gumbel_softmax`. However, in the DynamicViT implementation, the one-hot vector is generated with `some-linear --> log-softmax --> F.gumbel_softmax`. Is there a difference between the two, and could it affect performance? Thx.
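
To make the comparison concrete, here is a minimal plain-Python sketch of what I think each pipeline samples from. It is not code from either repository and does not call `F.gumbel_softmax` itself. It assumes, as I understand the PyTorch docs, that `F.gumbel_softmax` treats its input as unnormalized log-probabilities, so a Gumbel-max draw picks class `i` with probability `softmax(input)[i]`. The `linear_out` values are hypothetical.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(xs):
    # Numerically stable log-softmax over a list of floats.
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]


# Hypothetical output of "some-linear" for one token (two classes).
linear_out = [2.0, 0.0]

# Assumption: sampling with input z selects class i with probability
# softmax(z)[i], so the implied distribution is softmax of whatever is passed.
implied_from_log_softmax = softmax(log_softmax(linear_out))
implied_from_softmax = softmax(softmax(linear_out))

print("softmax(linear_out):     ", softmax(linear_out))
print("via log-softmax pipeline:", implied_from_log_softmax)
print("via softmax pipeline:    ", implied_from_softmax)
```

If that assumption holds, the log-softmax pipeline samples from the same distribution as `softmax(linear_out)`, while the softmax pipeline samples from a flatter one, because probabilities in [0, 1] are being reused as logits. Is that flattening intended in your code?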