In our case, `K` denotes the number of samples drawn from the Softmax distribution. In the limit regime (where the temperature goes to zero), the samples become discrete, so each node obtains K sampled neighbor nodes. In the ordinary regime (where the temperature takes a moderate value), each node obtains K random attention vectors, which can be viewed as K different heads (generalizing the deterministic attention to a stochastic one).
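To make this concrete, here is a minimal sketch in plain PyTorch (not the repo's kernelized implementation; the function name and shapes are assumptions for illustration) of drawing K Gumbel-Softmax samples, showing how the temperature moves each sample between a near one-hot neighbor pick and a soft random attention vector:

```python
import torch

def gumbel_softmax_samples(logits, K=4, tau=0.5):
    """Draw K relaxed samples from the categorical defined by `logits`.

    Toy illustration only: as tau -> 0 each sample approaches a one-hot
    vector (a single sampled neighbor); at a moderate tau each sample is
    a dense random attention vector, acting like one stochastic head.
    """
    # Gumbel(0, 1) noise: one draw per sample k and per candidate node
    gumbels = -torch.log(-torch.log(torch.rand(K, *logits.shape)))
    return torch.softmax((logits + gumbels) / tau, dim=-1)

# Example: attention logits of one query node over 5 candidate nodes
logits = torch.tensor([2.0, 1.0, 0.5, 0.0, -1.0])
print(gumbel_softmax_samples(logits, K=3, tau=0.1))  # near one-hot rows
print(gumbel_softmax_samples(logits, K=3, tau=1.0))  # soft random rows
```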
Thank you for your quick response! I appreciate your help. I have a follow-up question regarding the use of different functions across the training and inference phases. Could you explain why the `kernelized_gumbel_softmax` function is used during training, while the `kernelized_softmax` function is preferred during testing? I’m curious to understand the reasoning behind this choice. Thanks again for your assistance!
I noticed that the paper recommends using the Gumbel-Max trick to avoid the over-normalization problem. Will using softmax at test time still present this issue?
The over-normalizing issue mainly impacts model training, due to potential gradient vanishing. For inference at test time, we use the standard softmax to avoid randomness in the output; otherwise, the test results could differ across runs.
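For illustration, here is a hypothetical sketch of that train/test switch (the class name `StochasticAttention` and the averaging over the K samples are my assumptions, not the repo's API), using `self.training` to pick between the stochastic and deterministic paths:

```python
import torch
import torch.nn.functional as F

class StochasticAttention(torch.nn.Module):
    """Hypothetical sketch of the train/test switch described above."""

    def __init__(self, K=4, tau=0.5):
        super().__init__()
        self.K, self.tau = K, tau

    def forward(self, logits):
        if self.training:
            # Training: Gumbel noise keeps the attention stochastic and
            # sidesteps over-normalization (and gradient vanishing).
            gumbels = -torch.log(-torch.log(torch.rand(self.K, *logits.shape)))
            return F.softmax((logits + gumbels) / self.tau, dim=-1).mean(0)
        # Inference: plain softmax, so repeated runs give identical output.
        return F.softmax(logits, dim=-1)

attn = StochasticAttention()
attn.eval()                  # deterministic at test time
out = attn(torch.randn(8))   # same logits -> same output on every call
```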
Got it! Thank you so much for your patience. Your help is invaluable, and I truly appreciate the time and effort you put into assisting me.