In our case, `K` denotes the number of samples drawn from the Softmax distribution. In the limit regime (where the temperature goes to zero), the samples become discrete, so each node obtains K sampled neighbor nodes. In the ordinary regime (where the temperature takes a moderate value), each node obtains K random attention vectors, which can be viewed as K different heads (generalizing the deterministic attention to a stochastic one).
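To make this concrete, here is a minimal sketch in plain PyTorch (not the repo's kernelized implementation; the function name and shapes are assumptions for illustration) of drawing K Gumbel-Softmax samples, showing how the temperature moves each sample between a near one-hot neighbor pick and a soft random attention vector:

```python
import torch

def gumbel_softmax_samples(logits, K=4, tau=0.5):
    """Draw K relaxed samples from the categorical defined by `logits`.

    Toy illustration only: as tau -> 0 each sample approaches a one-hot
    vector (a single sampled neighbor); at a moderate tau each sample is
    a dense random attention vector, acting like one stochastic head.
    """
    # Gumbel(0, 1) noise: one draw per sample k and per candidate node
    gumbels = -torch.log(-torch.log(torch.rand(K, *logits.shape)))
    return torch.softmax((logits + gumbels) / tau, dim=-1)

# Example: attention logits of one query node over 5 candidate nodes
logits = torch.tensor([2.0, 1.0, 0.5, 0.0, -1.0])
print(gumbel_softmax_samples(logits, K=3, tau=0.1))  # near one-hot rows
print(gumbel_softmax_samples(logits, K=3, tau=1.0))  # soft random rows
```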
Thank you for your quick response! I appreciate your help. I have a follow-up question regarding the use of different functions across the training and inference phases. Could you explain why the `kernelized_gumbel_softmax` function is used during training, while the `kernelized_softmax` function is preferred during testing? I’m curious to understand the reasoning behind this choice. Thanks again for your assistance!
I noticed that the paper recommends using the Gumbel-Max trick to avoid the over-normalization problem. Will using softmax at test time still present this issue?
The over-normalizing issue mainly impacts model training, due to potential gradient vanishing. For inference at test time, we use the standard softmax to avoid randomness in the output; otherwise, the test results could differ across runs.
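For illustration, here is a hypothetical sketch of that train/test switch (the class name `StochasticAttention` and the averaging over the K samples are my assumptions, not the repo's API), using `self.training` to pick between the stochastic and deterministic paths:

```python
import torch
import torch.nn.functional as F

class StochasticAttention(torch.nn.Module):
    """Hypothetical sketch of the train/test switch described above."""

    def __init__(self, K=4, tau=0.5):
        super().__init__()
        self.K, self.tau = K, tau

    def forward(self, logits):
        if self.training:
            # Training: Gumbel noise keeps the attention stochastic and
            # sidesteps over-normalization (and gradient vanishing).
            gumbels = -torch.log(-torch.log(torch.rand(self.K, *logits.shape)))
            return F.softmax((logits + gumbels) / self.tau, dim=-1).mean(0)
        # Inference: plain softmax, so repeated runs give identical output.
        return F.softmax(logits, dim=-1)

attn = StochasticAttention()
attn.eval()                  # deterministic at test time
out = attn(torch.randn(8))   # same logits -> same output on every call
```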
Got it! Thank you so much for your patience. Your help is invaluable, and I truly appreciate the time and effort you put into assisting me.