Hi, thanks for your excellent work. In your code, when training the offset head, you use the action sampled from the code prediction head rather than the GT code. Is that a practical choice? I found that in BeT, they use the GT class when training the offset head. I'm not sure whether this becomes a problem when two codes have similar probability.
Thank you for your interest in our paper.
As you noted, we use the sampled code instead of the GT code when training the offset head. This is a practical choice for the case where the number of codes (i.e., the size of the VQ dictionary) grows large.

Unlike the original BeT, VQ-BeT uses code combinations ranging from as few as 100 to 60,000 or more. With that many codes, it is unrealistic to expect the code prediction head to predict every code accurately. Training on predicted codes rather than GT codes shifts some of this burden onto the offset head, but in exchange it relieves the code prediction head of the pressure to be 100% accurate: the offset head learns to compensate for imperfect code predictions.

That said, if the codebook is not very large, modifying the algorithm as you suggest should not make much difference in performance; a sketch of both variants follows below.
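For concreteness, here is a minimal sketch of the difference in PyTorch-style pseudocode. The names (`code_head`, `offset_head`, `decode_code`, `use_gt_code`) are illustrative placeholders, not the actual VQ-BeT API:

```python
import torch
import torch.nn.functional as F

def train_step(features, gt_code, gt_action,
               code_head, offset_head, decode_code, use_gt_code=False):
    # Code prediction head: classify which VQ code the demo action maps to.
    logits = code_head(features)                        # (B, num_codes)
    code_loss = F.cross_entropy(logits, gt_code)

    if use_gt_code:
        # BeT-style: condition the offset target on the ground-truth code.
        code = gt_code
    else:
        # VQ-BeT-style: sample from the predicted distribution, so the offset
        # head also learns to correct for imperfect code predictions.
        code = torch.distributions.Categorical(logits=logits).sample()

    coarse_action = decode_code(code)                   # (B, action_dim)
    pred_offset = offset_head(features, code)           # (B, action_dim)
    # The offset head regresses the residual between the demonstrated action
    # and the coarse action decoded from the chosen code.
    offset_loss = F.mse_loss(pred_offset, gt_action - coarse_action)
    return code_loss + offset_loss
```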
Thanks.
Thanks for your reply. What if there are only two modes in the dataset with the same probability, so the predicted codes are sampled equally often? Will the offset head learn the average of the two modes, or collapse to one mode?
As you said, using the predicted code can lead to mode collapse if there are two modes with exactly the same probability in exactly the same state.

However, this does not happen in most practical settings, and we have verified that VQ-BeT can generate different trajectories from the exact same state.
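To make this concrete, here is a toy check (illustrative only, not VQ-BeT code) showing that sampling the code, rather than taking an argmax, keeps both modes alive at rollout time:

```python
import torch

# Two codes with exactly equal probability at the same state.
logits = torch.tensor([0.0, 0.0])
dist = torch.distributions.Categorical(logits=logits)
samples = dist.sample((1000,))
# Roughly half the rollouts pick each code, so both behavior modes survive.
print(samples.float().mean())  # ~0.5
```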
Since this option was already implemented in our in-house code, we thought it would be better to expose both options (use GT code / use predicted code) to users. We'll update the released code to include the GT code option soon.
Thank you.
Got it! Thank you for the explanation!