Hi, thank you very much for your article. I have a question about some of its details.
GKT uses the BEV (HWD) reference points to gather vision tokens from the image, forming a VSCKK feature. Is the Decoder's memory then made up of HWD tokens, each composed of the VSCKK features?