Hello,
Thank you for releasing the code for your paper. It is fascinating work. I have one question specific to the implementation.
When the [RET] token is added, the embedding layer is updated along with the final classification layer. Specifically, the output dimension of the FC layer is updated to 32001. However, you freeze all the layers in the LLM. How does this work during training, given the next-token prediction objective?
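For reference, here is how I currently understand the setup, as a minimal PyTorch sketch (the sizes, variable names, and the gradient-masking hook are my own illustration, not taken from your code): the vocabulary is grown by one for [RET], everything is frozen except the embedding and FC weights, and a hook zeroes the gradient of every row except the new token's, so only the [RET] parameters actually move.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the LLM's input embedding and final FC layer.
# (Hypothetical sizes; the real model grows the vocab 32000 -> 32001.)
old_vocab, hidden = 8, 4
emb = nn.Embedding(old_vocab + 1, hidden)            # resized to include [RET]
fc = nn.Linear(hidden, old_vocab + 1, bias=False)    # output dim grown likewise

ret_id = old_vocab  # index of the new [RET] token

# Zero the gradient for every row except the [RET] row, so only the new
# token's embedding / classifier weights receive updates.
def mask_all_but_ret(grad):
    mask = torch.zeros_like(grad)
    mask[ret_id] = 1.0
    return grad * mask

emb.weight.register_hook(mask_all_but_ret)
fc.weight.register_hook(mask_all_but_ret)  # fc.weight has shape (vocab, hidden)

opt = torch.optim.SGD(list(emb.parameters()) + list(fc.parameters()), lr=0.1)

# One next-token-prediction step over a few toy positions.
ids = torch.tensor([0, 1, ret_id])
targets = torch.tensor([1, ret_id, 2])
before = emb.weight.detach().clone()

loss = nn.functional.cross_entropy(fc(emb(ids)), targets)
loss.backward()
opt.step()

changed = (emb.weight.detach() - before).abs().sum(dim=1) > 0
print(changed.tolist())  # only the [RET] row should have changed
```

Is this roughly what happens in your implementation, i.e. the rest of the vocabulary rows stay effectively frozen even though the embedding and FC modules are trainable?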