I compute the entropy loss on the logits of the first predicted next token. The original AdaMerging paper experimented exclusively with CLIP-ViT models on image classification tasks; for Flan-T5 models, I treat it as a token classification task.
An implementation can be found here, which I will soon integrate into this codebase: https://github.com/tanganke/subspace_fusion/blob/main/scripts/flan_t5_layer_wise_adamerging.py#L81
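For reference, here is a minimal sketch of that loss under the assumptions above: the decoder logits have shape `[batch, seq_len, vocab_size]` (as described in the question), and only the first position along `seq_len` is used. The function name `first_token_entropy_loss` is hypothetical; the linked script is the authoritative implementation.

```python
import torch
import torch.nn.functional as F

def first_token_entropy_loss(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the predictive distribution over the first
    next token, averaged over the batch.

    Args:
        logits: decoder output logits of shape [batch, seq_len, vocab_size].
    """
    first_logits = logits[:, 0, :]                    # [batch, vocab_size]
    log_probs = F.log_softmax(first_logits, dim=-1)   # log p(v | x)
    probs = log_probs.exp()
    # H(p) = -sum_v p(v) * log p(v), computed per example
    entropy = -(probs * log_probs).sum(dim=-1)        # [batch]
    return entropy.mean()
```

In the AdaMerging setting, a loss like this would be minimized on unlabeled test data to adapt the layer-wise merging coefficients, while the merged model's weights themselves stay fixed.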
This issue seems to be resolved. I am closing it. If you have any other questions, please feel free to reopen it.
Hello. I really appreciate you sharing this insightful research paper along with the code. I have a question regarding the application of AdaMerging to Flan-T5 (Table 13). Could you explain how you computed the loss in this case? Did you simply apply self-entropy to the logits of shape [batch, seq_len, vocab_size] that come from the model's output? Thank you.