@gblackout ah, well i know in the text encoder for CLIP, OpenAI used to take the [eos] token embedding as the pooled value, but then later switched to averaging all the embeddings. there is a similar story in the vision transformer literature, where researchers used to rely on the CLS token, but then found that global average pooling learns faster and better
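for concreteness, a minimal sketch contrasting the two pooling strategies (toy shapes and variable names are mine, not from any of the codebases discussed):

```python
import torch

# toy batch: (batch, seq_len, dim); mask marks real (non-padded) tokens
embeds = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]]).bool()

# single-token pooling ([eos] / CLS style): take one designated position
# (only valid as-is when sequences are unpadded or left-padded)
last_pooled = embeds[:, -1]

# global average pooling restricted to the non-padded positions
summed = (embeds * mask.unsqueeze(-1)).sum(dim=1)
mean_pooled = summed / mask.sum(dim=1, keepdim=True)
```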
i honestly think this doesn't matter
i'm not familiar with the code of TRL and TRLX; do they append some pooling token at the end and then excise it later? how would indexing with -1 account for variable sequence lengths?
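as a sketch of the concern: with right padding, indexing with -1 lands on pad embeddings, so an implementation would need to gather each sequence's true last position (hypothetical example, not TRL's actual code):

```python
import torch

# toy right-padded batch; mask is 1 for real tokens, 0 for padding
embeds = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]])

last_idx = mask.sum(dim=1) - 1                                # (batch,) index of last real token
last_embeds = embeds[torch.arange(embeds.size(0)), last_idx]  # (batch, dim)
```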
@gblackout obviously, i'll defer to your experiments. this is a highly empirical field, and if you show me results that differ from my intuition, i'm willing to change my mind
Thanks for the info. That makes a lot of sense. I haven't run TRL, so I don't know how they handle variable lengths. I was digging into their code and found something like:
```python
last_hidden_state = base_model_output.decoder_hidden_states[-1]  # (batch, seq_len, hidden)
lm_logits = base_model_output.logits
loss = base_model_output.loss
value = self.v_head(last_hidden_state).squeeze(-1)  # (batch, seq_len): one value per token
```
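If I read that right, TRL's value head emits one scalar per position rather than a single pooled value, which sidesteps the variable-length question at this stage. A toy sketch of the shapes (my assumption, not TRL's actual classes):

```python
import torch
from torch import nn

batch, seq_len, hidden = 2, 5, 8                 # toy shapes for illustration
last_hidden_state = torch.randn(batch, seq_len, hidden)
v_head = nn.Linear(hidden, 1)                    # stand-in for TRL's v_head

value = v_head(last_hidden_state).squeeze(-1)
print(value.shape)                               # torch.Size([2, 5]) -> one value per token
```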
In any case, thanks for the suggestion. I'll try both and let you know if I find something interesting.
Best
Hi there,
In your `ActorCritic.forward()` I found that you do `critic_embeds = masked_mean(critic_embeds, mask, dim = 1)` and then feed the pooled `critic_embeds` to the value head. I suppose this means you average over all the action embeddings and estimate the value from that average. May I ask if there is a specific reason for this? Other implementations I found, for example TRL and TRLX, just feed the very last embedding (i.e., `critic_embeds[:, -1, :]`) to the value head, which seems more intuitive to me.

Best
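For reference, a minimal `masked_mean` consistent with that call might look like this (my reconstruction for illustration, not necessarily the repo's exact code):

```python
import torch

def masked_mean(t, mask, dim=1):
    # t: (batch, seq_len, dim), mask: (batch, seq_len) bool
    # zero out padded positions, then average over the remaining ones
    t = t.masked_fill(~mask.unsqueeze(-1), 0.)
    return t.sum(dim=dim) / mask.sum(dim=dim, keepdim=True).clamp(min=1)
```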