Is actions_prob calculation correct?

juncongmoo / chatllama

ChatLLaMA 📢 Open source implementation for LLaMA-based ChatGPT runnable in a single GPU. 15x faster training process than ChatGPT


Is actions_prob calculation correct? #7

Open sandwriter opened 1 year ago

sandwriter commented 1 year ago

The following code takes the softmax probability of the highest-scoring (argmax) action and then compares it against the log prob of the action from the previous actor model iteration. Should we instead pick the probability of the action stored in old_actions rather than just taking the max, so that we are comparing the probability of the same action across the two iterations?

                # get action log prob
                actions_prob = (
                    torch.softmax(actions_logits, dim=-1).max(dim=-1).values
                )
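
For comparison, here is a minimal sketch of the alternative being asked about: look up the log-probability of the action that was actually taken, instead of the max. It assumes `actions_logits` has shape `(batch, seq_len, vocab_size)` and `old_actions` holds the previously sampled token ids with shape `(batch, seq_len)`. The helper name `selected_action_log_probs`, the toy shapes, and the `old_actions_log_prob` values are hypothetical and not taken from the repository.

```python
import torch


def selected_action_log_probs(actions_logits, old_actions):
    """Log-probability of each previously taken action under the current logits.

    actions_logits: float tensor, assumed shape (batch, seq_len, vocab_size)
    old_actions:    long tensor of token ids, assumed shape (batch, seq_len)
    """
    # log_softmax is numerically safer than log(softmax(...))
    log_probs = torch.log_softmax(actions_logits, dim=-1)
    # pick the entry for the action that was actually taken at each position
    return log_probs.gather(-1, old_actions.unsqueeze(-1)).squeeze(-1)


# Toy data, purely illustrative
torch.manual_seed(0)
actions_logits = torch.randn(2, 3, 5)
old_actions = torch.tensor([[0, 4, 2], [1, 1, 3]])

actions_log_prob = selected_action_log_probs(actions_logits, old_actions)

# Hypothetical stand-in for the log probs stored from the previous iteration
old_actions_log_prob = actions_log_prob.detach() - 0.1

# Probability ratio for the same action under the two policies
ratio = torch.exp(actions_log_prob - old_actions_log_prob)
```

With this lookup, both terms of the ratio refer to the same action, whereas `.max(dim=-1).values` returns the probability of whichever action currently scores highest, which may differ from the one in `old_actions`.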