mingkaid / rl-prompt

Accompanying repo for the RLPrompt paper

Clarification on the RL problem #28

Closed hv68 closed 1 year ago

hv68 commented 1 year ago

Hey! Thanks for the previous clarification. I'm trying to implement this with regular Q-learning (I know the paper suggests soft Q-learning; I want to compare both approaches for a paper), but even after 1k steps the accuracy on a classification task shows no sign of learning, so I wanted to confirm whether my approach is right.

I've understood the problem as:

1. Initialize a single starting token for the prompt.
2. Generate a prompt Z_1 of length T.
3. Compute the reward for Z_1.
4. Using the last token generated for Z_1, generate a new prompt Z with the target network.
5. Backprop on the Q_1 values found for Z_1, using the reward and the target network's Q values for Z, with MSE between Q_1 and reward + discount * Q as the loss (sketched below). The reward is a scalar and broadcasts over the discounted Q values.
6. Using the last token in Z_1, generate Z_2 and repeat steps 3-6.
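For concreteness, here is a minimal sketch of the loss described in steps 4-5, assuming `q_values_z1` holds the online network's Q values for the T chosen tokens of Z_1 and `target_q_z` holds the target network's Q values for the tokens of Z. All names and shapes here are placeholders, not RLPrompt's actual API:

```python
import torch
import torch.nn.functional as F

gamma = 0.99  # discount factor (illustrative value)

def td_loss(q_values_z1: torch.Tensor,  # (T,) Q_1: online-network Q values for the tokens of Z_1
            target_q_z: torch.Tensor,   # (T,) target-network Q values for the tokens of Z
            reward: float) -> torch.Tensor:
    """MSE between Q_1 and reward + gamma * Q_target, with the scalar
    reward broadcast across all T positions, as in steps 4-5 above."""
    # Detach so no gradient flows through the target network's values
    target = reward + gamma * target_q_z.detach()
    return F.mse_loss(q_values_z1, target)
```

The backward pass then updates only the online Q network; the target network would be synced from it periodically (e.g. every fixed number of steps).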

Does this sound right? https://colab.research.google.com/drive/1fs9lILaBEqJs9ieF2lnH8fgwZSidJcw0?authuser=1#scrollTo=6mRz6CFFbVzV is the first version of the code I've implemented. Even with a decaying epsilon, the prompts tend to converge to a single word repeated T times, and the maximum accuracy I get is 0.675. I'm using DistilBERT as my masked language model instead of RoBERTa due to memory and hardware constraints. Any help would be greatly appreciated. Thank you!

mingkaid commented 1 year ago

Hi, sorry for the delayed response. I hope you were able to figure it out.

For the regular Q-learning, did you implement any exploration mechanism? From how it sounds, you were choosing the token with the maximum Q value every time. Typically when people do Q-learning, they implement an "epsilon-greedy" exploration mechanism: at each step there is an epsilon probability of choosing a random action instead of acting according to the Q values. In our case, the action is the token.
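A minimal sketch of epsilon-greedy token selection for this setting, assuming `q_values` is a 1-D tensor of shape (vocab_size,) from the Q network at the current step (names are illustrative, not from the RLPrompt codebase):

```python
import random
import torch

def select_token(q_values: torch.Tensor, epsilon: float) -> int:
    """Epsilon-greedy choice over the vocabulary: with probability epsilon
    pick a uniformly random token (explore), otherwise take the argmax
    of the Q values (exploit)."""
    if random.random() < epsilon:
        return random.randrange(q_values.size(0))
    return int(q_values.argmax().item())
```

A common companion is a schedule that decays epsilon from a high value toward a small floor over training, which it sounds like you already have.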

I am closing this issue now because it's a clarification question.