harshraj22 closed this issue 2 years ago
REINFORCE without a baseline doesn't work yet. As per banditsComparision.pdf, if the reward r(t) is zero and no baseline is used, with every arm starting from the same preference, the policy does not change at all.
The arm is chosen by sampling from a softmax over the preferences, not by taking the argmax.
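To make the stall concrete, here is a minimal sketch assuming the textbook gradient-bandit form of the update, H(a) += alpha * (r - baseline) * (1[a == chosen] - pi(a)). The names `softmax`, `reinforce_step`, `alpha`, and `baseline` are made up for this illustration and are not taken from the repo, so the actual code may differ. With r(t) = 0 and baseline = 0 the factor (r - baseline) is zero, so every preference stays where it started and the softmax policy stays uniform.

```python
import math
import random


def softmax(prefs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(prefs, reward, rng, alpha=0.1, baseline=0.0):
    # Hypothetical helper: one gradient-bandit update on the preferences.
    pi = softmax(prefs)
    # Sample the arm from the softmax distribution (not argmax).
    arm = rng.choices(range(len(prefs)), weights=pi)[0]
    r = reward(arm)
    # The whole update is scaled by (r - baseline); if that is 0, nothing moves.
    new_prefs = [
        h + alpha * (r - baseline) * ((1.0 if a == arm else 0.0) - pi[a])
        for a, h in enumerate(prefs)
    ]
    return new_prefs, arm


rng = random.Random(0)
prefs = [0.0, 0.0, 0.0]  # same initial preference for all arms
for _ in range(100):
    prefs, _ = reinforce_step(prefs, lambda arm: 0.0, rng)  # reward is always 0
policy = softmax(prefs)
print(prefs, policy)
```

Under the same assumptions, passing a nonzero `baseline` (or any nonzero reward) makes (r - baseline) nonzero, and the preferences start to move.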