harshraj22 / rl_lab

Contains solutions to the lab assignments for the Reinforcement Learning lab

Reinforce Without Baseline #8

Closed harshraj22 closed 2 years ago

harshraj22 commented 2 years ago

REINFORCE without a baseline doesn't work yet. As per banditsComparision.pdf, if the reward r(t) is zero and no baseline is used, the preference update (which is proportional to the reward) is zero, so the policy does not change at all. With the same preference initialization for all arms, it stays uniform.
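To make the stuck-policy case concrete, here is a minimal sketch that assumes the standard gradient-bandit form of the update, H(a) += alpha * (r - baseline) * (1[a == A] - pi(a)). The names `softmax`, `reinforce_update`, `alpha` and `baseline` are illustrative and not taken from the repository, and the update in banditsComparision.pdf may differ in detail.

```python
import numpy as np

def softmax(prefs):
    # Subtract the max for numerical stability; the result is unchanged.
    exps = np.exp(prefs - np.max(prefs))
    return exps / exps.sum()

def reinforce_update(prefs, action, reward, alpha=0.1, baseline=0.0):
    # Assumed gradient-bandit update; baseline=0.0 means "no baseline".
    pi = softmax(prefs)
    one_hot = np.zeros_like(prefs)
    one_hot[action] = 1.0
    return prefs + alpha * (reward - baseline) * (one_hot - pi)

prefs = np.zeros(3)  # same initial preference for every arm

# Zero reward and no baseline: the whole update term is multiplied by 0.
unchanged = reinforce_update(prefs, action=1, reward=0.0)

# A non-zero reward does move the preferences.
changed = reinforce_update(prefs, action=1, reward=1.0)

print(unchanged)
print(changed)
```

Under this assumed update, a non-zero baseline would make the factor (r - baseline) non-zero even when r(t) is zero, which is why the variant with a baseline can still learn from zero rewards.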


harshraj22 commented 2 years ago

The arm is chosen by sampling from the softmax distribution over the preferences, not by taking the argmax of the preferences.
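A rough sketch of the difference, assuming NumPy and hypothetical helper names (`softmax`, `choose_arm`) that are not taken from the repository:

```python
import numpy as np

def softmax(prefs):
    # Subtract the max for numerical stability; the result is unchanged.
    exps = np.exp(prefs - np.max(prefs))
    return exps / exps.sum()

def choose_arm(prefs, rng):
    # Sample an arm index with probability given by softmax(prefs).
    return int(rng.choice(len(prefs), p=softmax(prefs)))

rng = np.random.default_rng(0)
prefs = np.zeros(4)  # equal preferences, so the policy is uniform

# Sampling visits every arm when the preferences are equal.
samples = [choose_arm(prefs, rng) for _ in range(1000)]

# argmax breaks ties by always returning the first index.
greedy = int(np.argmax(prefs))

print(sorted(set(samples)), greedy)
```

With equal preferences, argmax would pull the same arm every time, while sampling keeps exploring all arms.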