Absorbing states handling when updating policy

Hi, thanks for your interest in LSIQ! When you talk about samples and absorbing states, I assume you mean transitions that lead to absorbing states. These samples are generally used, but the target value for the next state is then set to 0, which is the core problem with how IQ-Learn handles absorbing states. If I remember correctly, DAC adds an indicator flag to the state that marks whether the state is absorbing, and then learns specific values for the absorbing states. LSIQ does something similar: instead of setting the value of an absorbing state to 0 (as IQ-Learn does), we explicitly learn it. The value of an absorbing state depends on whether the absorbing state is reached under the policy or under the expert (more on that in the paper).
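To make the difference concrete, here is a minimal NumPy sketch, not code from the ls-iq repo. All function names are hypothetical. It contrasts a target that masks the next-state value to 0 at absorbing transitions (the IQ-Learn behaviour described above) with one that bootstraps from an explicitly represented absorbing-state value, plus a DAC-style indicator flag appended to the state. It assumes the absorbing state only transitions to itself with a constant reward `r_abs`, so learning its value by bootstrapping converges to the fixed point `r_abs / (1 - gamma)`.

```python
import numpy as np


def add_absorbing_flag(obs: np.ndarray) -> np.ndarray:
    """DAC-style augmentation: append an indicator dim (0 = ordinary state)."""
    return np.concatenate([obs, np.zeros((obs.shape[0], 1))], axis=1)


def absorbing_state(obs_dim: int) -> np.ndarray:
    """A single absorbing state: zero features with the indicator set to 1."""
    s = np.zeros((1, obs_dim + 1))
    s[0, -1] = 1.0
    return s


def target_zero_absorbing(r, v_next, done, gamma):
    """Next-state value is masked to 0 whenever s' is absorbing (done == 1)."""
    return r + gamma * (1.0 - done) * v_next


def target_learned_absorbing(r, v_next, done, v_abs, gamma):
    """Bootstrap from an explicitly learned absorbing-state value v_abs
    instead of 0 when s' is absorbing."""
    return r + gamma * np.where(done > 0.5, v_abs, v_next)


def fit_absorbing_value(r_abs, gamma, lr=0.5, steps=5000):
    """Learn V(s_A) by repeatedly regressing onto r_abs + gamma * V(s_A).

    Assumes s_A self-loops with constant reward r_abs, so the update
    converges to the fixed point r_abs / (1 - gamma)."""
    v = 0.0
    for _ in range(steps):
        v += lr * ((r_abs + gamma * v) - v)
    return v
```

For example, with `gamma = 0.99` and `r_abs = 1.0`, `fit_absorbing_value` approaches 100, whereas the masked target treats the same absorbing state as if it were worth 0. Which constant reward applies (expert vs. policy) is exactly the design question the paper discusses; here it is left as a free parameter.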

For LSIQ, you can use either the iq_like_loss or the sqil_like_loss, although we found that the sqil_like_loss works better with absorbing states. I will close this issue since it is not really related to the code but rather to the theory of the paper. If you have more questions, feel free to send me an email (fi.alhafez@gmail.com). I would be happy to help!
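For intuition about the SQIL-style option, the following is a rough sketch of the original SQIL recipe (fixed reward 1 for expert transitions, 0 for policy transitions, squared soft Bellman error over discrete actions). It is NOT the repo's `sqil_like_loss`; the name `sqil_style_loss` and its parameters are hypothetical. As an added assumption, an absorbing next state is bootstrapped with the closed-form value `r / (1 - gamma)` of the transition's source reward, which illustrates how the absorbing value can differ between expert and policy samples.

```python
import numpy as np


def soft_value(q, alpha):
    """Soft state value V(s) = alpha * log sum_a exp(Q(s, a) / alpha),
    computed with the max-subtraction trick for numerical stability."""
    z = q / alpha
    m = z.max(axis=1, keepdims=True)
    return alpha * (m[:, 0] + np.log(np.exp(z - m).sum(axis=1)))


def sqil_style_loss(q_sa, q_next, done, is_expert, gamma=0.99, alpha=1.0,
                    r_expert=1.0, r_policy=0.0):
    """Mean squared soft Bellman error with SQIL-style fixed rewards.

    q_sa:      (B,)   Q(s, a) for the sampled actions
    q_next:    (B, A) Q(s', .) for all actions in the next state
    done:      (B,)   1.0 if s' is absorbing
    is_expert: (B,)   True for expert transitions
    """
    r = np.where(is_expert, r_expert, r_policy)
    # Assumption: the absorbing state keeps the source's reward forever.
    v_abs = r / (1.0 - gamma)
    v_next = np.where(done > 0.5, v_abs, soft_value(q_next, alpha))
    target = r + gamma * v_next
    return np.mean((q_sa - target) ** 2)
```

With this choice, an expert transition into an absorbing state has target `r_expert / (1 - gamma)` (100 for the defaults), while a policy transition into one has target 0, so reaching termination is rewarded only when the expert does it.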

robfiras / ls-iq, issue #2