Open spktrm opened 10 months ago
@perolat any ideas?
@lanctot is there a better channel to get in contact with @perolat - I feel as though he may have missed my email.
I just chatted with him and will send him the currently open questions later today. Is this currently the only unresolved one?
Hi,
Both this issue and this one are still unresolved: https://github.com/google-deepmind/open_spiel/issues/1075
Keen to hear back :)
In the example for RNaD, the importance sampling correction in `get_loss_nerd` is 1. This is because the example is the on-policy case: the policy is updated synchronously between acting and learning.
My question is: what needs to change for this example to be used in an asynchronous, off-policy setting? Is it as simple as replacing the importance sampling correction with a policy ratio term? What would that look like exactly?
How could I construct the importance sampling correction for the off-policy case?
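To make the question concrete, here is a minimal sketch of the kind of policy ratio term I have in mind. The function name, shapes, and clipping constant are my own assumptions (loosely following V-trace/IMPALA-style clipped ratios, not OpenSpiel's actual API) — is this roughly the right idea?

```python
import numpy as np

def importance_weights(pi_probs, mu_probs, actions, clip=1.0):
    """Per-step clipped importance ratios rho_t = pi(a_t|s_t) / mu(a_t|s_t).

    pi_probs, mu_probs: [T, A] arrays of action probabilities under the
        learner's current policy pi and the behavior policy mu that
        generated the trajectory (assumed stored alongside the data).
    actions: [T] integer actions actually taken.
    clip: upper truncation on the ratio, as in V-trace-style corrections.
    """
    t = np.arange(len(actions))
    rho = pi_probs[t, actions] / mu_probs[t, actions]
    # In the on-policy example pi == mu, so rho is identically 1 and
    # the correction in get_loss_nerd reduces to the constant 1.
    return np.minimum(rho, clip)
```

For example, with a behavior policy of `[0.5, 0.5]` and a current policy of `[0.8, 0.2]`, taking action 0 gives a raw ratio of 1.6, truncated to 1.0 under the default clip.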