Open spktrm opened 10 months ago
@perolat any ideas?
@lanctot is there a better channel to get in contact with @perolat - I feel as though he may have missed my email.
I just chatted with him and will send him the currently open questions later today. Is this currently the only unresolved one?
Hi,
Both this issue and this one are still unresolved: https://github.com/google-deepmind/open_spiel/issues/1075
Keen to hear back :)
In the example for RNaD, the importance sampling correction in `get_loss_nerd` is 1. This is because the example is the on-policy case: the policy is updated synchronously between acting and learning.
My question is: what needs to change for this example to be used in an asynchronous, off-policy setting? Is it as simple as replacing the importance sampling correction with a policy ratio term? What would that look like exactly?
How could I construct the importance sampling correction for the off-policy case?
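To make the question concrete, here is a minimal sketch of the kind of policy ratio term I have in mind. The function name, shapes, and clipping constant are my own assumptions (loosely following V-trace/IMPALA-style clipped ratios, not OpenSpiel's actual API) — is this roughly the right idea?

```python
import numpy as np

def importance_weights(pi_probs, mu_probs, actions, clip=1.0):
    """Per-step clipped importance ratios rho_t = pi(a_t|s_t) / mu(a_t|s_t).

    pi_probs, mu_probs: [T, A] arrays of action probabilities under the
        learner's current policy pi and the behavior policy mu that
        generated the trajectory (assumed stored alongside the data).
    actions: [T] integer actions actually taken.
    clip: upper truncation on the ratio, as in V-trace-style corrections.
    """
    t = np.arange(len(actions))
    rho = pi_probs[t, actions] / mu_probs[t, actions]
    # In the on-policy example pi == mu, so rho is identically 1 and
    # the correction in get_loss_nerd reduces to the constant 1.
    return np.minimum(rho, clip)
```

For example, with a behavior policy of `[0.5, 0.5]` and a current policy of `[0.8, 0.2]`, taking action 0 gives a raw ratio of 1.6, truncated to 1.0 under the default clip.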