google-research / recsim_ng

RecSim NG: Toward Principled Uncertainty Modeling for Recommender Ecosystems
https://github.com/google-research/recsim_ng
Apache License 2.0

Incorrect implementation of REINFORCE algorithm #3

Open · ehtsham opened this issue 2 years ago

ehtsham commented 2 years ago

```python
last_state = tf_runtime.execute(num_steps=horizon - 1)
last_metric_value = last_state['metrics state'].get(metric_to_optimize)
log_prob = last_state['slate docs_log_prob_accum'].get('doc_ranks')
objective = -tf.tensordot(tf.stop_gradient(last_metric_value), log_prob, 1)
```

In the code snippet above (interest_evolution_simulation.py, lines 45-48), the cumulative log-prob (accumulated across the entire episode) is weighted by the total cumulative reward from the start to the end of the episode. This seems incorrect: the log-prob at step 't' should be weighted by the cumulative reward from step 't' onwards, not from the start of the episode. See the sketch below.
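
To make the suggestion concrete, here is a minimal sketch of the weighting I have in mind. This is not existing library code: `rewards` and `log_probs` are hypothetical `batch_size x T` tensors of per-step rewards and per-step action log-probabilities (e.g. as one could collect from a full trajectory).

```python
import tensorflow as tf


def reinforce_objective_reward_to_go(rewards, log_probs):
  """REINFORCE surrogate loss with reward-to-go weighting.

  Each action's log-prob is weighted by the return accumulated from that
  step onward, rather than by the full-episode return.

  Args:
    rewards: float tensor of shape [batch_size, T], per-step rewards.
    log_probs: float tensor of shape [batch_size, T], per-step
      log p_theta(a_t).

  Returns:
    Scalar surrogate loss whose gradient is the REINFORCE estimate.
  """
  # Reward-to-go: G_t = sum_{s >= t} R_s, via a reversed cumulative sum.
  returns_to_go = tf.cumsum(rewards, axis=1, reverse=True)
  # The return weight is treated as a constant w.r.t. the policy parameters.
  weighted = tf.stop_gradient(returns_to_go) * log_probs
  return -tf.reduce_mean(tf.reduce_sum(weighted, axis=1))
```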

mmladenov-google commented 2 years ago

Hi Ehtsham,

I believe you might be misreading this, please correct me if I misunderstood you:

here, we implement the most naive form of REINFORCE, which weights the total log-prob of the trajectory by the total reward, i.e. $\left(\sum_t R(a_t)\right)\cdot\left(\sum_t \log p_\theta(a_t)\right)$. `log_prob` is thus a tensor of `batch_size x 1` containing the sum of all action log-probs in an individual trajectory, and `last_metric_value` accordingly contains the sum of all rewards (again `batch_size x 1`). I believe you're interpreting `log_prob` to be a tensor of `batch_size x T` containing cumulative sums of log-probs, which would have been the case had we called `tf_runtime.trajectory` instead of `tf_runtime.execute`; in that case the reward tensor would have to be adjusted as you suggest. Is my reading correct?

Best, Martin

ehtsham commented 2 years ago

Hi Martin, you've understood my point correctly, and thanks for elaborating on the implementation; it is exactly what I interpreted as well. I understood that the `log_prob` tensor is only `batch_size x 1` and gets weighted by the total reward of the trajectory. This naive implementation means that the log-probability term of an action taken at time 't' is weighted by rewards generated prior to time 't', whereas the return is supposed to be computed from time 't' onwards. Yes, the code should have used the `tf_runtime.trajectory` function instead, although further changes would be needed so that the weight for each action at time 't' is the cumulative reward from that time onwards (currently that is not happening).

mmladenov-google commented 2 years ago

I think that for the case of finite horizons, both approaches end up estimating the same quantity. You're describing the nested version, which is derived by differentiating the Bellman equation, whereas we use the non-recursive version derived directly from the flat form of the reward expectation, see e.g. the last equation of slide 8 here. We chose the latter for simplicity.
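
Roughly, the reason the two estimators agree in expectation (undiscounted, finite horizon, and glossing over conditioning details) is that the cross terms where the reward precedes the action vanish:

$$
\mathbb{E}\Bigg[\Big(\sum_s R_s\Big)\Big(\sum_t \nabla_\theta \log p_\theta(a_t)\Big)\Bigg]
= \sum_{s,t} \mathbb{E}\big[R_s \,\nabla_\theta \log p_\theta(a_t)\big],
$$

and for $s < t$ each term is zero: conditioned on the history $h_t$ up to time $t$, $R_s$ is already determined, while $\mathbb{E}\big[\nabla_\theta \log p_\theta(a_t) \mid h_t\big] = \sum_a \nabla_\theta p_\theta(a \mid h_t) = \nabla_\theta 1 = 0$. Only the $s \ge t$ terms survive, which is exactly the reward-to-go weighting.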

ehtsham commented 2 years ago

Thanks Martin, I'm not sure the two would be equivalent. For example, it's not clear how one would study the impact of discounting in this implementation. I'll try to contribute the other implementation to the project :)
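
As a rough sketch of what the discounted variant could look like (again assuming a hypothetical `batch_size x T` tensor of per-step `rewards`, e.g. collected via `tf_runtime.trajectory`; this is not current library code):

```python
import tensorflow as tf


def discounted_returns_to_go(rewards, gamma=0.99):
  """Computes G_t = sum_{s >= t} gamma^(s - t) * R_s for each step t.

  Args:
    rewards: float tensor of shape [batch_size, T], per-step rewards.
    gamma: scalar discount factor in [0, 1].

  Returns:
    Float tensor of shape [batch_size, T] of discounted returns-to-go.
  """
  # Work in time-major order so tf.scan iterates over steps.
  rewards_time_major = tf.transpose(rewards)  # [T, batch_size]
  # Backward recursion G_t = R_t + gamma * G_{t+1}, scanned in reverse.
  returns = tf.scan(
      lambda acc, r: r + gamma * acc,
      rewards_time_major,
      initializer=tf.zeros_like(rewards_time_major[0]),
      reverse=True)
  return tf.transpose(returns)  # [batch_size, T]
```

The idea would then be to weight the per-step log-probs by `tf.stop_gradient` of these returns, with `gamma=1.0` recovering the undiscounted reward-to-go case.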

mmladenov-google commented 2 years ago

Oh, one more thing I realized when thinking this through. I suspect that the recursive gradient estimator might not be correct when the agent suffers from an aliased state representation (e.g. in a POMDP with an incomplete belief-state representation, as is the case in this example), since the Bellman equation holds only in a local sense; see e.g. Theorem 1 in this paper. In this regime, the state transition probabilities become policy-dependent and will introduce additional terms in the derivative. This could lead to interesting bias-variance trade-offs. I'd be curious to see what happens if you have the time to experiment with it.