Closed: kmalhotra7 closed this issue 3 years ago.

Hi,
I am looking for a way to output the probabilities of all the actions for a given context, but I can't find a way to do so. The 'emit_log_prob' option always gives the value 0 for the action chosen. I have also tried policy.distribution(context), but that didn't help either.
What I would like to see is the probabilities of all the actions for a particular context, not just for the action chosen by the policy.
cc @ebrevdo, could you PTAL at this question about LinUCB?
@bartokg can you ptal? Issue with the bandits code.
IMO the best way to emit all probabilities is to add a policy_info
that emits a vector of all probabilities. The policy can be responsible for filling it out.
The recommended way is to turn on PREDICTED_REWARDS_MEAN in the PolicyInfo.
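A minimal sketch of what that can look like, assuming a LinUCB agent and an already-built bandit environment (the variable names below are illustrative, not from this thread):

```python
# Sketch: ask the LinUCB agent to emit predicted_rewards_mean in its policy info.
# `environment` is assumed to be an existing bandit TF environment
# (e.g. a ClassificationBanditEnvironment).
from tf_agents.bandits.agents import lin_ucb_agent
from tf_agents.bandits.policies import policy_utilities

agent = lin_ucb_agent.LinearUCBAgent(
    time_step_spec=environment.time_step_spec(),
    action_spec=environment.action_spec(),
    emit_policy_info=(policy_utilities.InfoFields.PREDICTED_REWARDS_MEAN,))

# After training, the per-arm reward estimates are attached to each policy step.
policy_step = agent.policy.action(environment.reset())
print(policy_step.info.predicted_rewards_mean)  # one estimate per action
```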
Ok, thanks... will try that now and keep you posted. Thanks for the quick response on this.
Hi, I am using the ClassificationBanditEnvironment provided by TF-Agents and applying it to a binary classification problem, so there are only 2 actions in my action space, denoted by '0' and '1'. After training the model and enabling PREDICTED_REWARDS_MEAN as part of the policy_info, here is the result I'm getting when I run policy.action(context) for one of the contexts:
PolicyStep(action=<tf.Tensor: shape=(1,), dtype=int32, numpy=array([1], dtype=int32)>, state=(), info=PolicyInfo(log_probability=
I have 2 follow-up questions:
1. The log probabilities of both actions are still not displayed in the policy info. As you can see, the log_probability inside PolicyInfo still has the value 0.
2. With respect to predicted_rewards_mean, the array looks like numpy=array([[10.15142, 10.000551]]). Doesn't that mean the mean reward for action '0' is 10.15142 and that for action '1' is 10.000551? If so, why did the policy choose action '1', which has the lower mean reward, and not '0'? Am I interpreting this correctly?
Any help would be greatly appreciated! Thanks.
LinUCB chooses the action with the highest optimistic reward estimate (the predicted mean plus an upper confidence bound), not the highest predicted mean, which is why action '1' can be selected even though its mean is lower. You can emit these optimistic estimates by including 'predicted_rewards_optimistic' in the emit_policy_info parameter. This string literal is conveniently stored in policy_utilities.InfoFields.PREDICTED_REWARDS_OPTIMISTIC.
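For example (a sketch; plain string literals work equally well in the tuple):

```python
# Sketch: request both the mean and the optimistic (mean + confidence bound)
# per-arm estimates, using the constants from policy_utilities.InfoFields.
from tf_agents.bandits.policies import policy_utilities

emit_policy_info = (
    policy_utilities.InfoFields.PREDICTED_REWARDS_MEAN,
    policy_utilities.InfoFields.PREDICTED_REWARDS_OPTIMISTIC,
)
# Pass this tuple as the emit_policy_info argument when constructing the agent.
```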
Hi,
I added all the fields
emit_policy_info=('log_probability', 'predicted_rewards_mean', 'predicted_rewards_optimistic', 'predicted_rewards_sampled', 'bandit_policy_type'),
It works fine without the 'predicted_rewards_optimistic' field, but if I include it in the list, I get the following error message:
ValueError: action output and policy_step_spec structures do not match: PolicyStep(action=., state=(), info=PolicyInfo(log_probability=., predicted_rewards_mean=., predicted_rewards_optimistic=., predicted_rewards_sampled=., bandit_policy_type=())) vs. PolicyStep(action=., state=(), info=PolicyInfo(log_probability=., predicted_rewards_mean=., predicted_rewards_optimistic=(), predicted_rewards_sampled=., bandit_policy_type=()))
It seems like you caught a bug! In the linear bandit agent, the function _populate_policy_info_spec (https://github.com/tensorflow/agents/blob/master/tf_agents/bandits/policies/linear_bandit_policy.py#L346) does not populate the field "predicted_rewards_optimistic". It should be done analogously to the "predicted_rewards_sampled" case. I'll send out a change tomorrow. Thanks for following up and noticing!
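For illustration, the mismatch in the error above comes down to the declared info spec leaving predicted_rewards_optimistic as an empty tuple while the policy emits a tensor for it. A rough, hypothetical sketch of the spec shapes involved (not the actual upstream fix):

```python
# Hypothetical illustration of the spec mismatch; not the library's fix itself.
import tensorflow as tf
from tf_agents.bandits.policies import policy_utilities

num_actions = 2
reward_spec = tf.TensorSpec([num_actions], tf.float32)

# Buggy spec: predicted_rewards_optimistic is left at its default empty tuple,
# so it disagrees with the info the policy actually emits at action time.
buggy_info_spec = policy_utilities.PolicyInfo(
    predicted_rewards_mean=reward_spec,
    predicted_rewards_sampled=reward_spec)

# Fixed spec: the optimistic field gets a spec entry, analogous to the
# predicted_rewards_sampled case.
fixed_info_spec = buggy_info_spec._replace(
    predicted_rewards_optimistic=reward_spec)
```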
Ok, perfect! Is there a way to get notified when you do?
Also, as you explained, predicted_rewards_optimistic is the mean + upper confidence bound. In that case, what is predicted_rewards_sampled? Is there documentation with their explanations that you can point me to?
The change is here: https://github.com/tensorflow/agents/commit/5a84c64531d9d2881e413142038d6acec9b40df5
As for predicted_rewards_sampled: it plays the same role for Thompson Sampling that predicted_rewards_optimistic plays for LinUCB. In TS, the policy applies argmax to a set of rewards sampled from a distribution whose means come from predicted_rewards_mean.
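To make the distinction concrete with toy numbers (this is not library code, just the relationship between the three fields):

```python
# Toy illustration: LinUCB ranks arms by mean + confidence width
# (predicted_rewards_optimistic), while Thompson Sampling ranks them by a
# draw centered on the mean (predicted_rewards_sampled). Numbers are made up.
import numpy as np

rng = np.random.default_rng(0)
predicted_rewards_mean = np.array([10.15142, 10.000551])
confidence_width = np.array([0.05, 0.40])  # wider for the less-explored arm

predicted_rewards_optimistic = predicted_rewards_mean + confidence_width
predicted_rewards_sampled = rng.normal(predicted_rewards_mean, confidence_width)

print(np.argmax(predicted_rewards_optimistic))  # LinUCB's choice
print(np.argmax(predicted_rewards_sampled))     # LinTS's choice
```

With these made-up widths, arm 1 wins under the optimistic score even though arm 0 has the higher mean, which is the situation described in the earlier question.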
Thanks @bartokg for the fix! Closing the issue.