tensorflow / agents

TF-Agents: A reliable, scalable, and easy-to-use TensorFlow library for Contextual Bandits and Reinforcement Learning.
Apache License 2.0

Probability of all actions in Contextual Bandits (LinUCB) #592

Closed. kmalhotra7 closed this issue 3 years ago.

kmalhotra7 commented 3 years ago

Hi,

I am looking for a way to output the probabilities of all the actions for a given context, but I can't find a way to do so. The 'emit_log_prob' option always gives the value 0 for the chosen action. I have also tried policy.distribution(context), but that didn't help either.

What I would like to see is the probabilities of all the actions for a particular context, not just the probability of the action chosen by the policy.

egonina commented 3 years ago

cc @ebrevdo, could you PTAL at this question about LinUCB?

ebrevdo commented 3 years ago

@bartokg can you ptal? Issue with the bandits code.

ebrevdo commented 3 years ago

IMO the best way to emit all probabilities is to add a policy_info field that holds a vector of all probabilities. The policy can be responsible for filling it out.

efiko commented 3 years ago

The recommended way is to turn on PREDICTED_REWARDS_MEAN in the PolicyInfo.
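A minimal sketch of this approach, assuming a LinUCB agent built with tf_agents.bandits.agents.lin_ucb_agent; the context dimension, specs, and example context below are hypothetical placeholders for the real setup:

# Minimal sketch: request per-arm mean reward estimates via emit_policy_info.
# The specs and the example context are hypothetical placeholders.
import tensorflow as tf
from tf_agents.bandits.agents import lin_ucb_agent
from tf_agents.bandits.policies import policy_utilities
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

context_dim = 4   # hypothetical context size
num_actions = 2   # two arms, '0' and '1'

observation_spec = tensor_spec.TensorSpec([context_dim], tf.float32)
time_step_spec = ts.time_step_spec(observation_spec)
action_spec = tensor_spec.BoundedTensorSpec(
    shape=(), dtype=tf.int32, minimum=0, maximum=num_actions - 1)

agent = lin_ucb_agent.LinearUCBAgent(
    time_step_spec=time_step_spec,
    action_spec=action_spec,
    # Ask the policy to attach per-arm mean reward estimates to its output.
    emit_policy_info=(policy_utilities.InfoFields.PREDICTED_REWARDS_MEAN,))

# After training, query the policy for one context and read the estimates.
context = tf.constant([[0.1, 0.2, 0.3, 0.4]], dtype=tf.float32)
policy_step = agent.policy.action(ts.restart(context, batch_size=1))
print(policy_step.info.predicted_rewards_mean)  # shape (1, num_actions)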

kmalhotra7 commented 3 years ago

Ok, thanks... I will try that now and keep you posted. Thanks for the quick response on this.

kmalhotra7 commented 3 years ago

Hi, I am using the ClassificationBanditEnvironment provided by TF-Agents and applying it to a binary classification problem, so there are only two actions in my action space, denoted by '0' and '1'. After training the model and enabling PREDICTED_REWARDS_MEAN as part of the policy_info, here is the result I get when I run 'policy.action(context)' for one of the contexts:

PolicyStep(action=<tf.Tensor: shape=(1,), dtype=int32, numpy=array([1], dtype=int32)>, state=(), info=PolicyInfo(log_probability=, predicted_rewards_mean=<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[10.15142 , 10.000551]], dtype=float32)>, predicted_rewards_optimistic=(), predicted_rewards_sampled=(), bandit_policy_type=()))

I have 2 follow up questions:

  1. The log probabilities of both actions are still not displayed in the policy info. As you can see, 'log_probability' inside PolicyInfo still has the value '0'.

  2. With respect to predicted_rewards_mean, the array is numpy=array([[10.15142, 10.000551]]). Doesn't that mean the mean reward for action '0' is 10.15142 and the mean reward for action '1' is 10.000551? If so, why does the policy choose action '1', which has the lower mean reward, rather than '0'? Am I interpreting this correctly?

Any help would be greatly appreciated! Thanks.

bartokg commented 3 years ago

  1. LinUCB is a deterministic algorithm, so the log-probabilities are not filled with anything meaningful.
  2. LinUCB chooses an action based on upper confidence bounds, which are calculated as the predicted mean rewards plus some bonus. The argmax is taken over these values, and they are stored in the policy info field "predicted_rewards_optimistic", but only if you list this string in the agent's emit_policy_info parameter. This string literal is conveniently stored in policy_utilities.InfoFields.PREDICTED_REWARDS_OPTIMISTIC.
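Building on the earlier sketch (the agent construction and policy_step are assumed from that hypothetical setup), requesting the optimistic (UCB) values would look roughly like this:

# Sketch: also request the UCB scores that LinUCB actually maximizes over.
from tf_agents.bandits.policies import policy_utilities

emit_policy_info = (
    policy_utilities.InfoFields.PREDICTED_REWARDS_MEAN,
    # Predicted mean plus exploration bonus; LinUCB takes the argmax of these.
    policy_utilities.InfoFields.PREDICTED_REWARDS_OPTIMISTIC,
)

# Build the LinearUCBAgent with emit_policy_info as above, then:
# policy_step = agent.policy.action(ts.restart(context, batch_size=1))
# tf.argmax(policy_step.info.predicted_rewards_optimistic, axis=-1) should
# agree with policy_step.action, whereas the argmax of predicted_rewards_mean
# can differ (as observed in this issue).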

kmalhotra7 commented 3 years ago

Hi,

I added all the fields

emit_policy_info=('log_probability', 'predicted_rewards_mean', 'predicted_rewards_optimistic', 'predicted_rewards_sampled', 'bandit_policy_type'),

It works fine without the 'predicted_rewards_optimistic' field, but if I include it in the list, I get the following error message:

ValueError: action output and policy_step_spec structures do not match: PolicyStep(action=., state=(), info=PolicyInfo(log_probability=., predicted_rewards_mean=., predicted_rewards_optimistic=., predicted_rewards_sampled=., bandit_policy_type=())) vs. PolicyStep(action=., state=(), info=PolicyInfo(log_probability=., predicted_rewards_mean=., predicted_rewards_optimistic=(), predicted_rewards_sampled=., bandit_policy_type=()))

bartokg commented 3 years ago

It seems like you caught a bug! In the linear bandit agent, the function _populate_policy_info_spec (https://github.com/tensorflow/agents/blob/master/tf_agents/bandits/policies/linear_bandit_policy.py#L346) does not populate the field "predicted_rewards_optimistic". It should be done analogously to the "predicted_rewards_sampled" case. I'll send out a change tomorrow. Thanks for following up and noticing!

kmalhotra7 commented 3 years ago

Ok, perfect! Is there a way to get notified when you do?

Also, as you explained, predicted_rewards_optimistic is the predicted mean plus the upper-confidence bonus. In that case, what is predicted_rewards_sampled? Is there documentation explaining these fields that you can point me to?

bartokg commented 3 years ago

The change is here: https://github.com/tensorflow/agents/commit/5a84c64531d9d2881e413142038d6acec9b40df5

As for predicted_rewards_sampled: it plays the same role for Thompson Sampling that predicted_rewards_optimistic plays for LinUCB. In TS, the policy applies argmax to a set of rewards sampled from a distribution whose means come from predicted_rewards_mean.
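For comparison, a hedged sketch of the analogous Thompson Sampling setup, assuming TF-Agents' linear Thompson Sampling agent and reusing the hypothetical specs from the earlier LinUCB sketch:

# Sketch: request the per-arm sampled rewards that TS argmaxes over.
from tf_agents.bandits.agents import linear_thompson_sampling_agent
from tf_agents.bandits.policies import policy_utilities

ts_agent = linear_thompson_sampling_agent.LinearThompsonSamplingAgent(
    time_step_spec=time_step_spec,
    action_spec=action_spec,
    emit_policy_info=(
        policy_utilities.InfoFields.PREDICTED_REWARDS_MEAN,
        # Rewards sampled around the predicted means; TS argmaxes over these.
        policy_utilities.InfoFields.PREDICTED_REWARDS_SAMPLED,
    ))

# policy_step = ts_agent.policy.action(ts.restart(context, batch_size=1))
# policy_step.info.predicted_rewards_sampled then holds the sampled values.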

oars commented 3 years ago

Thanks @bartokg for the fix! Closing the issue.