tensorflow / agents

TF-Agents: A reliable, scalable and easy to use TensorFlow library for Contextual Bandits and Reinforcement Learning.

Can tf.agent policy return probability vector for all actions? #454

Open bing-zhao opened 4 years ago

bing-zhao commented 4 years ago

I am trying to train a Reinforcement Learning agent following the TF-Agents DQN tutorial. In my application, I have 9 discrete actions (labeled 0 to 8), and I would like to get the probability vector over all actions computed by the trained policy, so I can do further processing in other application environments. However, the policy only returns a log_probability with a single value rather than a vector over all actions. Is there any way to get the probability vector?

import tensorflow as tf

from tf_agents.networks import q_network
from tf_agents.agents.dqn import dqn_agent
from tf_agents.policies import policy_saver
from tf_agents.utils import common

q_net = q_network.QNetwork(
            env.observation_spec(),
            env.action_spec(),
            fc_layer_params=(32,)
        )

optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=0.001)

# epsilon (exploration rate) and global_step (train step counter) are defined earlier in the script
my_agent = dqn_agent.DqnAgent(
    env.time_step_spec(),
    env.action_spec(),
    q_network=q_net,
    epsilon_greedy=epsilon,
    optimizer=optimizer,
    emit_log_probability=True,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=global_step)

my_agent.initialize()

...  # training

tf_policy_saver = policy_saver.PolicySaver(my_agent.policy)
tf_policy_saver.save('./policy_dir/')

# making decision using the trained policy
action_step = my_agent.policy.action(time_step)

In dqn_agent.DqnAgent(), I set emit_log_probability=True, which is documented as controlling whether policies emit log probabilities or not.

However, when I run action_step = my_agent.policy.action(time_step), it returns:

PolicyStep(action=<tf.Tensor: shape=(1,), dtype=int64, numpy=array([1], dtype=int64)>, state=(), info=PolicyInfo(log_probability=<tf.Tensor: shape=(1,), dtype=float32, numpy=array([0.], dtype=float32)>))

I also tried to run action_distribution = saved_policy.distribution(time_step). It returns:

PolicyStep(action=<tfp.distributions.DeterministicWithLogProb 'Deterministic' batch_shape=[1] event_shape=[] dtype=int64>, state=(), info=PolicyInfo(log_probability=<tf.Tensor: shape=(), dtype=float32, numpy=0.0>))

If there is no such API available in TF-Agents, is there a way to get such a probability vector? Thanks.

summer-yue commented 4 years ago

Ummm could you print out env.action_spec() for your environment?

bing-zhao commented 4 years ago

Ummm could you print out env.action_spec() for your environment?

Thanks for the response. The output of env.action_spec() is as below:

BoundedTensorSpec(shape=(), dtype=tf.int64, name='action', minimum=array(0, dtype=int64), maximum=array(8, dtype=int64))

By the way, is there a way to return the Q-values for the different actions? If that is possible, maybe I can just pass the Q-values through a softmax function and get the probabilities?
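A minimal sketch of that idea (not an official TF-Agents API): the QNetwork passed to the agent is itself a callable that returns the Q-values for all actions, so it can be called directly on an observation and its output passed through a softmax. This assumes q_net and a batched time_step from the same environment are still in scope.

import tensorflow as tf

# Call the Q-network directly; it returns (q_values, network_state).
q_values, _ = q_net(time_step.observation, step_type=time_step.step_type)

# Turn the Q-values into a probability vector over the 9 actions.
action_probs = tf.nn.softmax(q_values, axis=-1)  # shape: [batch_size, 9]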

David-zreo commented 3 years ago

Hello Bing, I have the same problem as you. Could you tell me how you solved it, please?

apurva-octro commented 2 years ago

Hello @bing-zhao, I am also facing the same issue. Did you find any solution for getting the Q-values?

FalsitaFine commented 1 year ago

This might be helpful: in greedy_policy.py, look at the function def _distribution(self, time_step, policy_state). There you can see that it returns DeterministicWithLogProb(loc=greedy_action), where greedy_action = dist.mode(), which is why the emitted probabilities are always 0 or 1. If you want a probability for each action, dist.prob(that action) on the underlying distribution is what you need.
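For reference, a rough sketch of that approach without editing the library code. It assumes the agent's greedy policy exposes the underlying Q-policy via the wrapped_policy property (as tf_agents.policies.greedy_policy.GreedyPolicy does in recent versions) and that its distribution is a categorical built from the Q-values:

import tensorflow as tf

# Bypass the greedy wrapper and query the underlying (stochastic) Q-policy.
inner_policy = my_agent.policy.wrapped_policy
dist_step = inner_policy.distribution(time_step)
action_dist = dist_step.action  # categorical distribution over the 9 actions

# Probability of one particular action...
p_action_3 = action_dist.prob(3)

# ...or the full probability vector for all 9 actions.
all_probs = action_dist.prob(tf.range(9, dtype=tf.int64))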