tensorflow / agents

TF-Agents: A reliable, scalable and easy to use TensorFlow library for Contextual Bandits and Reinforcement Learning.
Apache License 2.0

DQN Agent.policy() only returns same action out of 3 throughout the episode. random_policy works correctly. #636

Closed Hsgngr closed 3 years ago

Hsgngr commented 3 years ago

I have a TF environment for trading where I have 3 actions: Skip | Buy | Sell. For the training part I'm following the DQN Agent tutorial. When I compute the average return with random_policy I get results like this, which make sense:

Finished, Current_step:  8400
Money:  859.0181506800005
Trade Count: 2918

However, when I use agent.policy it only ever takes action 0, which is the Skip action, so after one episode it prints this:

Finished, Current_step:  8400
Money:  1000
Trade Count: 0
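
To confirm that the greedy policy really always prefers action 0, a quick check I can run (just a sketch, not in the notebook, using the q_net and agent defined in the code further down) is to print the Q-values and the greedy action for a single time step:

# Sketch (not in the notebook): inspect the Q-values behind agent.policy for one time step.
time_step = eval_env.reset()
q_values, _ = q_net(time_step.observation)              # shape: (1, num_actions)
greedy_action = agent.policy.action(time_step).action   # action chosen by the greedy eval policy
print('Q-values:     ', q_values.numpy())
print('Greedy action:', greedy_action.numpy())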

I started using TF-Agents recently. I found the code pretty easy to understand, but documentation and tutorials are lacking.

Since I know that random_policy works and produces meaningful results, there must be something off with the agent.policy part. I'm attaching the notebook here: DQN Training Notebook

As you can see in the notebook, the loss gets huge, so after some time training throws an error because the loss becomes inf or NaN.

You can also see the code blocks below (the same as the notebook, in case you cannot open it). Here is the code I'm using for training:

train_env = tf_py_environment.TFPyEnvironment(StockMarketEnvironment(config))
eval_env = tf_py_environment.TFPyEnvironment(StockMarketEnvironment(config))
fc_layer_params = (100, 50)
action_tensor_spec = tensor_spec.from_spec(train_env.action_spec())
num_actions = action_tensor_spec.maximum - action_tensor_spec.minimum + 1

# Define a helper function to create Dense layers configured with the right
# activation and kernel initializer.
def dense_layer(num_units):
  return tf.keras.layers.Dense(
      num_units,
      activation=tf.keras.activations.relu,
      kernel_initializer=tf.keras.initializers.VarianceScaling(
          scale=2.0, mode='fan_in', distribution='truncated_normal'))

# QNetwork consists of a sequence of Dense layers followed by a dense layer
# with `num_actions` units to generate one q_value per available action as
# its output.
dense_layers = [dense_layer(num_units) for num_units in fc_layer_params]
q_values_layer = tf.keras.layers.Dense(
    num_actions,
    activation=None,
    kernel_initializer=tf.keras.initializers.RandomUniform(
        minval=-0.03, maxval=0.03),
    bias_initializer=tf.keras.initializers.Constant(-0.2))
q_net = sequential.Sequential(dense_layers + [q_values_layer])
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

train_step_counter = tf.Variable(0)

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)

agent.initialize()
eval_policy = agent.policy
collect_policy = agent.collect_policy
random_policy = random_tf_policy.RandomTFPolicy(train_env.time_step_spec(),
                                                train_env.action_spec())
def compute_avg_return(environment, policy, num_episodes=10):

  total_return = 0.0
  for _ in range(num_episodes):

    time_step = environment.reset()
    episode_return = 0.0

    while not time_step.is_last():
      action_step = policy.action(time_step)
      time_step = environment.step(action_step.action)
      episode_return += time_step.reward
    total_return += episode_return

  avg_return = total_return / num_episodes
  return avg_return.numpy()[0]

# See also the metrics module for standard implementations of different metrics.
# https://github.com/tensorflow/agents/tree/master/tf_agents/metrics
compute_avg_return(eval_env, random_policy, num_eval_episodes)
$Console:
Finished, Current_step:  8400
Money:  856.4146823100001
Trade Count: 2848
Finished, Current_step:  8400
Money:  857.7814843700004
Trade Count: 2850
Finished, Current_step:  8400
Money:  860.4180388899999
Trade Count: 2851
Finished, Current_step:  8400
Money:  865.7346459800023
Trade Count: 2769
Finished, Current_step:  8400
Money:  846.4235596200016
Trade Count: 2818
Finished, Current_step:  8400
Money:  858.7975110799977
Trade Count: 2748
Finished, Current_step:  8400
Money:  870.9079134400006
Trade Count: 2756
Finished, Current_step:  8400
Money:  856.4682145700002
Trade Count: 2782
Finished, Current_step:  8400
Money:  859.2196140200002
Trade Count: 2806
Finished, Current_step:  8400
Money:  881.0172930299997
Trade Count: 2848
-6.9715714
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=replay_buffer_max_length)
def collect_step(environment, policy, buffer):
  time_step = environment.current_time_step()
  action_step = policy.action(time_step)
  next_time_step = environment.step(action_step.action)
  traj = trajectory.from_transition(time_step, action_step, next_time_step)

  # Add trajectory to the replay buffer
  buffer.add_batch(traj)

def collect_data(env, policy, buffer, steps):
  for _ in range(steps):
    collect_step(env, policy, buffer)

collect_data(train_env, random_policy, replay_buffer, initial_collect_steps)
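
One extra check I would add at this point (a sketch, not in the notebook): sample a batch from the buffer and look at the reward scale, since very large rewards would also explain the very large TD losses further down.

# Sketch (not in the notebook): sanity-check the reward scale of the collected data.
sample_dataset = replay_buffer.as_dataset(sample_batch_size=64, num_steps=2)
sample_traj, _ = next(iter(sample_dataset))
print('reward min/max/mean:',
      tf.reduce_min(sample_traj.reward).numpy(),
      tf.reduce_max(sample_traj.reward).numpy(),
      tf.reduce_mean(sample_traj.reward).numpy())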

# This loop is so common in RL, that we provide standard implementations. 
# For more details see tutorial 4 or the drivers module.
# https://github.com/tensorflow/agents/blob/master/docs/tutorials/4_drivers_tutorial.ipynb 
# https://www.tensorflow.org/agents/api_docs/python/tf_agents/drivers
# Dataset generates trajectories with shape [Bx2x...]
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3, 
    sample_batch_size=batch_size, 
    num_steps=2).prefetch(3)

dataset

$Console:
WARNING:tensorflow:AutoGraph could not transform <bound method ReplayBuffer.get_next of <tf_agents.replay_buffers.tf_uniform_replay_buffer.TFUniformReplayBuffer object at 0x7f1b556a9370>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Index'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <bound method ReplayBuffer.get_next of <tf_agents.replay_buffers.tf_uniform_replay_buffer.TFUniformReplayBuffer object at 0x7f1b556a9370>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Index'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:From /home/ege/anaconda3/envs/gym_2/lib/python3.9/site-packages/tf_agents/replay_buffers/tf_uniform_replay_buffer.py:338: ReplayBuffer.get_next (from tf_agents.replay_buffers.replay_buffer) is deprecated and will be removed in a future version.
Instructions for updating:
Use `as_dataset(..., single_deterministic_pass=False) instead.
<PrefetchDataset shapes: (Trajectory(
{action: (64, 2),
 discount: (64, 2),
 next_step_type: (64, 2),
 observation: (64, 2, 17),
 policy_info: (),
 reward: (64, 2),
 step_type: (64, 2)}), BufferInfo(ids=(64, 2), probabilities=(64,))), types: (Trajectory(
{action: tf.int32,
 discount: tf.float32,
 next_step_type: tf.int32,
 observation: tf.float32,
 policy_info: (),
 reward: tf.float32,
 step_type: tf.int32}), BufferInfo(ids=tf.int64, probabilities=tf.float32))>

iterator = iter(dataset)
print(iterator)

$Console:
<tensorflow.python.data.ops.iterator_ops.OwnedIterator object at 0x7f1b55036b50>
# Cell-timing block copied from the DQN tutorial notebook (the %%time magic only works inside a notebook).
try:
  %%time
except:
  pass

# (Optional) Optimize by wrapping some of the code in a graph using TF function.
#agent.train = common.function(agent.train)

# Reset the train step
agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
returns = [avg_return]

for _ in range(num_iterations):

  # Collect a few steps using collect_policy and save to the replay buffer.
  collect_data(train_env, agent.collect_policy, replay_buffer, collect_steps_per_iteration)

  # Sample a batch of data from the buffer and update the agent's network.
  experience, unused_info = next(iterator)
  train_loss = agent.train(experience).loss

  step = agent.train_step_counter.numpy()

  if step % log_interval == 0:
    print('step = {0}: loss = {1}'.format(step, train_loss))

  if step % eval_interval == 0:
    avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
    print('step = {0}: Average Return = {1}'.format(step, avg_return))
    returns.append(avg_return)

$Console:
Finished, Current_step:  8400
Money:  1000
Trade Count: 0
Finished, Current_step:  8400
Money:  1000
Trade Count: 0
Finished, Current_step:  8400
Money:  1000
Trade Count: 0
Finished, Current_step:  8400
Money:  1000
Trade Count: 0
Finished, Current_step:  8400
Money:  1000
Trade Count: 0
Finished, Current_step:  8400
Money:  1000
Trade Count: 0
Finished, Current_step:  8400
Money:  1000
Trade Count: 0
Finished, Current_step:  8400
Money:  1000
Trade Count: 0
Finished, Current_step:  8400
Money:  1000
Trade Count: 0
Finished, Current_step:  8400
Money:  1000
Trade Count: 0
WARNING:tensorflow:From /home/ege/anaconda3/envs/gym_2/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:201: calling foldr_v2 (from tensorflow.python.ops.functional_ops) with back_prop=False is deprecated and will be removed in a future version.
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.foldr(fn, elems, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.foldr(fn, elems))
step = 500: loss = 10704652337152.0
step = 1000: loss = 1067789401980928.0
step = 1500: loss = 1.5638079003951104e+16
step = 2000: loss = 1.3182904757859123e+17
step = 2500: loss = 1.4300397695716557e+17
step = 3000: loss = 6.319313200831529e+17
step = 3500: loss = 1.1142676922960445e+18
Finished, Current_step:  8400
Money:  954.0970968899992
Trade Count: 886
step = 4000: loss = 6.692395363199877e+17
step = 4500: loss = 3.776880715526832e+18
step = 5000: loss = 1.2178947253246886e+19
Finished, Current_step:  8400
Money:  1000
Trade Count: 1
Finished, Current_step:  8400
Money:  1000
Trade Count: 1
Finished, Current_step:  8400
Money:  1000
Trade Count: 1
Finished, Current_step:  8400
Money:  1000
Trade Count: 1
Finished, Current_step:  8400
Money:  1000
Trade Count: 1
Finished, Current_step:  8400
Money:  1000
Trade Count: 1
Finished, Current_step:  8400
Money:  1000
Trade Count: 1
Finished, Current_step:  8400
Money:  1000
Trade Count: 1
Finished, Current_step:  8400
Money:  1000
Trade Count: 1
Finished, Current_step:  8400
Money:  1000
Trade Count: 1
step = 5000: Average Return = -4.1414265632629395
step = 5500: loss = 5.794712701158556e+18
step = 6000: loss = 1.5195534369864286e+19
step = 6500: loss = 1.7819131841458733e+19
step = 7000: loss = 1.5725802439661584e+19
step = 7500: loss = 3.1736417973335753e+19
Finished, Current_step:  8400
Money:  984.3364667000004
Trade Count: 547
step = 8000: loss = 3.4682563778008056e+19
step = 8500: loss = 5.1093452121727566e+19
step = 9000: loss = 1.510782104966078e+20
step = 9500: loss = 1.6112591718713183e+20
step = 10000: loss = 1.517131124909508e+20

As you can see, the loss is enormous, and after some more steps training throws an error because the loss becomes inf or NaN.
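
If it helps, this is the kind of variation I am planning to try next (just a sketch, I have not verified that it fixes anything): scaling the rewards and clipping gradients through the DqnAgent arguments, and updating the target network less often, since the exploding loss looks like diverging Q-targets.

# Sketch of what I plan to try next: same setup as above with extra DqnAgent arguments.
agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    reward_scale_factor=0.01,    # shrink the TD targets if the raw rewards are large
    gradient_clipping=1.0,       # keep a few bad batches from blowing up the weights
    target_update_period=100,    # update the target network less often
    epsilon_greedy=0.1,          # collect_policy still explores
    train_step_counter=train_step_counter)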

sguada commented 3 years ago

Sorry, it's impossible to know whether the problem is in the environment or whether the network cannot learn a good policy.

A few tips to debug it: only collect data before you start training, and make sure the training loss can go down.
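
In terms of the code above, that would look roughly like this (just a sketch; the step counts are arbitrary): fill the buffer once with the random policy, do no further collection, and check whether the loss trends down on that fixed dataset.

# Sketch of the suggested debugging loop: collect once, then train only from the buffer.
collect_data(train_env, random_policy, replay_buffer, 10000)  # one-off collection

for _ in range(2000):
  experience, _ = next(iterator)
  loss = agent.train(experience).loss
  step = agent.train_step_counter.numpy()
  if step % 100 == 0:
    print('step = {0}: loss = {1}'.format(step, loss))
# If the loss does not go down on this fixed dataset, the problem is in the
# network / loss setup or the reward scale rather than in the data collection.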