google-research / seed_rl

SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference. Implements IMPALA and R2D2 algorithms in TF2 with SEED's architecture.

Loading and running trained models #41

Closed · sharsnik2 closed this issue 4 years ago

sharsnik2 commented 4 years ago

I've successfully trained a network on my custom environment. Now I'd like to observe the activity of the hidden states as the network navigates the environment. As such, I'm loading the latest checkpoint and trying to mimic the inference/environment step cycle in a single script. I will then gather the agent outputs and analyse them later. The code I'm using is pasted below.

The issue is that while this code produces better returns than an uninitialized network, they are nowhere near as good as the returns I'm getting from the eval agent during training (or even from the non-eval agents). So it seems I must be missing something that is done during the training loop.

One possible issue is that when loading the checkpoint, the learner expects both agent AND target_agent, but I'm only restoring the target agent: ckpt = tf.train.Checkpoint(target_agent=agent)
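
For reference, this is a minimal sketch of how restoring the online weights could look instead (checkpoint_dir is a placeholder, and I'm assuming the R2D2 learner checkpointed both agent and target_agent):

# Sketch: restore the online agent's weights from the latest learner checkpoint.
# Objects in the checkpoint that aren't mapped here (optimizer, target network,
# etc.) are ignored via expect_partial().
ckpt = tf.train.Checkpoint(agent=agent)
latest = tf.train.latest_checkpoint(checkpoint_dir)  # checkpoint_dir: placeholder
ckpt.restore(latest).expect_partial()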

# Assumptions not shown in this snippet: `agent` is the same network class used
# during training, `env` is my custom environment, and `agent_outputs`,
# `agent_states` and `env_outputs` are utils.Aggregator instances sized for a
# single environment, set up the same way as in the learner/actor code.
# `encode`/`decode` are the same helpers the learner uses around inference.

with strategy.scope():
    @tf.function
    def inference(*args):
        return agent(*decode(args))

observation = env.reset()
reward = 0.0
raw_reward = 0.0
done = False
zeroIndex = tf.constant([0], dtype=tf.int32)  # slot 0: the only environment

# Clear the previous action and reset the recurrent state for slot 0.
agent_outputs.reset(zeroIndex)
agent_states.replace(zeroIndex, agent.initial_state(1))

while not done:
    env_output = utils.EnvOutput(reward, done, observation)

    env_outputs.replace(zeroIndex, env_output)

    # Feed the previous action and the latest environment output to the agent.
    input_ = encode((agent_outputs.read(zeroIndex), env_outputs.read(zeroIndex)))
    agent_output, agent_state = inference(input_, agent_states.read(zeroIndex))

    # Store the chosen action and the new recurrent state for the next step.
    agent_outputs.replace(zeroIndex, agent_output.action)
    agent_states.replace(zeroIndex, agent_state)

    observation, reward, done, info = env.step(agent_output.action.numpy()[0])

sharsnik2 commented 4 years ago

I have a few updates. I've been running the code with the exact same environment layout each time, so my code returns the same total reward on every run. The base seed_rl code (an eval agent with epsilon = 0), however, returns the SAME value as my code on the first run-through, but then different values on all subsequent runs.

The temporal progression of the hidden state is the same for EVERY run of my code and for the FIRST run of the seed_rl code, but after that the seed_rl code follows a different hidden-state trajectory (which is then the same for all subsequent runs).

What is causing this difference between the first and subsequent runs during training? One thing I found is that seed_rl doesn't seem to reset the previous action when the environment resets; is this intended? However, even after forcing that reset, the behavior above persists.

Update: the reward is also not cleared on environment reset. This is what was causing the divergence between training and my code.

Is there a reason that the previous action and previous reward are not reset when the environment resets? This seems pretty counter-intuitive to me, as one would expect the agent's response to a static environment to also be static.
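
For anyone hitting the same thing, this is roughly the reset I mean, as a sketch using the variable names from my snippet above (not the actual seed_rl code):

# Sketch: clear the previous action, reward and recurrent state whenever the
# environment resets, so the first step of every episode sees identical inputs.
if done:
    observation = env.reset()
    reward = 0.0
    done = False
    agent_outputs.reset(zeroIndex)                            # previous action -> 0
    agent_states.replace(zeroIndex, agent.initial_state(1))   # fresh recurrent state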

lespeholt commented 4 years ago

Will look into it, but that is indeed counter-intuitive. However, it shouldn't really affect the agent's score, since one can always feed the agent an arbitrary reward/action if the zero action and zero reward happen to be a really bad starting point for some reason.

lespeholt commented 4 years ago

Which agent are you trying to load, R2D2 or V-trace?

sharsnik2 commented 4 years ago

I was loading an R2D2 agent. I also used a custom network based on the Atari network.

You can indeed give a non-zero action/reward pair as a starting point, but it's a bit tricky to know what range of starting points is likely to be "good". It's even more confusing because the range of "good" starting points is likely to change as training progresses (e.g. once the agent starts to win most games, the first reward it sees will increase dramatically).

lespeholt commented 4 years ago

I definitely agree. Was just pointing out that it may not be the root cause of why you can't reproduce the training results.

sharsnik2 commented 4 years ago

Yeah, for sure. My code above does seem to work now that I've cleared the reward and action on environment reset, so I'll close this thread for now.

Thanks!

Edvard-D commented 3 years ago

Hi @sharsnik2, do you possibly still have the full file you used to run a trained model? I have a model of my own I'm trying to run, but I'm having a hard time understanding how to set everything up correctly. If you'd be willing to share it, that would be a huge help!

sharsnik commented 3 years ago

Hey @Edvard-D. The code I've been using to run experiments (i.e. observe the activity of a trained agent) is here: https://github.com/sharsnik2/seed_rl/blob/unity/unity/runExperiment.py

I set this up for my custom Unity environment, so it will probably (definitely) not work out of the box for you, but hopefully it's a good starting point. If you have specific questions, I'm happy to discuss.

Edvard-D commented 3 years ago

Thanks for the reply. I did manage to get it working after posting my question, but your link will be very useful for anyone looking to do the same in the future.