openai / coinrun

Code for the paper "Quantifying Generalization in Reinforcement Learning"
https://blog.openai.com/quantifying-generalization-in-reinforcement-learning/
MIT License

why the next_state never changes? #31

Closed Unimax closed 5 years ago

Unimax commented 5 years ago

Please help me understand why the previous state is always equal to the next state. If that is the case, how can any neural network learn from the state?

import numpy as np
from q_learning.utils import Scalarize
from coinrun import make, setup_utils

def testing():
    setup_utils.setup_and_load()
    episodes = 10
    env = Scalarize(make('standard', num_envs=1))
    for i in range(episodes):
        previous_state = env.reset()
        while True:
            env.render()
            action = np.random.randint(0, env.action_space.n)
            next_state, reward, done, info = env.step(action)
            print("current state is equal to previous state : ", np.array_equal(next_state, previous_state))

            previous_state = next_state
            if done or reward > 0:
                break
def main():
    testing()

if __name__ == '__main__':
    main()

Output:

....
current state is equal to previous state :  True
current state is equal to previous state :  True
current state is equal to previous state :  True
current state is equal to previous state :  True
current state is equal to previous state :  True
current state is equal to previous state :  True
current state is equal to previous state :  True
...
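
A likely explanation, consistent with the copy.copy() call in the testing code later in this thread, is that the vectorized environment appears to reuse the same observation buffer on every step, so previous_state and next_state end up referencing the same array that is mutated in place. Copying the observation before storing it makes the comparison behave as expected. A minimal sketch of the loop body under that assumption:

import copy

previous_state = copy.copy(env.reset())
while True:
    env.render()
    action = np.random.randint(0, env.action_space.n)
    next_state, reward, done, info = env.step(action)
    # compare against a snapshot; without the copy both names may alias the env's buffer
    print("current state is equal to previous state : ", np.array_equal(next_state, previous_state))
    previous_state = copy.copy(next_state)
    if done or reward > 0:
        break
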
sanjayyyyyy commented 5 years ago

@Unimax How do you get the percentage of levels solved per timestep, as shown in the paper, for both training and testing? And how do I get the average reward values per timestep?

Unimax commented 5 years ago

@sanjayyyyyy The average reward reported by the --train-eval or --test commands can be multiplied by 10 to get the percentage of levels solved. For example, train with python -m coinrun.train_agent --run-id myrun --num-levels 500 --set-seed 13, then evaluate with python -m coinrun.enjoy --train-eval --restore-id myrun -num-eval 8 -rep 500, and multiply the final average reward (which is out of 10) by 10.
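
For example, if evaluation prints an average reward of 7.5, that corresponds to 75% of levels solved, since a solved level yields a reward of 10.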

I wrote my own testing function for a PyTorch-based model, which looks something like this:

import copy
import os

import torch

from coinrun import make
# DQN and utils (which provides the Scalarize wrapper) are project-local modules from my repo

def test(config):
    """Test routine: evaluate a trained DQN greedily on CoinRun levels."""

    env = utils.Scalarize(make('standard', num_envs=1))
    agent = DQN(env.observation_space.shape, env.action_space.n)
    device = torch.device("cuda" if config.enable_gpu and torch.cuda.is_available() else "cpu")
    agent = agent.to(device)
    bestmodel_file = os.path.join(config.save_dir, config.model_filename)
    load_res = torch.load(bestmodel_file, map_location="cpu")
    agent.load_state_dict(load_res["model"])
    agent.eval()
    success = 0
    total_steps = 0
    for i in range(config.testing_level_count):
        state = env.reset()
        ep_reward = 0
        ep_length = 0
        while True:
            if config.render_play:
                env.render()
            # greedy action from the Q-network (no exploration during testing)
            state = torch.FloatTensor(state).unsqueeze(0).to(device)
            with torch.no_grad():
                action = agent(state).argmax(1).item()

            next_state, reward, done, info = env.step(action)

            ep_length += 1
            ep_reward += reward

            # copy the observation: the vectorized env appears to reuse the same buffer
            # between steps (see the question above), so keep our own snapshot
            state = copy.copy(next_state)

            if done:
                print("episode: {} , episode reward: {} , length: {}".format(i, ep_reward, ep_length))
                break

        if ep_reward > 0:
            success += 1
        total_steps += ep_length

    print("Testing result: {} % completed. Avg. episode length: {}".format(
        success * 100 / config.testing_level_count, total_steps / config.testing_level_count))
    env.close()

The average reward per timestep is (total average reward / average episode length); the total average reward will be between 0 and 10, and the average episode length will be less than 250.
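
For example, a total average reward of 8 over an average episode length of 160 steps gives 8 / 160 = 0.05 reward per timestep.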

Personally, I use a custom reward scheme: -1 for each timestep, -500 for a death, and +500 for collecting the coin, etc.
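
This shaping is not part of CoinRun itself; a rough sketch of how such a scheme could be layered on top of the Scalarize wrapper (the values, and the assumption that the raw env only gives a positive reward for the coin, are illustrative):

class ShapedReward:
    """Wraps a scalarized CoinRun env and replaces its reward with a shaped one."""
    STEP_PENALTY = -1
    DEATH_PENALTY = -500
    COIN_BONUS = 500

    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def render(self):
        return self.env.render()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        shaped = self.STEP_PENALTY
        if reward > 0:
            # assumption: the raw env only gives a positive reward for collecting the coin
            shaped += self.COIN_BONUS
        elif done:
            # episode ended without the coin: count it as a death/timeout
            shaped += self.DEATH_PENALTY
        return obs, shaped, done, info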

Note: I am not a contributor to this repo. I am using CoinRun in my own project to apply DQN to it; here is my imperfect code: https://github.com/Unimax/coinrun-dqn-pytorch

sanjayyyyyy commented 5 years ago

I saw the work in your repository. I'm trying to implement an attention-based DRQN. Do you think it is a good idea?

Unimax commented 5 years ago

So far I have trained DQN and DDQN for up to 1000 episodes and have not been able to get a decent agent on 5 random levels, except once (a rare case). Most of my graphs so far are for a single fixed level. I am planning to connect an LSTM to my network next week to see what changes, and to look at the impact of longer training on multi-level training with DQN variants.

I would like to see the results of DRQN. I think it should help, and in that case we could even disable the velocity painted on the observation images. But I think it would require long training to get good results in the multi-level case (for example, the paper trained on 500 levels). I had limited time to submit my project, so I have not tried very long training runs or many other configurations yet.
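
For illustration, a minimal sketch (not from either repo; layer sizes are placeholders for 64x64 RGB frames) of what inserting an LSTM between a convolutional encoder and the Q-value head might look like in PyTorch:

import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Convolutional encoder -> LSTM -> Q-values over a sequence of frames."""
    def __init__(self, num_actions, hidden_size=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(input_size=64 * 4 * 4, hidden_size=hidden_size, batch_first=True)
        self.q_head = nn.Linear(hidden_size, num_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, seq_len, 3, 64, 64)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.reshape(b * t, *frames.shape[2:]))
        out, hidden = self.lstm(feats.reshape(b, t, -1), hidden)
        return self.q_head(out), hidden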

Also, I am no expert :) I just finished my first term and have learned a bit of deep learning.