openai / spinningup

An educational resource to help anyone learn deep reinforcement learning.
https://spinningup.openai.com/
MIT License

Pytorch PPO Implementation, dimension difference #384

Open kevin-mahon opened 1 year ago

kevin-mahon commented 1 year ago

Hello, and apologies if I do this wrong; I don't contribute to open source often. I was attempting to run the PyTorch PPO implementation and kept getting several errors regarding the dimension of the observation 'o'. I believe this is because the return values from the environment are being mishandled. See my proposed changes below.

Original:

o, ep_ret, ep_len = env.reset(), 0, 0

# Main loop: collect experience in env and update/log each epoch
for epoch in range(epochs):
    for t in range(local_steps_per_epoch):
        a, v, logp = ac.step(torch.as_tensor(o, dtype=torch.float32))

        next_o, r, d, _ = env.step(a)
        ep_ret += r
        ep_len += 1

        # save and log
        buf.store(o, a, r, v, logp)
        logger.store(VVals=v)

        # Update obs (critical!)
        o = next_o

        timeout = ep_len == max_ep_len
        terminal = d or timeout
        epoch_ended = t==local_steps_per_epoch-1

        if terminal or epoch_ended:
            if epoch_ended and not(terminal):
                print('Warning: trajectory cut off by epoch at %d steps.'%ep_len, flush=True)
            # if trajectory didn't reach terminal state, bootstrap value target
            if timeout or epoch_ended:
                _, v, _ = ac.step(torch.as_tensor(o, dtype=torch.float32))
            else:
                v = 0
            buf.finish_path(v)
            if terminal:
                # only save EpRet / EpLen if trajectory finished
                logger.store(EpRet=ep_ret, EpLen=ep_len)
            o, ep_ret, ep_len = env.reset(), 0, 0

To fix this, I simply discard the 'info' return value from env.reset(), but I imagine this might not work for all environments (for example, if their env.step() also returns extra values).

Adjusted:

(o,_), ep_ret, ep_len = env.reset(), 0, 0

# Main loop: collect experience in env and update/log each epoch
for epoch in range(epochs):
    for t in range(local_steps_per_epoch):
        a, v, logp = ac.step(torch.as_tensor(o, dtype=torch.float32))

        next_o, r, d, _ = env.step(a)
        ep_ret += r
        ep_len += 1

        # save and log
        buf.store(o, a, r, v, logp)
        logger.store(VVals=v)

        # Update obs (critical!)
        o = next_o

        timeout = ep_len == max_ep_len
        terminal = d or timeout
        epoch_ended = t==local_steps_per_epoch-1

        if terminal or epoch_ended:
            if epoch_ended and not(terminal):
                print('Warning: trajectory cut off by epoch at %d steps.'%ep_len, flush=True)
            # if trajectory didn't reach terminal state, bootstrap value target
            if timeout or epoch_ended:
                _, v, _ = ac.step(torch.as_tensor(o, dtype=torch.float32))
            else:
                v = 0
            buf.finish_path(v)
            if terminal:
                # only save EpRet / EpLen if trajectory finished
                logger.store(EpRet=ep_ret, EpLen=ep_len)
            (o,_), ep_ret, ep_len = env.reset(), 0, 0

Let me know if anyone else has run into this problem or if this fix works.
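
A more defensive variant of that change, which doesn't assume a particular Gym version, could inspect the return value of reset() instead of hard-coding the tuple shape. This is just a sketch, and the reset_env helper name below is made up (it isn't part of Spinning Up):

def reset_env(env):
    # Hypothetical helper: handle both the old reset() -> obs convention
    # and the newer reset() -> (obs, info) convention.
    out = env.reset()
    if isinstance(out, tuple) and len(out) == 2 and isinstance(out[1], dict):
        return out[0]
    return out

# The main loop would then start with:
o, ep_ret, ep_len = reset_env(env), 0, 0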

Alberto-Hache commented 1 year ago

Hi, Kevin. I regularly use that same algorithm, PyTorch PPO, and have never had that problem. What error are you getting exactly, and on which environment(s)?

Best.

Alberto

kevin-mahon commented 1 year ago

Hello, I'm using a custom environment that follows the OpenAI Gym API. This is the error I get (although this time for TD3):

Traceback (most recent call last):
  File "src/laser_gym.py", line 43, in <module>
    train_test()
  File "src/laser_gym.py", line 25, in train_test
    td3(env_fn=env_fn, ac_kwargs=ac_kwargs, steps_per_epoch=1000, epochs=260, logger_kwargs=logger_kwargs)
  File "/home/kevin/work/spinningup/spinup/algos/pytorch/td3/td3.py", line 334, in td3
    test_agent()
  File "/home/kevin/work/spinningup/spinup/algos/pytorch/td3/td3.py", line 276, in test_agent
    o, r, d, _ = test_env.step(get_action(o, 0))
  File "/home/kevin/work/spinningup/spinup/algos/pytorch/td3/td3.py", line 267, in get_action
    a = ac.act(torch.as_tensor(o, dtype=torch.float32))
ValueError: expected sequence of length 2 at dim 1 (got 1)

My reset and step functions are defined as follows:

def reset(self, options=None):
    self._agent_location = np.random.random_integers(int(self.LOW/2), int(self.HIGH/2), size=2).astype(self.DTYPE)
    obs = self._get_obs()
    info = self._get_info()
    return obs, info

def step(self, action):
    _new_agent_location = self._agent_location + np.matmul(self.response_matrix,action)
    rew = self.reward.get_reward(self._agent_location, action, _new_agent_location)
    self._agent_location = _new_agent_location

    if (np.abs(self._agent_location) > self.HIGH).any() or (np.abs(self._agent_location) < 0.05).all():
        done = True #hit the edge
    else:
        done = False 

    obs = self._get_obs()
    info = self._get_info()

    if self.render_mode == 'human':
        self.render()

    return obs, rew, done, info
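
One way to reconcile this without editing the Spinning Up algorithms would be a thin wrapper that presents the old-style API to them. A minimal sketch, assuming the environment above is otherwise Gym-compatible; the class name and MyLaserEnv are placeholders, not real identifiers from this project:

import gym

class OldAPIWrapper(gym.Wrapper):
    # Hypothetical wrapper: expose the old Gym conventions to Spinning Up,
    # i.e. reset() -> obs and step() -> (obs, reward, done, info).
    def reset(self, **kwargs):
        out = self.env.reset(**kwargs)
        if isinstance(out, tuple) and len(out) == 2:
            return out[0]  # drop the info dict from a new-style reset
        return out

    def step(self, action):
        out = self.env.step(action)
        if len(out) == 5:  # new-style (obs, r, terminated, truncated, info)
            obs, r, terminated, truncated, info = out
            return obs, r, terminated or truncated, info
        return out         # already old-style (obs, r, done, info)

# env_fn = lambda: OldAPIWrapper(MyLaserEnv())  # MyLaserEnv is a placeholder name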

I am a little confused how this could work for other people, since it seems like all environments should return the same tuple (https://www.gymlibrary.dev/api/core/#gym.Env.reset).
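
For what it's worth, the reset/step signatures changed between Gym releases: Spinning Up's reference implementations were written against the older API, while the page linked above documents the newer one, which would explain why environments following the old convention run unmodified. A rough side-by-side of the two conventions (illustration only):

# Older Gym API (what spinup's ppo/td3 training loops expect):
#     o = env.reset()
#     o2, r, d, info = env.step(a)
#
# Newer Gym (roughly 0.26+) / Gymnasium API (what the reset() above follows):
#     o, info = env.reset()
#     o2, r, terminated, truncated, info = env.step(a)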