toshikwa / gail-airl-ppo.pytorch

PyTorch implementation of GAIL and AIRL based on PPO.
MIT License

How to obtain rewards from the discriminator #6

Closed: yaoxt3 closed this issue 2 years ago

yaoxt3 commented 2 years ago

Hi, thanks for the wonderful work! I integrated this code into the Metaworld benchmark and trained an agent to solve the drawer-close task. The trained agent can successfully solve the task, but the rewards calculated by the recovered reward function are very small. The following code shows how I get rewards from the discriminator:

import numpy as np
import torch

device = torch.device('cuda:0')
actor.to(device)
disc.to(device)

state = env.reset()
done = False
episode_reward = 0.0
pre_episode_reward = 0.0

while not done:
    action = algo.exploit(state)
    torch_state = torch.from_numpy(np.float32(state)).to(device)
    torch_action = torch.from_numpy(np.float32(action)).to(device)
    # Log-probability of the exploited action under the current policy.
    log_pis = actor.evaluate_log_pi(torch_state, torch_action)

    state, reward, done, _ = env.step(action)
    next_state = torch.from_numpy(np.float32(state)).to(device)

    # Reward recovered by the AIRL discriminator for this transition.
    pre_reward = disc.calculate_reward(torch_state, done, log_pis, next_state)
    pre_reward = float(pre_reward)
    episode_reward += reward
    pre_episode_reward += pre_reward

    print(f'reward: {reward}, pre_reward: {pre_reward}')

The pre_reward is much smaller than the ground-truth reward. How can I get rewards close to the ground-truth reward? Thanks. The recovered reward satisfies 0 < pre_reward < 2, while the ground-truth reward ranges from -1 to 4000.

yaoxt3 commented 2 years ago

I normalised the values of the recovered reward function into the range of 2 to 4000, and the recovered reward function then shows asymptotic performance consistent with the ground-truth reward function.
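
For anyone who hits the same question: below is a minimal sketch of that kind of affine min-max rescaling. The helper name rescale_reward and the exact source/target ranges are illustrative assumptions taken from the numbers reported above, not part of this repository.

import numpy as np

def rescale_reward(r, src_min, src_max, dst_min, dst_max):
    # Affinely map a recovered reward from [src_min, src_max] to [dst_min, dst_max].
    r = np.clip(r, src_min, src_max)
    scale = (dst_max - dst_min) / (src_max - src_min)
    return dst_min + (r - src_min) * scale

# Illustrative values based on the ranges reported in this issue:
# recovered rewards roughly in (0, 2), ground-truth rewards roughly in (-1, 4000).
scaled = rescale_reward(1.3, src_min=0.0, src_max=2.0, dst_min=-1.0, dst_max=4000.0)
print(scaled)  # ~2599.65

Note that an AIRL-style recovered reward is in general only identified up to scale and shaping, so its absolute magnitude is not expected to match the environment reward; rescaling like this is mainly useful for comparing learning curves side by side.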