toshikwa / gail-airl-ppo.pytorch

PyTorch implementation of GAIL and AIRL based on PPO.
MIT License

Potential bug during training? #11

Open liubaoryol opened 1 year ago

liubaoryol commented 1 year ago

Is there a reason you calculate the reward the way you do in line 69? https://github.com/toshikwa/gail-airl-ppo.pytorch/blob/4e13a23454600a16d5aeeeb4c09338308115455e/gail_airl_ppo/algo/airl.py#L69

My models were able to learn after I changed that line to

        with torch.no_grad():
            # Use only the unshaped reward term g(s) from the AIRL discriminator.
            rewards = self.disc.g(states)

This gives the unshaped reward g(s), rather than the full shaped, entropy-regularized reward computed from the discriminator logits.
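
For context, here is a minimal side-by-side sketch of the two reward definitions being compared in this issue. It assumes the discriminator exposes AIRL's reward network `g(s)` and shaping network `h(s)` as sub-modules, with logits of the form `f(s, s') - log_pi` as in the repository; the function below is illustrative, not the repository's actual code.

    import torch
    import torch.nn.functional as F

    def airl_reward(disc, states, dones, log_pis, next_states, gamma=0.995, shaped=True):
        # Illustrative comparison of the two reward variants discussed in this issue.
        # Assumes disc.g(s) is AIRL's reward term and disc.h(s) its shaping term, so that
        # f(s, s') = g(s) + (1 - done) * gamma * h(s') - h(s).
        with torch.no_grad():
            if shaped:
                # Shaped, entropy-regularized reward: -log(1 - D) with D = sigmoid(f - log_pi).
                f = disc.g(states) + (1.0 - dones) * gamma * disc.h(next_states) - disc.h(states)
                return -F.logsigmoid(-(f - log_pis))
            # Variant proposed above: the unshaped reward term g(s) alone.
            return disc.g(states)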

Charlesyyun commented 1 year ago

Did that work out for you? I found that my actor loss was unable to converge.

liubaoryol commented 1 year ago

Yes it did, although I was running it on discrete state and action environments. Which env are you using?

mikhail-vlasenko commented 1 year ago

@liubaoryol It is great to hear that you got it working with a discrete action space! Could you please share your code? I think it would be valuable, as multiple people here have already asked about discrete action support. Thanks in advance.

liubaoryol commented 1 year ago

Of course! Let me clean it up and I'll share it next week. :)

jagandecapri commented 1 year ago

I'm also interested in the implementation for discrete action support. :)
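
Since several commenters are asking about discrete actions, here is a generic sketch of a discrete-action policy head built on torch.distributions.Categorical, which is one common way to replace a Gaussian policy when the action space is discrete. This is an illustration only, not the code promised above; the class name, layer sizes, and method names are placeholders.

    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    class DiscretePolicy(nn.Module):
        """Categorical policy head for discrete action spaces (illustrative sketch)."""

        def __init__(self, state_dim, num_actions, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, hidden), nn.Tanh(),
                nn.Linear(hidden, num_actions),
            )

        def forward(self, states):
            # Raw action logits.
            return self.net(states)

        def sample(self, states):
            # Sample actions and return their log-probabilities (needed by PPO/AIRL updates).
            dist = Categorical(logits=self.net(states))
            actions = dist.sample()
            return actions, dist.log_prob(actions)

        def evaluate_log_pi(self, states, actions):
            # Log-probability of given actions under the current policy.
            dist = Categorical(logits=self.net(states))
            return dist.log_prob(actions)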

ChenYunan commented 1 month ago

reward = -logsigmoid(-logits) = -log[1 - sigmoid(logits)] = -log(1 - D), which corresponds to the generator objective of minimizing log(1 - D).
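
A quick numerical check of that identity, written as a standalone snippet rather than code from the repository:

    import torch
    import torch.nn.functional as F

    logits = torch.randn(5)
    d = torch.sigmoid(logits)                              # D = sigmoid(logits)
    reward_a = -F.logsigmoid(-logits)                      # -logsigmoid(-logits)
    reward_b = -torch.log(1.0 - d)                         # -log(1 - D)
    print(torch.allclose(reward_a, reward_b, atol=1e-6))   # True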