toshikwa / fqf-iqn-qrdqn.pytorch

PyTorch implementation of FQF, IQN and QR-DQN.

one of the variables needed for gradient computation has been modified by an inplace operation #3

Closed: RaviKumarAndroid closed this issue 4 years ago

RaviKumarAndroid commented 4 years ago

I am using PyTorch 1.5.0 and I am getting the error "one of the variables needed for gradient computation has been modified by an inplace operation".

When I enabled torch anomaly detection to find which tensor was being modified in place, the error pointed to this line: https://github.com/ku2482/fqf-iqn-qrdqn.pytorch/blob/542a6e57cdbc8c467495215c5348800942037bfa/fqf_iqn_qrdqn/network.py#L71

Note: it works when I downgrade to PyTorch 1.4.0. I am unable to find where the issue is in order to make it work on torch 1.5.0.

RaviKumarAndroid commented 4 years ago

To reproduce this issue, just run FQF with torch 1.5.0.

toshikwa commented 4 years ago

Hi, @RaviKumarAndroid

Thank you for sharing!! I quickly fixed the issue, so could you try it??

The cause of the problem is the entropy loss, which depends on the fraction proposal network; networks other than the fraction network should never be trained on the entropy loss (the IQN part treats the fractions as fixed, outside its control). In practice, because I set entropy_coef=0.0, this bug shouldn't have affected the performance of my experiments.
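For anyone curious, here is a minimal sketch of the general idea (toy stand-ins, not the repository's actual patch): detaching the shared embeddings keeps the entropy term from back-propagating into anything except the fraction proposal network.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the repository's modules, just to illustrate the idea.
feature_net = nn.Linear(4, 8)    # shared feature extractor (e.g. DQNBase)
fraction_net = nn.Linear(8, 5)   # fraction proposal network (logits over N fractions)

state = torch.randn(2, 4)
embeddings = feature_net(state)

# Detach so the entropy term cannot back-propagate into the feature extractor;
# only the fraction proposal network receives gradients from this loss.
logits = fraction_net(embeddings.detach())
probs = torch.softmax(logits, dim=-1)
entropies = -(probs * probs.log()).sum(dim=-1)

entropy_coef = 1e-3              # the repo's default is 0.0, so results were unaffected
entropy_loss = -entropy_coef * entropies.mean()
entropy_loss.backward()

print(feature_net.weight.grad)   # None: the shared extractor gets no gradient from this loss
print(fraction_net.weight.grad)  # populated: only fraction_net is trained on the entropy term
```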

Anyway, thank you so much for your advice :)

RaviKumarAndroid commented 4 years ago

It works now. What do you think the results would be if we added sequential memory and LSTM cells to the DQNBase network, similar to DRQN (Deep Recurrent Q-Network) but with the algorithm changed to FQF?

I believe it has the potential to further improve FQF's results on Atari for partially observable MDPs. You can think of it as a combination of FQF, R2D2 and Rainbow. It has already been tried with IQN (https://opherlieber.github.io/rl/2019/09/22/recurrent_iqn); I am working on trying it with FQF (recurrent FQF).
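As a rough illustration (a hypothetical `RecurrentDQNBase`, not code from this repo), the change would be to insert an LSTM between the usual Atari conv stack and the FQF heads, processing sequences of observations:

```python
import torch
import torch.nn as nn

class RecurrentDQNBase(nn.Module):
    def __init__(self, num_channels=4, embedding_dim=512, hidden_dim=512):
        super().__init__()
        # Standard DQN conv stack for 84x84 Atari frames.
        self.conv = nn.Sequential(
            nn.Conv2d(num_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, embedding_dim), nn.ReLU(),
        )
        # Sequential memory: the LSTM carries context across timesteps, as in DRQN/R2D2.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

    def forward(self, states, hidden=None):
        # states: (batch, time, C, H, W)
        b, t = states.shape[:2]
        feats = self.conv(states.flatten(0, 1)).view(b, t, -1)
        out, hidden = self.lstm(feats, hidden)
        return out, hidden   # per-timestep embeddings fed to the FQF heads

# Example: a rollout of 8 stacked-frame observations for a batch of 2.
x = torch.randn(2, 8, 4, 84, 84)
embeddings, hidden = RecurrentDQNBase()(x)
print(embeddings.shape)  # torch.Size([2, 8, 512])
```

The bigger change in practice is the replay side: storing and sampling sequences together with the recurrent state, as R2D2 does.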

Thanks for the fix :)

toshikwa commented 4 years ago

Cool! I think it will work well with FQF, although I'm not sure why an LSTM helps so much (many Atari games no longer seem to be POMDPs once frames are stacked).

I hope it works well :)

RaviKumarAndroid commented 4 years ago

I think the LSTM has something to do with being able to learn and generalize the transition dynamics better than stacked frames can, although I could be wrong. Some additional linear features are also fed to the LSTM (though it's not clear how much, if at all, these help):

- One-hot encoded last action
- Last reward (clipped)
- Timestep (scaled to the [0, 1] range)

It kind of teaches the network how the previous action affects the current state (the transition dynamics).
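Something like this, roughly (illustrative names such as `build_lstm_input` and `max_steps`, not taken from any of the codebases mentioned): the extra scalars are just concatenated with the CNN embedding before the LSTM.

```python
import torch
import torch.nn.functional as F

def build_lstm_input(cnn_features, last_action, last_reward, timestep,
                     num_actions, max_steps=108_000):
    # One-hot encode the last action.
    action_onehot = F.one_hot(last_action, num_classes=num_actions).float()
    # Clip the last reward to [-1, 1], as usual for Atari.
    reward_clipped = last_reward.clamp(-1.0, 1.0).unsqueeze(-1)
    # Scale the timestep to the [0, 1] range.
    timestep_scaled = (timestep.float() / max_steps).unsqueeze(-1)
    return torch.cat(
        [cnn_features, action_onehot, reward_clipped, timestep_scaled], dim=-1)

# Example with a batch of 2 and a 512-dim CNN embedding.
feats = build_lstm_input(
    torch.randn(2, 512), torch.tensor([1, 3]), torch.tensor([0.0, 2.5]),
    torch.tensor([10, 5000]), num_actions=6)
print(feats.shape)  # torch.Size([2, 520])
```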

toshikwa commented 4 years ago

Thanks for sharing!

I see. I think the reward in particular should be a good indicator of how well the episode is going, which is informative for the agent. I'd like to see how recurrent FQF works, so could you share your results with me when you're finished?

Please feel free to ask me if you have any questions or need any help.