vincentkslim / cs285_homework_fall2020

CS285 Homework

Question on HW1 #4

Open JiaheXu opened 3 years ago

JiaheXu commented 3 years ago

In HW1's MLP_policy.py, if discrete is true, I think the output should be a one-hot vector, or at least, when taking actions in utils.py, you need to take the argmax. I am new to RL and not 100% sure, so please take a look. All HW1 problems are continuous, which is perhaps why your code works.
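To make the discrete case concrete, here is a minimal, framework-free sketch of turning a network's output logits into a single action index, either greedily (argmax) or by sampling. The function names and the example logits are hypothetical and are not taken from the homework code; in a PyTorch policy the sampling step would more likely go through something like `torch.distributions.Categorical`.

```python
import math
import random


def softmax(logits):
    """Convert raw logits into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_action(logits):
    """Deterministic choice: index of the largest logit (argmax)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_action(logits, rng):
    """Stochastic choice: draw an index with probability softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical logits for a 3-action environment.
logits = [0.1, 2.0, -1.0]
rng = random.Random(0)

print("greedy:", greedy_action(logits))
print("sampled:", sample_action(logits, rng))
```

Either way, the environment receives one action index rather than the raw output vector.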

mantle2048 commented 2 years ago

Hi, I quite agree with you.

Thanks to the author for the great code for the CS285 Fall 2020 homework!

There is a small problem in hw1.

In hw1's cs285/policies/MLP_policy.py, the author uses a deterministic policy (actions are output directly from self.mean_net).

This looks incorrect, because the original code in cs285/policies/MLP_policy.py also defines self.logstd, which only makes sense as part of a stochastic policy.
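The difference between the two variants can be sketched without any framework as follows. This is only an illustration under assumptions: `mean` stands in for the output of the mean network, `logstd` for a learned log standard deviation, and the function names are hypothetical. In the PyTorch homework code, the stochastic version would more likely be built with something like `torch.distributions.Normal`.

```python
import math
import random


def deterministic_action(mean):
    """Deterministic policy: the action is just the predicted mean."""
    return list(mean)


def stochastic_action(mean, logstd, rng):
    """Stochastic Gaussian policy: a = mean + exp(logstd) * eps, eps ~ N(0, 1)."""
    return [
        m + math.exp(ls) * rng.gauss(0.0, 1.0)
        for m, ls in zip(mean, logstd)
    ]


# Hypothetical 2-dimensional action with unit standard deviation (logstd = 0).
mean = [0.5, -0.2]
logstd = [0.0, 0.0]
rng = random.Random(0)

print("deterministic:", deterministic_action(mean))
print("stochastic:", stochastic_action(mean, logstd, rng))
```

As logstd becomes very negative, the sampled action collapses onto the mean, so the deterministic policy can be seen as the zero-noise limit of the stochastic one.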

In addition, after changing the author's code from a deterministic policy to a stochastic one, I found that BC performance on Ant-v2 dropped from 4k to 1.4k.

I think 1.4k is the performance BC should be expected to reach.