katerakelly / oyster

Implementation of Efficient Off-policy Meta-learning via Probabilistic Context Variables (PEARL)

Temperature coefficient not found in SAC #9

Closed lujiayou123 closed 4 years ago

lujiayou123 commented 5 years ago

Thanks for your great work and code! I have a few questions.

First, I notice that the temperature coefficient α is not used during SAC training, which differs from the original SAC algorithm. Why is that?

Second, why is `policy_loss = policy_loss + policy_reg_loss`? What do these regularization terms mean?
```python
# L2 penalties on the policy's mean and log-std outputs, plus a penalty on the
# pre-tanh action values, to keep the tanh-Gaussian policy from saturating.
mean_reg_loss = self.policy_mean_reg_weight * (policy_mean**2).mean()
std_reg_loss = self.policy_std_reg_weight * (policy_log_std**2).mean()
pre_tanh_value = policy_outputs[-1]
pre_activation_reg_loss = self.policy_pre_activation_weight * (
    (pre_tanh_value**2).sum(dim=1).mean()
)
policy_reg_loss = mean_reg_loss + std_reg_loss + pre_activation_reg_loss
policy_loss = policy_loss + policy_reg_loss
```

Third, in `rlkit/core/rl_algorithm.py` (lines 228 and 422), `context = self.sample_context(self.task_idx)` is called. Where is the function `sample_context` defined?

Finally, if we applied the automatic temperature-adjustment trick from SAC (arXiv:1812.05905) to PEARL, would PEARL perform better?

katerakelly commented 4 years ago

Hi, sorry for such a late reply - this issue slipped by me when it first came up!

Questions 1, 2, and 4 relate to automatic entropy tuning. I tried auto-entropy on the benchmark continuous control tasks with PEARL and did not observe an improvement, so I did not merge it into master. However, it might help in other tasks and would better align PEARL with the latest SAC, so I plan to clean this up and merge it soon.
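
For reference, a minimal sketch of the automatic temperature adjustment described in arXiv:1812.05905, written in PyTorch with assumed names (`log_alpha`, `target_entropy`, `log_pi`); this is not the code in this repository:

```python
import torch
import torch.optim as optim

# Learnable log-temperature; optimizing in log-space keeps alpha positive.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = optim.Adam([log_alpha], lr=3e-4)

# Heuristic target entropy from the SAC paper: -|A| (negative action dimension).
action_dim = 6  # assumed for illustration
target_entropy = -float(action_dim)

def update_alpha(log_pi):
    """One temperature update, given log-probs of freshly sampled actions."""
    # The gradient pushes alpha up when the policy's entropy is below the
    # target, and down when it is above it.
    alpha_loss = -(log_alpha * (log_pi + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().detach()
```

The resulting `alpha` would then weight the entropy term in the actor and critic losses in place of a fixed coefficient.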

Question 3 - that method is defined in sac.py. This is an incorrect use of abstraction, but at this point I think it's just going to stay that way.
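
Roughly speaking, its job is to sample a batch of transitions collected for the given task and stack them into the context tensor that feeds the inference network q(z|c). A rough sketch of that idea, using a hypothetical `enc_replay_buffer.random_batch` interface rather than the actual implementation in sac.py:

```python
import torch

def sample_context(enc_replay_buffer, task_idx, batch_size=100):
    """Sketch: sample transitions for one task and pack them as context.

    `enc_replay_buffer.random_batch` is a hypothetical interface used for
    illustration; the real method lives in sac.py and differs in detail.
    """
    batch = enc_replay_buffer.random_batch(task_idx, batch_size)
    obs = torch.as_tensor(batch['observations'], dtype=torch.float32)
    act = torch.as_tensor(batch['actions'], dtype=torch.float32)
    rew = torch.as_tensor(batch['rewards'], dtype=torch.float32)
    # Context fed to the inference network: one row per transition.
    return torch.cat([obs, act, rew], dim=1).unsqueeze(0)  # (1 task, batch, dim)
```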