rll-research / cic

CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery

The state and skill encoder learned with contrastive learning is never used? #6

Open xf-zhao opened 2 years ago

xf-zhao commented 2 years ago

Hi, thank you very much for sharing the code for the paper. Integrating contrastive learning into skill discovery is a very attractive idea.

However, I found that in this implementation the state encoder and skill encoder in the cic module ($g_{\psi_1}$ and $g_{\psi_2}$ in the paper) are never used to encode the inputs before they are fed into the policy networks. In cic/agent/cic.py line 222, the parameters in cic are updated once, but the module is not called to encode obs and skill thereafter.

Another question is how the agent can guarantee that the policy is "indeed conditioned on z", since the intrinsic reward has nothing to do with z. In other words, $\tau$ can be arbitrarily diverse, which is good for exploration, but there is no mechanism to ensure that the agent knows what the influence of z is.
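
To make the concern concrete, here is a minimal sketch of a particle-based (k-nearest-neighbor) entropy reward of the kind often used for this sort of exploration bonus. It is an assumption for illustration, not the repository's code: the function name `knn_entropy_reward`, the `k` value, and the toy embeddings are all hypothetical. The point is only that the skill z does not appear anywhere in the signature, so a reward of this form depends on how spread out the embeddings are and not on which skill produced them.

```python
import math

def knn_entropy_reward(embeddings, k=2):
    """Hypothetical k-NN entropy bonus: log(1 + mean distance to the k nearest
    neighbors). Note that no skill vector z is passed in."""
    rewards = []
    for i, e in enumerate(embeddings):
        # Distances from point i to every other point, smallest first.
        dists = sorted(
            math.dist(e, other)
            for j, other in enumerate(embeddings)
            if j != i
        )
        rewards.append(math.log(1.0 + sum(dists[:k]) / k))
    return rewards

# Toy embeddings (made up): three clustered points and one isolated point.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
rewards = knn_entropy_reward(embeddings, k=2)
```

In this toy example the isolated point receives the largest bonus, whatever skill was active when it was visited.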

I really like your work, but these issues confuse me a lot. Please correct me if I am wrong or have missed something. Thank you again for kindly sharing the code.

pickxiguapi commented 2 years ago

Hi, I have the same confusion. May I ask whether your question has been resolved? I think the parameters updated by contrastive learning are not being used.

xf-zhao commented 2 years ago

@pickxiguapi Hi, sorry, not solved. I think this is a mistake the author has not noticed, since the work still seems to be in progress and not fully finished.

seolhokim commented 2 years ago

Why should g1 and g2 be used after updating once? I think there is no reason to call them from anywhere else before finetuning.

kc-ustc commented 5 months ago

I would like to ask a simple question. During pre-training, I found that neg in compute.cpc_loss is approximately 1200, while pos is around 6. Is this normal?
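
For context on the magnitudes in the question above, here is a minimal sketch of how the positive and negative terms behave in a generic InfoNCE-style loss. It assumes L2-normalized embeddings and a temperature T, and it is not a copy of the repository's function: `info_nce_terms`, the batch size, the embedding dimension, the noise level, and the temperature are all hypothetical choices. Under those assumptions every exponentiated similarity lies in [e^{-1/T}, e^{1/T}], so pos is bounded by e^{1/T}, while neg is a sum over the other batch entries and grows with the batch size.

```python
import math
import random

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce_terms(queries, keys, temperature):
    """Return per-row (pos, neg) terms of a generic InfoNCE-style loss.
    pos[i] uses the matching key i; neg[i] sums over all other keys."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    pos, neg = [], []
    for i, qi in enumerate(q):
        sims = [math.exp(dot(qi, kj) / temperature) for kj in k]
        pos.append(sims[i])
        neg.append(sum(sims) - sims[i])
    return pos, neg

# Hypothetical settings, chosen only for illustration.
random.seed(0)
batch, dim, temperature = 128, 8, 0.5
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]
# Queries are noisy copies of their matching keys.
queries = [[x + 0.3 * random.gauss(0, 1) for x in row] for row in keys]

pos, neg = info_nce_terms(queries, keys, temperature)
mean_pos = sum(pos) / batch
mean_neg = sum(neg) / batch
```

With T = 0.5, for example, pos can never exceed e^2 ≈ 7.39, whereas neg is at least (B - 1) · e^{-2} for a batch of size B. A neg value far larger than pos is therefore expected for a large batch in a loss of this form. Whether the specific numbers reported above are healthy depends on the actual batch size and temperature used.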