Dear Jianxiong,
Thanks for your attention. DOGE is great work, and we have in fact cited a preliminary version of it (the version from the NeurIPS'22 Deep RL Workshop). We will consider citing the ICLR'23 version as well.
Best regards, Yi-Chen
Thanks! It would be appreciated if you could adequately discuss our paper in the related work and compare it as a baseline in your work, considering the similarity with our paper.
Sorry for the late reply. From my perspective, DOGE and PRDC (our work) are essentially different. The centroid in DOGE is defined as $$ a_o(s) = \mathbb{E}_{a \sim Unif(\mathcal{A})}[C(s,a)\cdot a], $$ where $C(s,a) = \frac{\mu(s,a)}{\mathbb{E}_{a \sim Unif(\mathcal{A})}[\mu(s,a)]}$. That is, $a_o(s)$ is the expected action of the behavior policy at the given state $s$, i.e., the mean of the dataset actions associated with $s$. This means DOGE still regularizes the learned policy toward the behavior policy's action distribution, which, as discussed in our paper, limits generalization.
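To make the difference concrete, here is a minimal Monte-Carlo sketch of this centroid (not the official DOGE code; the callable `mu`, the box action bounds, and all names are my own illustrative assumptions):

```python
# Minimal sketch: Monte-Carlo estimate of the centroid a_o(s) under a uniform
# proposal over a box action space A. `mu` is an assumed callable returning the
# behavior density mu(s, a) for a batch of state-action pairs.
import torch

def centroid_action(mu, s, action_dim, n_samples=1024, low=-1.0, high=1.0):
    """Estimate a_o(s) = E_{a~Unif(A)}[C(s,a) * a],
    with C(s,a) = mu(s,a) / E_{a~Unif(A)}[mu(s,a)]."""
    # Draw uniform actions from the action space A.
    a = torch.empty(n_samples, action_dim).uniform_(low, high)
    s_rep = s.unsqueeze(0).expand(n_samples, -1)   # repeat the single state s
    w = mu(s_rep, a)                               # mu(s, a) for each sample
    w = w / w.mean()                               # importance weight C(s, a)
    return (w.unsqueeze(-1) * a).mean(dim=0)       # estimate of E[C(s,a) * a]
```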
PRDC, in contrast, guides the policy via the point-to-set distance $$ d^\beta_\mathcal{D}(s,a) = \min_{(\hat{s},\hat{a})\in\mathcal{D}} \left\| (\beta s)\oplus a - (\beta \hat{s})\oplus \hat{a} \right\|. $$ Therefore, the policy may be constrained toward an action that never appears with the given state $s$ in the offline dataset $\mathcal{D}$, which helps free the learned policy from the behavior policy's action distribution.
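For contrast, here is a minimal sketch of how such a point-to-set distance can be evaluated with a nearest-neighbor query over the concatenated keys $(\beta s)\oplus a$; the class and parameter names are illustrative and not necessarily what our released code uses:

```python
# Illustrative sketch: nearest-neighbor evaluation of d^beta_D(s, a)
# over dataset keys (beta * s_hat) concatenated with a_hat.
import numpy as np
from scipy.spatial import cKDTree

class PointToSetDistance:
    def __init__(self, states, actions, beta=2.0):
        # Pre-build a KD-tree over the dataset keys (beta * s_hat) ⊕ a_hat.
        self.beta = beta
        self.tree = cKDTree(np.concatenate([beta * states, actions], axis=1))

    def __call__(self, s, a):
        # min_{(s_hat, a_hat) in D} || (beta*s) ⊕ a - (beta*s_hat) ⊕ a_hat ||
        query = np.concatenate([self.beta * s, a], axis=-1)
        dist, _ = self.tree.query(query)   # Euclidean distance to nearest key
        return dist
```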
Overall, DOGE is great work, and we are glad to discuss it in the related work and compare against it as a baseline.
I will close this issue if there are no further questions.
Dear Yuhang and Yi-Chen,
This is Jianxiong, a PhD student from AIR, Tsinghua University. I recently read the paper corresponding to this repository, and I really like the idea of regularizing offline RL with a point-to-dataset distance. However, I noticed that a similar idea was explored in one of my recent works [1], which seems to have been missed in the literature review section of your paper. I would therefore like to bring my paper [1] to your attention.
In [1], we also consider incorporating the overall dataset geometry into policy learning. Specifically, we introduce a "state-conditioned distance function" as a regularization term in the policy training process. This distance function is trained with a simple regression loss and serves as an upper bound on the point-to-centroid-of-dataset distance. Please refer to our paper for more details.
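To give a quick flavor of the idea, a rough, simplified sketch is below (illustrative only; the network, loss, and names are not our released code, and the exact objective is given in the paper):

```python
# Rough sketch: a state-conditioned distance network g(s, a) fit by regression
# to the distance between a query action and a dataset action at the same state.
import torch
import torch.nn as nn

class StateConditionedDistance(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def distance_regression_loss(g, s, a_data, a_query):
    # Regress g(s, a_query) onto the distance from a_query to a dataset action at s.
    # By Jensen's inequality, E||a_query - a_data|| >= ||a_query - E[a_data]||,
    # so the fitted g upper-bounds the point-to-centroid distance.
    target = torch.norm(a_query - a_data, dim=-1)
    return ((g(s, a_query) - target) ** 2).mean()
```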
Overall, I believe that [1] is highly relevant to your work, and it would be greatly appreciated if you could include a citation to this paper in your manuscript.
[1] Li, J., Zhan, X., Xu, H., Zhu, X., Liu, J., & Zhang, Y.-Q. When Data Geometry Meets Deep Function: Generalizing Offline Reinforcement Learning. ICLR 2023. https://arxiv.org/abs/2205.11027
Best regards, Jianxiong