sfujim / BCQ

Author's PyTorch implementation of BCQ for continuous and discrete actions

Some questions about the experiments for demonstrating extrapolation error. #15

Open awecefil opened 11 months ago

awecefil commented 11 months ago

Hello, I am currently studying offline reinforcement learning and came across BCQ. It's great work, well worth delving into. However, I have some questions about the paper that I'd like to clarify, to make sure I haven't misunderstood anything. My questions may be numerous, but I genuinely want to understand the experimental details in the paper.

Here are my questions:

  1. In Figure 1, does "Off-policy DDPG" refer to DDPG trained on a fixed dataset, without any interaction with the environment? Additionally, as the benchmark for comparison, does "Behavioral" refer to DDPG trained with the normal online training process?

  2. In Figure 1, for the three experiments with different buffers, is "Final" to be understood as training a Behavioral DDPG, recording the transitions generated during its training, and then using the resulting final buffer as a fixed dataset for training Off-Policy DDPG (with no new transitions added to the buffer during that training)? And can "Concurrent" simply be understood as Off-Policy DDPG seeing transitions gradually, from early-stage to late-stage, as the Behavioral agent's training proceeds, rather than being able to sample late-stage transitions right from the beginning? (I sketch my understanding of these two settings in the code after this list.)

  3. In Figure 1, do the orange horizontal lines in (a) and (c) represent the average episode return of the Behavioral agent, computed after the complete buffer has been collected (i.e., after Behavioral training concludes)? Is that also why there is no such line in (b), because in the concurrent setting the buffer is still in the process of being collected?

  4. Based on the experiments in Figure 1, can it be understood that (1) even if offline RL uses a dataset with sufficient coverage, extrapolation error (caused by the DDPG actor selecting out-of-distribution actions) still leads to suboptimal performance; (2) even if offline RL uses the same buffer as Behavioral, the transitions in the buffer are still not generated by the offline agent's own policy, so a distribution-shift issue remains; and (3) even if offline RL is trained on expert or near-expert data, without ever encountering "bad" (early-stage) data it may fail to learn which actions should be avoided, and so perform worse than with the final and concurrent buffers?
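
To make my reading of the "final buffer" and "concurrent" settings in question 2 concrete, here is roughly how I picture the two training loops. This is only a sketch of my own understanding: `DDPG`, `ReplayBuffer`, `make_env`, `N`, and the method names below are hypothetical placeholders for illustration, not the actual classes or APIs in this repository or in the paper.

```python
# Sketch of how I understand the "final buffer" vs. "concurrent" settings.
# DDPG, ReplayBuffer, make_env, N, and the method names are hypothetical
# placeholders for illustration only, not this repo's actual API.

env = make_env()
behavioral = DDPG()   # trained online; the only agent that interacts with the environment
off_policy = DDPG()   # the "Off-policy DDPG"; never interacts with the environment
buffer = ReplayBuffer()

# --- Final buffer setting ---
# Phase 1: train the behavioral agent for N steps and record every transition it generates.
obs = env.reset()
for t in range(N):
    action = behavioral.select_action(obs)           # plus exploration noise
    next_obs, reward, done = env.step(action)
    buffer.add(obs, action, reward, next_obs, done)
    behavioral.train(buffer)                         # learns from its own growing buffer
    obs = env.reset() if done else next_obs

# Phase 2: train the off-policy agent on the now-frozen buffer.
for t in range(N):
    off_policy.train(buffer)                         # no new transitions are ever added

# --- Concurrent setting ---
# Both agents are updated at every step, but only the behavioral agent acts,
# so early in training the off-policy agent can only sample early-stage transitions.
obs = env.reset()
for t in range(N):
    action = behavioral.select_action(obs)           # plus exploration noise
    next_obs, reward, done = env.step(action)
    buffer.add(obs, action, reward, next_obs, done)
    behavioral.train(buffer)
    off_policy.train(buffer)                         # sees exactly what has been collected so far
    obs = env.reset() if done else next_obs
```

If this reading is right, then the orange line I ask about in question 3 would simply be the average episode return of the behavioral agent evaluated once the N collection steps have finished, which would explain why it can be drawn in (a) and (c) but not in (b), where collection is still ongoing.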