liuzuxin / OSRL

🤖 Elegant implementations of offline safe RL algorithms in PyTorch
https://offline-saferl.org
Apache License 2.0

Some questions about the technology used in the CDT paper #24

Closed Eternity-Wang closed 1 month ago

Eternity-Wang commented 2 months ago

Hi, can you please explain in detail the reason for relabeling infeasible target return pairs ("data augmentation by return relabeling" in the paper)? I'm quite confused about its relationship to the outlier filtering mentioned in Section D.2 of the paper.

liuzuxin commented 2 months ago

Hi @Eternity-Wang, the intuition behind it is as follows: during the agent execution procedure, the user may specify an infeasible pair of target reward return and target cost return, which can confuse the model -- should it satisfy the target reward by increasing the cost, or satisfy the target cost by reducing the reward? We want the latter, so we synthesize this kind of data to make the model safer in such cases. Regarding the outliers, those are real trajectories whose reward and cost returns happen to be outliers (which rarely occurs). Note that in the data augmentation phase we do relabeling, meaning that the augmented trajectories themselves do not have outlier reward and cost returns.
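
To make this concrete, here is a rough sketch of what return relabeling could look like. This is illustrative code, not the actual implementation in this repo; the trajectory field names `reward_return` / `cost_return`, the `cost_limit` argument, and the sampling / nearest-safe-trajectory heuristic are all assumptions made just for the example:

```python
import numpy as np

def augment_by_return_relabeling(trajectories, num_aug, cost_limit, rng=None):
    """Illustrative sketch (not the OSRL implementation): attach infeasible
    (reward-return, cost-return) targets to a nearby safe trajectory, so the
    model learns to fall back to safe behavior when asked for "too much" reward.

    Each trajectory is assumed to be a dict with scalar 'reward_return' and
    'cost_return' fields plus the usual (obs, action, ...) sequences.
    """
    rng = np.random.default_rng() if rng is None else rng
    rewards = np.array([t["reward_return"] for t in trajectories])
    costs = np.array([t["cost_return"] for t in trajectories])

    augmented = []
    for _ in range(num_aug):
        # Sample a target pair in the infeasible region: cost target within the
        # limit, reward target above anything achievable at that cost.
        cost_target = rng.uniform(0.0, cost_limit)
        feasible = costs <= cost_target
        max_feasible_reward = rewards[feasible].max() if feasible.any() else rewards.min()
        reward_target = max_feasible_reward + rng.uniform(0.0, 0.1 * abs(max_feasible_reward) + 1.0)

        # Relabel: pick a safe trajectory close to the sampled target and
        # condition it on the infeasible (reward_target, cost_target) pair.
        if feasible.any():
            idx_pool = np.flatnonzero(feasible)
            src = trajectories[idx_pool[np.argmin(np.abs(rewards[idx_pool] - reward_target))]]
        else:
            src = trajectories[int(np.argmin(costs))]
        aug = dict(src)  # shallow copy; the states and actions stay the same
        aug["reward_return"] = reward_target
        aug["cost_return"] = cost_target
        augmented.append(aug)
    return augmented
```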

Eternity-Wang commented 2 months ago

Hi @liuzuxin, thanks for your quick and helpful reply. My understanding based on your description is as follows.

  1. If the target reward return and target cost return set during the execution phase are infeasible, e.g., the target reward is greater than the reward of the RF points under a certain cost threshold, only then do we need to modify them with the help of data augmentation to ensure that safety is prioritized.
  2. To be able to use data augmentation, we need access to the training dataset at execution time; otherwise it is not possible to look for the nearest safe trajectory.
  3. Outlier filtering occurs after the dataset has been constructed but before training, whereas data augmentation occurs after training. I hope my understanding matches what you intended to express.

Ja4822 commented 2 months ago

Hi @Eternity-Wang, it seems there may have been some misunderstandings. I just wanted to clarify that both outlier filtering and data augmentation are performed prior to training. During the execution phase, neither the target reward return nor the target cost return is modified. The user sets the values for the target reward and cost return, which are then passed directly to the trained agent to evaluate its performance.
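
To illustrate the execution phase, here is a rough sketch (placeholder names, not the actual OSRL API; a Gymnasium-style env that reports the cost in `info` and an `agent.act(obs, reward_to_go, cost_to_go)` signature are assumed purely for the example):

```python
def evaluate(agent, env, target_reward_return, target_cost_return):
    """Sketch of the execution phase: the user-chosen target returns are passed
    to the trained agent unchanged; no relabeling or augmentation happens here,
    since those were done before training."""
    obs, info = env.reset()
    episode_reward, episode_cost, done = 0.0, 0.0, False
    while not done:
        # Condition the agent on the remaining reward/cost targets (a common
        # decision-transformer-style convention, assumed here for illustration).
        action = agent.act(obs,
                           target_reward_return - episode_reward,
                           target_cost_return - episode_cost)
        obs, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        episode_cost += info.get("cost", 0.0)
        done = terminated or truncated
    return episode_reward, episode_cost
```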

Eternity-Wang commented 2 months ago

Thank you for your helpful response; I now understand the stages at which outlier filtering and data augmentation are performed and their respective roles. But I still have some questions about the data augmentation, and I hope you can help me better understand the idea and insight you want to convey:

  1. Why does it enable the policy to be safety-first?
  2. Can data augmentation be interpreted as a modification of the original RF value at a certain cost threshold k? The algorithm seems to add augmented trajectories in which the target cost return is equal to the cost threshold k but the target reward return is greater than the original RF value.