Hi @Eternity-Wang, the intuition behind it is this: during agent execution, the user may specify an infeasible pair of target reward return and cost return, which can confuse the model -- should it satisfy the target reward by increasing the cost, or satisfy the target cost by reducing the reward? We want the latter, so we synthesize this kind of data to make the model safer in such cases. Regarding the outliers, they are real trajectories whose reward and cost returns happen to be outliers among those sampled by the behavior policy (they rarely occur). Note that in the data augmentation phase we do relabeling, meaning that the augmented trajectories themselves do not have outlier reward and cost returns.
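For anyone following along, a rough sketch of what such return relabeling could look like (this is an illustrative approximation, not the exact implementation; the function name, the trajectory dictionary keys, and the sampling scheme are all assumptions):

```python
import numpy as np

def augment_infeasible_targets(trajectories, num_aug, rng=None):
    """Sketch: synthesize infeasible (reward, cost) target pairs and attach
    them to safe trajectories, so the model learns to respect the cost
    budget when the requested reward is unreachable."""
    rng = np.random.default_rng() if rng is None else rng
    reward_returns = np.array([t["reward_return"] for t in trajectories])
    cost_returns = np.array([t["cost_return"] for t in trajectories])

    augmented = []
    for _ in range(num_aug):
        # Pick a cost budget within the range seen in the data.
        target_cost = rng.uniform(cost_returns.min(), cost_returns.max())
        feasible = cost_returns <= target_cost
        if not feasible.any():
            continue
        # An infeasible reward target: higher than anything actually
        # achieved within this cost budget in the dataset.
        max_feasible_reward = reward_returns[feasible].max()
        target_reward = rng.uniform(max_feasible_reward,
                                    reward_returns.max() + 1e-6)

        # Relabel the best *feasible* trajectory with the infeasible target
        # pair: "when asked for the impossible, behave like the safest
        # high-reward trajectory that still meets the budget."
        safe_idx = np.flatnonzero(feasible)
        best = safe_idx[np.argmax(reward_returns[safe_idx])]
        aug = dict(trajectories[best])        # shallow copy of the trajectory
        aug["reward_return"] = target_reward  # relabeled conditioning targets
        aug["cost_return"] = target_cost
        augmented.append(aug)
    return augmented
```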
Hi @liuzuxin, thanks for your quick and helpful reply. My understanding based on your description is as follows.
Hi @Eternity-Wang, it seems there may have been some misunderstanding. To clarify: both outlier filtering and data augmentation are performed prior to training. During the execution phase, neither the target reward nor the target cost return is modified. The user sets the values for the target reward and cost return, and they are passed directly to the trained agent to evaluate its performance.
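To make the execution-phase point concrete, here is a minimal decision-transformer-style evaluation loop (a sketch only, under assumed interfaces: `agent.act`, the `info["cost"]` field, and the Gymnasium-style `env.step` API are not necessarily the library's actual interface):

```python
def evaluate(agent, env, target_reward, target_cost, max_steps=1000):
    """Roll out the trained agent with user-chosen targets; the targets are
    only decremented as reward/cost accumulate, never filtered or relabeled."""
    obs, _ = env.reset()
    reward_to_go, cost_to_go = float(target_reward), float(target_cost)
    total_reward, total_cost = 0.0, 0.0
    for _ in range(max_steps):
        # The agent conditions on the remaining reward/cost targets.
        action = agent.act(obs, reward_to_go, cost_to_go)
        obs, reward, terminated, truncated, info = env.step(action)
        cost = info.get("cost", 0.0)
        total_reward += reward
        total_cost += cost
        reward_to_go -= reward
        cost_to_go -= cost
        if terminated or truncated:
            break
    return total_reward, total_cost
```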
Thank you for your helpful response. I think I now understand the stages at which outlier filtering and data augmentation are applied and the roles they play. But I still have some questions about the data augmentation; I hope you can help me better understand the idea and insight you want to convey:
Hi, could you please explain in detail the reason for relabeling infeasible target return pairs ("Data augmentation by return relabeling" in the paper)? I'm very confused about its relationship with outlier filtering, which is mentioned in Section D.2 of the paper.