Closed: trevormcinroe closed this issue 2 months ago.
Hi, we rescaled the reward from -1/0 to -5/+5, but it is still sparse in the sense that each trajectory contains only two values: a positive reward and a negative reward. We found that rescaling the reward can make CQL/Cal-QL train more stably.
In the paper, the Adroit environments are described as being sparse: "The agent obtains a sparse binary +1/0 reward if it succeeds in solving the task".
However, the code actually makes them non-sparse because of `reward_scale` and `reward_bias` in https://github.com/nakamotoo/Cal-QL/blob/main/scripts/run_adroit.sh#L30-L31. Is the code incorrect?
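For reference, a minimal sketch of what the two flags would do, assuming they are applied as the usual affine transform `reward_scale * r + reward_bias` (the function name and the numeric values below are illustrative, chosen only to reproduce the -1/0 to -5/+5 mapping described above, not taken from the script):

```python
import numpy as np

# Hypothetical helper illustrating the affine reward rescaling.
# Assumption: reward_scale and reward_bias are applied as
# reward_scale * r + reward_bias; the values here are illustrative.
def rescale_reward(r, reward_scale=10.0, reward_bias=5.0):
    """Map a sparse -1/0 reward to -5/+5; the result is still two-valued."""
    return reward_scale * r + reward_bias

sparse_rewards = np.array([-1.0, 0.0, 0.0, -1.0])
print(rescale_reward(sparse_rewards))  # [-5.  5.  5. -5.]
```

Because the transform is affine, it only shifts and scales the magnitudes; the reward stays two-valued, which is the sense in which it remains "sparse" in the reply above.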