ngthanhtin closed this issue 2 years ago
Hi,
the reward is computed here:
```cpp
bool VisionEnv::computeReward(Ref<Vector<>> reward) {
  // ---------------------- reward function design
  // - compute collision penalty
  Scalar collision_penalty = 0.0;
  size_t idx = 0;
  for (size_t sort_idx : sort_indexes(relative_pos_norm_)) {
    if (idx >= visionenv::kNObstacles) break;

    // clamp the distance: use the measured value only if it lies inside
    // (0, max_detection_range_), otherwise fall back to the maximum range
    Scalar relative_dist =
      (relative_pos_norm_[sort_idx] > 0) &&
          (relative_pos_norm_[sort_idx] < max_detection_range_)
        ? relative_pos_norm_[sort_idx]
        : max_detection_range_;

    const Scalar dist_margin = 0.5;
    if (relative_pos_norm_[sort_idx] <=
        obstacle_radius_[sort_idx] + dist_margin) {
      // compute distance penalty
      collision_penalty += collision_coeff_ * std::exp(-1.0 * relative_dist);
    }
    idx += 1;
  }

  // - tracking a constant linear velocity
  Scalar lin_vel_reward =
    vel_coeff_ * (quad_state_.v - goal_linear_vel_).norm();

  // - angular velocity penalty, to avoid oscillations
  const Scalar ang_vel_penalty = angular_vel_coeff_ * quad_state_.w.norm();

  // change progress reward as survive reward
  const Scalar total_reward =
    lin_vel_reward + collision_penalty + ang_vel_penalty + survive_rew_;

  // return all reward components for debug purposes
  // only the total reward is used by the RL algorithm
  reward << lin_vel_reward, collision_penalty, ang_vel_penalty, survive_rew_,
    total_reward;
  return true;
}
```
This is called the step (or stage) reward.
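To make the collision term of the step reward easier to check by hand, here is a minimal stand-alone sketch of the penalty for a single obstacle. The function name `collisionPenaltyTerm`, its parameters, and the use of `double` are assumptions for illustration only; they are not part of `VisionEnv`.

```cpp
#include <cmath>

// Hypothetical helper, not part of VisionEnv: penalty contribution of one
// obstacle. `dist` is the measured distance to the obstacle, `coeff` plays
// the role of collision_coeff_, and `margin` mirrors dist_margin.
double collisionPenaltyTerm(double dist, double obstacle_radius,
                            double max_range, double coeff,
                            double margin = 0.5) {
  // No penalty when the obstacle is farther away than radius + margin.
  if (dist > obstacle_radius + margin) return 0.0;

  // Use the measured distance only if it lies inside (0, max_range);
  // otherwise fall back to max_range.
  const double clamped = (dist > 0.0 && dist < max_range) ? dist : max_range;

  // Exponential penalty: grows in magnitude as the obstacle gets closer.
  return coeff * std::exp(-1.0 * clamped);
}
```

With a negative `coeff`, the term is zero outside the margin and becomes more negative as the distance shrinks.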
The terminal reward is computed differently here depending on the terminal condition:
```cpp
bool VisionEnv::isTerminalState(Scalar &reward) {
  // simulation time out
  if (cmd_.t >= max_t_ - sim_dt_) {
    reward = 0.0;
    return true;
  }

  // world bounding box check
  // - x, y, and z
  const Scalar safty_threshold = 0.1;
  bool x_valid = quad_state_.p(QS::POSX) >= world_box_[0] + safty_threshold &&
                 quad_state_.p(QS::POSX) <= world_box_[1] - safty_threshold;
  bool y_valid = quad_state_.p(QS::POSY) >= world_box_[2] + safty_threshold &&
                 quad_state_.p(QS::POSY) <= world_box_[3] - safty_threshold;
  bool z_valid = quad_state_.x(QS::POSZ) >= world_box_[4] + safty_threshold &&
                 quad_state_.x(QS::POSZ) <= world_box_[5] - safty_threshold;
  if (!x_valid || !y_valid || !z_valid) {
    reward = -1.0;
    return true;
  }
  return false;
}
```
Thanks, I got it: the terminal reward also has to be added to 'r'. But even then, summing over many steps leaves an error of around 0.01.
Hi, when I print the info returned at the end of an episode, I see this problem: as you can see, the total reward 'r' is not equal to the sum of the 4 reward components. Why does this happen? Can you explain in more detail, please?