m-abr / FCPCodebase

FC Portugal Codebase
GNU General Public License v3.0

Question about moving average in Basic_Run.py #24

Closed by a7744hsc 1 month ago

a7744hsc commented 1 month ago

I am learning your codebase and have a question about line 147 in Basic_Run.py:

self.act = 0.4 * self.act + 0.6 * action

After smoothing the action, the rewards and observations received do not 100% reflect the original action. Will this influence the training performance? What is the intention behind using this smoothing method here?

Thanks very much.

m-abr commented 1 month ago

"What's the intention we use this smooth method here?"

The exponential moving average (EMA) applied here is meant to smooth the action. The idea is to initially guide the policy towards more stable and realistic behaviors, often leading to better performance. Over time, however, the policy can still learn non-smooth behaviors by increasing the magnitude of the actions, so this does not overly restrict the learning process in the long run.
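
As a rough sketch of what that line does (the 0.4/0.6 coefficients are the ones from Basic_Run.py; the class and variable names here are only illustrative, not the actual code):

```python
import numpy as np

class EMAActionFilter:
    """Exponential moving average over raw policy actions.

    filtered_t = alpha * filtered_{t-1} + (1 - alpha) * raw_t
    With alpha = 0.4 this is equivalent to the line in Basic_Run.py:
        self.act = 0.4 * self.act + 0.6 * action
    """

    def __init__(self, action_dim, alpha=0.4):
        self.alpha = alpha
        self.filtered = np.zeros(action_dim, dtype=np.float32)

    def reset(self):
        # Start each episode with a neutral filtered action
        self.filtered[:] = 0.0

    def __call__(self, raw_action):
        self.filtered = self.alpha * self.filtered + (1.0 - self.alpha) * np.asarray(raw_action, dtype=np.float32)
        return self.filtered
```

Note that the filter delays the action rather than bounding it: if the policy keeps outputting a large raw action, the filtered value converges to it within a few steps, which is why the policy can still learn non-smooth behaviors by increasing the action magnitude.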

"the rewards and obs received are not 100% reflect the original action"

You can think of the EMA filter as a hardware constraint, rather than a restriction imposed by software. Modifying the action prior to applying it to the robot is a common practice in RL. The key is that this modification must be deterministic for the policy: if we start in state S, and the policy outputs action A, we can modify A with an arbitrary function f(A)=A′, as long as A′ is uniquely determined by both S and A. But how can A′ be deterministic if it depends on the last filtered action? The last filtered action is included in S, represented by the current joint positions and speeds (lines 68-69 in Basic_Run.py, r.joints_position[2:22] and r.joints_speed[2:22]). This allows the policy to 'know' how f will affect action A for any given state S.
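
To make the determinism argument concrete, you can view the filtered action as a pure function of (S, A). This is only an illustrative sketch: the explicit prev_filtered_action field is hypothetical, while in the real observation that information is carried implicitly by the joint positions and speeds.

```python
import numpy as np

ALPHA = 0.4  # smoothing coefficient, as in Basic_Run.py

def filtered_action(state, raw_action):
    """Pure function A' = f(S, A): the same (S, A) always yields the same A'.

    Here 'state' explicitly stores the previous filtered action for
    clarity; in practice the policy can infer it from
    r.joints_position[2:22] and r.joints_speed[2:22], which reflect
    where the previously applied (filtered) commands left the joints.
    """
    prev_filtered = np.asarray(state["prev_filtered_action"], dtype=np.float32)
    return ALPHA * prev_filtered + (1.0 - ALPHA) * np.asarray(raw_action, dtype=np.float32)
```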

"Will this influence the training performance"

I think I already addressed this question, but I want to emphasize that experimentation is always the best way to draw conclusions in RL, as the results can sometimes be counterintuitive.

I hope this makes sense, but if you have any questions or need more details, feel free to ask.

a7744hsc commented 1 month ago

Thanks for your detailed explanation. I also ran a simple test without EMA, and the results were not good.

Another question I have is that the observation we get at each step actually corresponds to the previous step's action (due to a restriction of rcssserver3d). Do you think this is an issue for the current training scripts?

m-abr commented 1 month ago

That's right. Unfortunately, this is an issue we can't fully circumvent. However, the low-level controller that applies the actions does try to predict the current position by adding the last action to the last position received from the server. This helps reduce the discrepancy, but the observations are still out of sync by one step. Based on the current observations (which include the joint positions and velocities, as I mentioned earlier), the policy has to learn to predict a latent representation of the next state and output an action based on that.

That said, I believe the impact on training is negligible overall, since the prediction horizon is 0.02 seconds.
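
As a very simplified sketch of that prediction step (the function and argument names are hypothetical, not the actual low-level controller code):

```python
import numpy as np

STEP_DT = 0.02  # server step in seconds, i.e. the one-step prediction horizon

def predict_current_positions(last_received_positions, last_commanded_delta):
    """Estimate where the joints are right now, given that the server's
    observation lags by one step: take the last joint positions received
    from the server and add the change commanded by the last action.
    """
    return np.asarray(last_received_positions) + np.asarray(last_commanded_delta)
```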

a7744hsc commented 1 month ago

Thanks for sharing.