jc-bao / policy-adaptation-survey

This repository compares prevailing adaptive control methods from both the control and learning communities.
Apache License 2.0

Expert policy in constant wind is not as good as expected. #3

Closed jc-bao closed 1 year ago

jc-bao commented 1 year ago

The expert policy's final performance is poor under the disturbance.

Curve plot visualization: [three attached images]

The expected error is 0.02m, while the current error is 0.1m. The learned policy sometimes just hovers at a non-zero point.

jc-bao commented 1 year ago
[two attached images]
jc-bao commented 1 year ago
[attached image]
jc-bao commented 1 year ago
[attached image]
jc-bao commented 1 year ago

Possible explanation: entanglement of the mass and the disturbance.

When the mass is fixed: the controller only needs to learn a constant compensation term.

When the mass varies: the controller needs to adapt to different force directions. The decay force coefficient might also differ across episodes.
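A minimal sketch of this entanglement, assuming a simple hover model where the required thrust is gravity compensation plus the wind force (the masses and wind value below are made up for illustration): with a fixed mass the policy can memorize one constant, but with a varying mass the same wind implies a different required thrust each episode.

```python
import numpy as np

def hover_thrust(mass, wind_force, g=9.81):
    """Upward force needed to hover under a constant wind disturbance."""
    return mass * g + wind_force

# Fixed mass: the compensation term is a single constant the policy can memorize.
fixed = hover_thrust(0.03, wind_force=0.01)

# Varying mass: the same wind now requires a different thrust per episode, so a
# non-adaptive policy can only output a compromise value between these targets.
varying = np.array([hover_thrust(m, wind_force=0.01) for m in (0.02, 0.03, 0.04)])
```

A robust (non-adaptive) policy that outputs something near the mean of `varying` will hover with a steady-state offset, which matches the observed convergence to a point near, but not at, the origin.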

It is still hard to explain why the policy tends to converge to a point close to the origin rather than to the true origin. Perhaps the policy has learned a robust control policy rather than an adaptive one. This would also explain why the margin between the expert policy and the vanilla policy is so small. The result does not conflict with the Drone RMA paper, since they do not take a time-varying constant force into consideration. Given the drone's mass and force scale, our result still aligns with their conclusion.

If this is true, our next research question could be: how can we embed the system's dynamics information effectively, so that the policy learns to use this extra information rather than collapsing into a robust control policy?
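One way to check whether a trained π(x, e) actually uses the dynamics embedding is a finite-difference sensitivity probe: perturb e and measure how much the action moves. The helper below is a hypothetical diagnostic, not part of the repository; `policy` is any callable `(obs, e) -> action`.

```python
import numpy as np

def sensitivity_to_embedding(policy, obs, e, eps=1e-3):
    """Finite-difference sensitivity of the action to the environment
    embedding e. A value near zero means the policy ignores e, i.e. it has
    collapsed into a robust (non-adaptive) controller."""
    base = policy(obs, e)
    grads = []
    for i in range(len(e)):
        e_pert = e.copy()
        e_pert[i] += eps
        grads.append((policy(obs, e_pert) - base) / eps)
    return np.linalg.norm(np.stack(grads))
```

For a toy linear policy `a = W_e @ e` this returns the Frobenius norm of `W_e` (finite differences are exact for linear maps), so tracking this quantity during training would show whether the embedding pathway is being used or zeroed out.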

jc-bao commented 1 year ago
Curve / Policy Plot / Policy Visualization for π(x) and π(x, e): [attached images]

After padding zeros to the vanilla policy's input:

[attached image]

The performance is even worse. (Could this just be variance?)
