jc-bao commented 1 year ago

This issue is used to keep a record of the research emphasis and engineering effort of each research period.

jc-bao commented 1 year ago

:one: Identify research questions

possible issues:

multi-dynamic model-free RL: (not very important)
- how to better integrate information
- performance given residue dynamics.
- guarantee given new environment -> model-based approach, understand the optimization process.
:star: Inference-based adaptation module:
- Generalizability given model mismatch (OOD) -> introduce another optimization method
- Feedback given imprecise estimation -> introduce a method to identify the problem

jc-bao commented 1 year ago

Meeting@2023.1.17

Meeting topic: possible research questions and next step plan.
Meeting notes
- Task: sim2lab2real
- Description:
  - In the sim, learn the basic policies.
  - In the lab, learn to generalize.
- Possible issues:
  - How to distinguish lab from sim (why not access the sim directly?)
  - How to adapt in the real world.
  - Difference between sim and lab in terms of z:
    - parametric OOD: mainly an expert issue (the adaptor is successful).
    - non-parametric OOD: how to define the residue model (tried method: dropout; no gain, performance is the same).
- Proposed method:
  - Inference-based method (like the RNN policy, RMA, or deep linear representation) for fast adaptation
    - Approach: RMA for environment parameter inference
    - Possible issue: model mismatch leads to imprecise parameter prediction.
  - Optimization-based method to compensate for model mismatch
    - Approach 1: variable $w$
      - Learn an input parameter $w$ to control policy behavior. Optimize $w$ with BO or a random search algorithm for better expert performance.
      - Possible issue: entanglement of the optimization parameter $w$ and the inference variable $z$.
    - Approach 2: gradient-based methods (like MAML)
      - Classical meta-learning approach
      - Possible issue: sample efficiency.
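
Approach 1 (optimizing $w$ with random search) can be sketched as below. This is a minimal illustration, not the project's implementation: `tracking_error` is a hypothetical stand-in for rolling out the policy conditioned on $w$ and measuring its tracking error, and the toy quadratic cost and its optimum are made up for the example.

```python
import random

def tracking_error(w):
    # Hypothetical stand-in for a policy rollout: a toy quadratic bowl
    # whose (made-up) optimum is w* = (0.3, -0.5).
    target = (0.3, -0.5)
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

def random_search(cost, dim, n_iters=200, low=-1.0, high=1.0, seed=0):
    """Sample w uniformly in [low, high]^dim and keep the best candidate."""
    rng = random.Random(seed)
    best_w, best_cost = None, float("inf")
    for _ in range(n_iters):
        w = [rng.uniform(low, high) for _ in range(dim)]
        c = cost(w)
        if c < best_cost:
            best_w, best_cost = w, c
    return best_w, best_cost

best_w, best_cost = random_search(tracking_error, dim=2)
print(best_w, best_cost)
```

Because the search only needs cost evaluations, the same loop works when the cost comes from real rollouts; the price is one rollout per candidate, which is where BO would be more sample efficient.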

jc-bao commented 1 year ago

Meeting@2023.1.28

About the environment:
- Is our residue dynamic too hard? (very sensitive to u)
- How to define real and sim environments. (Of course, we can use parametric OOD, but the main issue is expert performance, and the policy is hard to control with a single policy. )
- Can the sim access the residue model?
- How to define the residue model?
- How to distinguish between the two environments?
About methods:
- Input variable w method: Hard to use embeddings to change policy under the current setting.
- Gradient-based method: how to identify our contribution?

TODO

- [ ] Next step: w interpolation OOD
- [ ] Study e -> z mapping (non-OOD case) (with f(x,u,w) and f(x,w))

jc-bao commented 1 year ago

:no_entry: Vector: expert performance under residue dynamics

Research question: how to enable a model-free method to achieve better performance given complex dynamics?

Expected result: better tracking performance in residue dynamics.

❌ Velocity: Curriculum

assumption: the robot learns a robust policy that is agnostic to different residue dynamics.
description: first learn without the periodic disturbance, then add it back.
results: the same (11 cm error)
conclusion: the curriculum cannot solve the problem. The main issue still comes from the suboptimal strategy under partially random dynamics.
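
A curriculum of this kind can be sketched as a schedule that gates the disturbance by training step. Everything specific here is an assumption for illustration: the notes only say the disturbance is removed first and added back later, so the linear ramp, the sinusoidal form, and all names and constants are hypothetical.

```python
import math

def disturbance_scale(step, warmup_steps=1000, ramp_steps=1000):
    """Return 0 during warm-up, then ramp linearly up to 1 (assumed schedule)."""
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / ramp_steps)

def periodic_disturbance(t, step, amplitude=1.0, period=2.0):
    """Sinusoidal disturbance at time t, gated by the curriculum scale."""
    return disturbance_scale(step) * amplitude * math.sin(2.0 * math.pi * t / period)

# No disturbance early in training, full disturbance once the ramp is complete.
print(periodic_disturbance(0.5, step=0), periodic_disturbance(0.5, step=5000))
```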

:warning: Deprecated because this case might not match the real world (the dynamics are too hand-engineered).

:eye_speech_bubble: Hindsight: wrong track. Our research question is to achieve high-performance adaptive control, not better model-free performance.

jc-bao commented 1 year ago

:stop_button: Vector: adaptor performance given an imperfect model

Question: how to enable the adaptor to generalize to unseen scenarios?

❌ Velocity: Soft update adaptor

assumption:
- the expert embedding might be unobservable -> the policy also needs to be updated during the adaptation stage.
- the adaptor module is not involved during training (its observability is not considered either) -> regularize the adaptor during training.
- the compressor is not uncertainty-aware -> imitate PPO and add uncertainty (i.e. log_std).
description:
- Training:
  - [TRA] regularize the adaptor -> take the adaptation module into account during training
    - :eye_speech_bubble: Actually cannot affect the compressor. The implementation needs to be checked.
  - [TUC] uncertainty in the compressor
    - :eye_speech_bubble: Hand-coded uncertainty. The implementation needs to be checked.
- Adapting:
  - [AOAC] optimize the actor/critic -> the actor might not be optimal
  - [ARC] regularize the compressor (0.0, 1.0, 10.0) -> instead of only mimicking the expert module, we can also update the policy
    - :eye_speech_bubble: Check the implementation.
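
To make the uncertainty idea concrete, the sketch below contrasts the plain L2 loss to $z$ with a Gaussian negative log-likelihood that uses a per-dimension `log_std`, in the spirit of PPO's Gaussian policy head. It is an illustration under assumptions: `mean` and `log_std` are hypothetical compressor outputs, and the notes do not say which exact loss was used.

```python
import math

def l2_loss(z_hat, z):
    """Mean squared error between the predicted and the expert embedding."""
    return sum((a - b) ** 2 for a, b in zip(z_hat, z)) / len(z)

def gaussian_nll(mean, log_std, z):
    """Mean per-dimension negative log-likelihood of z under N(mean, exp(log_std)^2)."""
    total = 0.0
    for m, ls, target in zip(mean, log_std, z):
        var = math.exp(2.0 * ls)
        total += 0.5 * (target - m) ** 2 / var + ls + 0.5 * math.log(2.0 * math.pi)
    return total / len(z)

z = [0.2, -0.1]
print(l2_loss([0.0, 0.0], z), gaussian_nll([0.0, 0.0], [0.0, 0.0], z))
```

With `log_std = 0` the likelihood term reduces to half the squared error plus a constant, so the two losses only differ once the predicted spread moves away from 1.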
Results

Origin (L2 loss to z)

Train(policy=expert, ) Adapt(policy=adaptor, ) Perfect model

| Method | Expert | Adapt begin | Adapt end |
| --- | --- | --- | --- |
| Baseline | 0.0444 | 0.0518 | 0.0395 |
| ARC0 | 0.0444 | 1.3167 | 0.3180 |
| ARC1 | 0.0444 | 0.7872 | 0.1423 |
| ARC10 | 0.0444 | 0.0770 | 0.0363 |
| ARC50 | 0.0444 | 0.0572 | 0.0384 |
| TUC-1 | 0.1663 | 0.1810 | 0.1635 |
| TUC-3 | 0.1229 | 0.0784 | 0.0961 |

Imperfect model

| Method | Expert | Adapt begin | Adapt end |
| --- | --- | --- | --- |
| Baseline | 0.5054 | 0.1674 | 0.2007 |
| Baseline-Adapt | 0.0379 | 0.0435 | 0.0416 |
| TUC-3 | 0.5338 | 0.2005 | 0.2476 |
| TUC-3-Adapt | 0.0453 | 0.0514 | 0.0470 |

Conclusion: none of these methods work. Need to identify the true limitation before diving into details. The adaptor performance is good enough given different dynamics.

:eye_speech_bubble: Hindsight: wrong track. We dived into details before identifying the true research question.

▶️ Vector: identify the adaptor module problem

Question: under what circumstances can we observe a significant performance drop for the RMA algorithm?

🔧 Velocity1: Unobservable parameters.

Polynomial residue dynamics

Description

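$f(v,w) = x^T M x + C$, where $x=[v, w]$, $v \in \mathbb{R}^3$, $w \in \mathbb{R}^{d_w}$ with $w \sim \mathcal{U}(-1,1)$, $C \in \mathbb{R}^3$ with $C \sim \mathcal{U}(-1,1)$, and $M \in \mathbb{R}^{3 \times 4 \times 4}$ with $M \sim \mathcal{U}(-1,1)$. The $4 \times 4$ slices of $M$ match $x \in \mathbb{R}^{3+d_w}$ for $d_w=1$.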
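
The quadratic residue force can be sketched in plain Python as follows. This is an illustrative reading of the formula, not the project's code: the function names are hypothetical, and sizing each slice of $M$ as $(3+d_w) \times (3+d_w)$ is an assumption that generalizes the $3 \times 4 \times 4$ shape given for $d_w=1$.

```python
import random

def sample_polynomial_residue(d_w, seed=0):
    """Draw M, C and w elementwise from U(-1, 1)."""
    rng = random.Random(seed)
    n = 3 + d_w  # length of x = [v, w]
    M = [[[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)] for _ in range(3)]
    C = [rng.uniform(-1, 1) for _ in range(3)]
    w = [rng.uniform(-1, 1) for _ in range(d_w)]
    return M, C, w

def residue_force(v, w, M, C):
    """Evaluate f_k = x^T M_k x + C_k for each of the three force components."""
    x = list(v) + list(w)
    force = []
    for k in range(3):
        quad = sum(x[i] * M[k][i][j] * x[j]
                   for i in range(len(x)) for j in range(len(x)))
        force.append(quad + C[k])
    return force

M, C, w = sample_polynomial_residue(d_w=1)
print(residue_force([0.1, 0.0, -0.2], w, M, C))
```

Note that at $v = 0$ and $w = 0$ the force equals the constant offset $C$.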

Results

Last 10 steps average tracking error.	$d_w=2$ Expert	$d_w=2$ RMA before adaptation	$d_w=2$ RMA after adaptation
0.027	0.065	0.029	0.147
$d_w=1$ Expert	$d_w=1$ RMA before adaptation	$d_w=1$ RMA after adaptation
0.063	0.101	0.077
$d_w=1$ C-4 Expert	$d_w=1$ C-4 RMA before adaptation	$d_w=1$ C-4 RMA after adaptation
0.067	0.097	0.074
$d_w=0$ Expert	$d_w=0$ RMA before adaptation	$d_w=0$ RMA after adaptation
0.008	0.028	0.007

*C-4: use MLP to compress all parameters to a 4-dimensional embedding.
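
For reference, a "last 10 steps average tracking error" can be computed as sketched below. The Euclidean position error is an assumption, since the notes do not define the error norm, and the function name is hypothetical.

```python
import math

def last_k_tracking_error(positions, targets, k=10):
    """Average Euclidean distance to the target over the final k steps."""
    errors = [math.dist(p, t) for p, t in zip(positions, targets)]
    tail = errors[-k:]  # uses all steps if the episode is shorter than k
    return sum(tail) / len(tail)

# Toy episode: the drone drifts along x while the target stays at the origin.
positions = [(float(i), 0.0, 0.0) for i in range(20)]
targets = [(0.0, 0.0, 0.0)] * 20
print(last_k_tracking_error(positions, targets))
```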

Conclusion:
- The introduction of random parameter $w$ makes the task much harder.
- This residue dynamic is unsuitable since the RMA performs very well compared to the expert policy.
- Expert policy given parameter $w$ could be further improved.

MLP $f(v, w)$ residue dynamics

Description

[128, 128] MLP initialized with `nn.init.orthogonal_(m.weight, gain=1)` and `nn.init.uniform_(m.bias, -0.2, 0.2)`.
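
A minimal PyTorch sketch of such a randomly initialized residue network is shown below. Only the hidden sizes and the two init calls come from the description above; the `Tanh` activation, the input layout `[v, w]`, the 3-dimensional output, and the helper name `make_residue_mlp` are assumptions made for illustration.

```python
import torch
from torch import nn

def make_residue_mlp(in_dim, out_dim=3, hidden=(128, 128), bias_range=0.2):
    """Build a fixed random MLP and initialize it as described above."""
    layers, prev = [], in_dim
    for width in hidden:
        layers += [nn.Linear(prev, width), nn.Tanh()]  # activation is an assumption
        prev = width
    layers.append(nn.Linear(prev, out_dim))
    mlp = nn.Sequential(*layers)
    for m in mlp.modules():
        if isinstance(m, nn.Linear):
            nn.init.orthogonal_(m.weight, gain=1)
            nn.init.uniform_(m.bias, -bias_range, bias_range)
    return mlp

torch.manual_seed(0)
d_w = 1
f = make_residue_mlp(in_dim=3 + d_w)
v, w = torch.zeros(3), torch.rand(d_w) * 2 - 1  # w ~ U(-1, 1)
with torch.no_grad():
    force = f(torch.cat([v, w]))                # residue force, shape (3,)
print(force)
```

The non-zero bias range means the network outputs a non-zero force even for zero input, which is consistent with the later remark that a non-zero-bias MLP raises the sensitivity of the residue dynamics.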

Results

| Setting | Expert | RMA before adaptation | RMA after adaptation | Vanilla (Robust) |
| --- | --- | --- | --- | --- |
| $d_w=1$ | 0.0141 | 0.0294 | 0.0161 | 0.0354 |
| $d_w=1$ C-4 | 0.0169 | 0.0183 | 0.0164 | |

Conclusion:
- basically the same as the quadratic version.
- Why is the performance still relatively good? A: When the drone is near the stable state, its velocity is relatively small, so the residue force converges to a fixed value. Consequently, the agent only needs to infer that fixed force from the observations. We might observe a significant performance drop when the action is added to the residue dynamics, $f(v,u,w)$. More results in the next section.

MLP $f(v, u,w)$ residue dynamics

Results:

low sensitivity to u

| Setting | Expert | RMA before adaptation | RMA after adaptation | Vanilla (Robust) |
| --- | --- | --- | --- | --- |
| $d_w=0$ | 0.0189 | 0.0215 | 0.0183 | 0.0640 |
| $d_w=1$ | 0.0386 | 0.0422 | 0.0274 | |
| $d_w=1$ C-4 | 0.0103 | 0.0120 | 0.0092 | |

high sensitivity to u (by using a non-zero-bias MLP)

| Setting | Expert | RMA before adaptation | RMA after adaptation | Vanilla (Robust) |
| --- | --- | --- | --- | --- |
| $d_w=0$ | 0.0287 | 0.0340 | 0.0296 | 0.0586 |
| $d_w=1$ | 0.0687 | 0.0449 | 0.0331 | |
| $d_w=1$ C-4 | 0.0367 | 0.0331 | 0.0269 | |

high sensitivity + force scale *2

| Setting | Expert | RMA before adaptation | RMA after adaptation |
| --- | --- | --- | --- |
| $d_w=1$ | 0.0687 | 0.0449 | 0.0331 |
| $d_w=1$ C-4 | 0.0367 | 0.0331 | 0.0269 |

Conclusion:
- The policy becomes unstable in this case.
- But the tracking performance is not bad.
- The compressor becomes important in unobservable cases.
Interesting observation: the network can handle the unobservable parameters pretty well.

🔧 Velocity2: Parameter OOD Cases.

Extrapolation OOD case

Description: train on the left 70% of the parameter range; test on the full range.
Result:

| Setting | Expert | RMA before adaptation | RMA after adaptation |
| --- | --- | --- | --- |
| OOD | 0.2074 | 0.4166 | 0.2110 |
| w/o OOD | 0.0687 | 0.0449 | 0.0331 |
| C-4 OOD | 0.3596 | 0.3225 | 0.3516 |
| C-4 w/o OOD | 0.0629 | 0.0482 | 0.0403 |

	w/o OOD	OOD
mass-decay
decay-param
param-mass

Updated results

| Policy | Expert | Before adaptation | After adaptation |
| --- | --- | --- | --- |
| Baseline | 0.2287 | nan | nan |
| Baseline-OOD (full dynamic param) | 0.2567 | nan | nan |
| Baseline-OOD | 0.3688 | nan | nan |
| RMA | 0.0945 | 0.1784 | 0.1455 |
| RMA-OOD (full dynamic param) | 0.1448 | 0.2841 | 0.2186 |
| RMA-OOD | 0.2119 | 0.3287 | 0.2714 |

Conclusion
- RMA cannot handle parameter OOD cases very well (mainly an expert issue).
TODO
- [ ] Compare with other meta-learning approaches.

Interpolation OOD case

| Policy | Expert | Before adaptation | After adaptation |
| --- | --- | --- | --- |
| No OOD | 0.1183 | 0.2247 | 0.1317 |
| intra OOD full | 0.1883 | 0.3352 | 0.2461 |
| extra OOD full | 0.2446 | 0.4018 | 0.3160 |
| only intra res dyn | 0.1008 | 0.2032 | 0.1161 |
| only extra res dyn | 0.1134 | 0.2205 | 0.1313 |

Conclusion: performance in the interpolation case is much better than in the extrapolation case.
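
The two OOD splits can be sketched as parameter samplers. The extrapolation split (train on the left 70% of the range, test on the full range) follows the description above; the interpolation split is an assumption, modeled here as training on the two outer parts of the range and holding out the centre. Function and mode names are hypothetical.

```python
import random

def sample_param(rng, low, high, mode, train_frac=0.7):
    """Sample one scalar parameter for the given train/test split."""
    span = high - low
    if mode == "full":   # test: the whole range
        return rng.uniform(low, high)
    if mode == "left":   # extrapolation training set: left 70% of the range
        return rng.uniform(low, low + train_frac * span)
    if mode == "outer":  # interpolation training set (assumed): centre held out
        gap = (1.0 - train_frac) * span
        left_end = low + 0.5 * (span - gap)
        if rng.random() < 0.5:
            return rng.uniform(low, left_end)
        return rng.uniform(left_end + gap, high)
    raise ValueError(f"unknown mode: {mode}")

rng = random.Random(0)
train_extra = [sample_param(rng, -1.0, 1.0, "left") for _ in range(1000)]
train_intra = [sample_param(rng, -1.0, 1.0, "outer") for _ in range(1000)]
test_full = [sample_param(rng, -1.0, 1.0, "full") for _ in range(1000)]
print(max(train_extra), min(abs(x) for x in train_intra))
```

For the range $[-1, 1]$ the "left" split covers roughly $[-1, 0.4]$ and the "outer" split covers roughly $[-1, -0.3] \cup [0.3, 1]$, so both leave 30% of the range unseen during training.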

Visualize environment encoder mapping.

50%	70%	100%

Center	Left	Out	Full

Higher dimensional case

Test case: compress 2d/3d/4d residue dynamic parameters with environment encoder to a 2d embedding.

2d	3d	4d
	dim3=0	dim3,4=[0,0]
	dim3=0.5	dim3,4=[1,0]
	dim3=1.0	dim3,4=[1,1]

Conclusion: In OOD parameters, the encoder space is still continuous.

visualize optimal z.

2 residue dynamic parameters

none OOD	OOD	OOD

No compressor visualization.

3 disturbance values

visualize disturbance mapping

e->z

OOD e->z

mass	disturb	decay
![image-20230209191434616](/Users/reedpan/Library/Application Support/typora-user-images/image-20230209191434616.png)	![image-20230209191535401](/Users/reedpan/Library/Application Support/typora-user-images/image-20230209191535401.png)
resdyn & force scale	⚠️force scale	⚠️resdyn

none OOD e->z

mass	disturb	decay
		![image-20230209192350901](/Users/reedpan/Library/Application Support/typora-user-images/image-20230209192350901.png)
resdyn & force scale	force scale	resdyn

z-> control error

None OOD mass_max	OOD mass_max	None OOD disturb max	OOD disturb max

None OOD decay_max	OOD decay_max	none OOD res param	OOD res param

none OOD force scale	OOD force scale	noneOOD all max	OOD all max

e-> control error

higher dimensional case (12d uncertain parameters)

check with other kind of parameters.

Compressed to 1/2/3/4/5/6 dimensional value

1	2	3	4	5
![image-20230204213708462](/Users/reedpan/Library/Application Support/typora-user-images/image-20230204213708462.png)	![image-20230204213758802](/Users/reedpan/Library/Application Support/typora-user-images/image-20230204213758802.png)

PCA analysis with higher dimensional parameters

OOD performance evaluation

Performance in OOD case

Overall performance

Current objective: yellow line :arrow_right: blue line.
- since z_hat performance is close to z :arrow_right: z is the bottleneck.
Evaluation
- Compare the non-OOD environment encoder with the OOD one (what does the mapping look like for OOD parameters?)
- Compare e->z with e->z* to get inspiration on how to optimize z.
- Conclusion:
  - for OOD parameters, the mapping can be very imprecise.
  - with z_star, the performance could be better.
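
For reference, the three embeddings compared in this evaluation can be written as follows. The symbols $\phi$ (environment encoder), $\psi$ (adaptor), $h$ (recent state-action history), and $J$ (tracking error of the policy conditioned on an embedding, in an environment with parameters $e$) are notation assumed here for illustration; the notes themselves only use $e$, $z$, z_hat and z_star.

$$
z = \phi(e), \qquad \hat{z} = \psi(h), \qquad z^\star = \arg\min_{z'} J(z'; e)
$$

By construction $J(z^\star; e) \le J(z; e)$, so the gap between the two measures how much performance is lost by using the encoder output instead of the best available embedding.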

OOD
none OOD

Visualize performance given OOD/none OOD parameters.
- Conclusion:
  - optimization is hard at extreme parameters.

Training set	left boundary(OOD)	center parameter	right boundary(OOD)
100%
50%

conclusion:
- the compressor will make the parametric search convex -> good news for further optimization!
- In the OOD case, the inferred value might not be the optimal value for performance -> good news for generalizability and other optimization-based methods.

🔧 Velocity3: Model Mismatch Cases.

Training without Residue model.

Description: training: without residue dynamics; testing: with residue dynamics.
Result:

$f(v,u,w)$

| Setting | Expert | RMA before adaptation | RMA after adaptation | Vanilla (Robust) |
| --- | --- | --- | --- | --- |
| $d_w=1$ | 0.7379 | 0.8989 | 0.6864 | 0.4900 |
| $d_w=1$ C-4 | 0.6545 | 0.5992 | 0.5941 | |

$f(v,u)$

| Setting | Expert | RMA before adaptation | RMA after adaptation | Vanilla (Robust) |
| --- | --- | --- | --- | --- |
| $d_w=1$ | 0.0685 | 0.0808 | 0.1392 | 0.0872 |
| $d_w=1$ C-4 | 0.0659 | 0.0900 | 0.1254 | |

Conclusion:
- Since the force is less sensitive to the input, the policy becomes more stable.
- The introduction of residue dynamics makes prediction very noisy, especially for damping ratio and residue force.

Training with a simplified model

Description: train with a simplified model, then use the gradient-based method to update the policy again.
Result

Force Scale = [3, 3, 3] (all results use $d_w=1$)

| Training model | Expert | RMA before adaptation | RMA after adaptation |
| --- | --- | --- | --- |
| Fitted model (32 trajectories, mean error=0.29), fails to stabilize | 0.4646 | 0.3908 | 0.2465 |
| Fitted model (128 trajectories, mean error=0.07) | 0.6057 | 0.5747 | 0.5860 |
| Fitted model (512 trajectories, mean error=0.01) | 0.1654 | 0.5242 | 0.3003 |
| True model | 0.0718 | 0.0588 | 0.0355 |

Force Scale = [2, 4, 2] (all results use $d_w=1$)

| Training model | Expert | RMA before adaptation | RMA after adaptation |
| --- | --- | --- | --- |
| Fitted model (32 trajectories, mean error=0.327), fails to stabilize | 1.4253 | 0.5395 | 0.7555 |
| Fitted model (64 trajectories, mean error=0.102) | 0.1978 | 0.1346 | 0.1117 |
| Fitted model (128 trajectories, mean error=0.033) | 0.3220 | 0.2422 | 0.1854 |
| True model | 0.0781 | 0.0703 | 0.0447 |
| Dropout model | 0.1910 | 0.1951 | 0.1660 |

Conclusion
- The imperfect model makes the parameter prediction very imprecise.
- Significant performance deterioration is observed in these cases.
- After finetuning, the performance is still basically the same.
⚠️Note:
- When making the force scale smaller, the conclusion might differ.
- After reloading the policy and finetuning it again, the learned policy still performs poorly (no performance gain).
Interesting findings:
- For parametric OOD, further optimization might be a good idea. (But it is also very sensitive to the training parameter ranges.)
- The policy performs poorly in non-parametric OOD cases, which could be our research question. (The best test case is sim2real.)

TODO :ballot_box: try simplified wind vs. real wind.

jc-bao commented 1 year ago

▶️ Vector: study the sim2real setting

Research question: identify the real-world problem from a high-fidelity simulation environment. (switch to bottom-up research style. )
Possible tasks: multi-drone, single drone agile, transportation, etc.

:wrench: Pybullet-based lab environment setup

Using this environment as Lab to do a sanity check

:wrench: Crazyswarm setup

Mainly resolve the localization, control-delay, and controller-mismatch issues, and design a new velocity controller.
Target: better performance than MPPI in tracking tasks.
Substeps:
- Tracking: PID single -> PPO single -> PPO disturb single -> PPO multi
- Transportation: PID/MPC single -> PPO single -> PPO multi

jc-bao commented 1 year ago

🚗 Progress @2023.5.18

🗺️ Big Picture

deal with non-parametric uncertainty

✈️ Tasks

Single Agent: aggressive tracking (✔️ trained. 🔲 need better performance. ) + collision avoidance. (🤞🏻 fail to stabilize)
Multi-agent: aggressive tracking (2: ✔️ ) + collision avoidance. (🤞🏻 fail to train)

🥅 Next step

Sim2real: deploy single agent policy to verify 1️⃣ if the model mismatch matters 2️⃣ if our previous approach is useful.
- Subtask: sim2sim (dummy step -> fine sim step. )
- try tracking & jumping tasks to see which one has a larger gap.
- see the difference of those two kinds of dynamics.

🧷 Check list

- [ ] Method to verify non-parametric uncertainty.

jc-bao / policy-adaptation-survey

Project vector & velocity record. #12
