Octo is a transformer-based robot policy trained on a diverse mix of 800k robot trajectories.
https://octo-models.github.io/

Issue of gym_wrapper and action_heads #93

Open BUAAZhangHaonan opened 6 months ago

BUAAZhangHaonan commented 6 months ago

About gym_wrapper

What is the correct order for HistoryWrapper, RHCWrapper, TemporalEnsembleWrapper, and UnnormalizeActionProprio? According to the code in 03_eval_finetuned.py, the order is:

HistoryWrapper->RHCWrapper->UnnormalizeActionProprio

https://github.com/octo-models/octo/blob/cab7f94b4db2dd93063d9c7f3482360743e22ec7/examples/03_eval_finetuned.py#L67

The order in visualization_lib.py is:

HistoryWrapper->RHCWrapper->TemporalEnsembleWrapper->UnnormalizeActionProprio

https://github.com/octo-models/octo/blob/cab7f94b4db2dd93063d9c7f3482360743e22ec7/octo/utils/visualization_lib.py#L292

But the order in gym_wrappers.py is:

UnnormalizeActionProprio->RHCWrapper->HistoryWrapper

https://github.com/octo-models/octo/blob/cab7f94b4db2dd93063d9c7f3482360743e22ec7/octo/utils/gym_wrappers.py#L53

Does the nesting order, especially the relative order of HistoryWrapper and RHCWrapper, have any impact on the results?

I also noticed that the horizon parameter of HistoryWrapper hurts the prediction results once it exceeds 5, which seems counterintuitive: a longer observation history should, if anything, give better results. With horizon equal to 1 or 2 the results are good. Is this normal? How does OctoTransformer handle the stacked observations produced by HistoryWrapper?
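For concreteness, here is a minimal sketch of the three nestings I am comparing, written against the wrapper signatures as I read them in octo/utils/gym_wrappers.py (argument names like exec_horizon and pred_horizon are my reading of the current source and may differ across commits; the env id is a placeholder for my own environment):

```python
import gym

from octo.model.octo_model import OctoModel
from octo.utils.gym_wrappers import (
    HistoryWrapper,
    RHCWrapper,
    TemporalEnsembleWrapper,
    UnnormalizeActionProprio,
)

model = OctoModel.load_pretrained("hf://rail-berkeley/octo-base")
env = gym.make("my-robot-env")  # placeholder id for my own environment

# (1) Order from examples/03_eval_finetuned.py, applied innermost-first:
env_a = HistoryWrapper(env, horizon=2)
env_a = RHCWrapper(env_a, exec_horizon=4)
env_a = UnnormalizeActionProprio(
    env_a, model.dataset_statistics, normalization_type="normal"
)

# (2) Order from octo/utils/visualization_lib.py, with temporal ensembling:
env_b = HistoryWrapper(env, horizon=2)
env_b = RHCWrapper(env_b, exec_horizon=4)
env_b = TemporalEnsembleWrapper(env_b, pred_horizon=4)
env_b = UnnormalizeActionProprio(
    env_b, model.dataset_statistics, normalization_type="normal"
)

# (3) Order I read from the gym_wrappers.py example, i.e. reversed:
env_c = UnnormalizeActionProprio(
    env, model.dataset_statistics, normalization_type="normal"
)
env_c = RHCWrapper(env_c, exec_horizon=4)
env_c = HistoryWrapper(env_c, horizon=2)
```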

About action_heads

I ran fine-tuning tests on the three action heads and found that the L1 and MSE heads were stable, while the diffusion head is very unstable: without enough training steps, the simulation rollout could not even run normally (the head configuration I used is sketched after the video): https://github.com/octo-models/octo/issues/43#issuecomment-2105933956

https://github.com/octo-models/octo/assets/107928115/5f0b1008-3993-4112-9460-c59c42a580d0
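For reference, I swapped the heads in the fine-tuning config roughly as in examples/02_finetune_new_observation_action.py. This is a sketch under my assumptions: the action-horizon field is named pred_horizon in some commits, and action_horizon=4 / action_dim=14 are just the settings for my ACT setup.

```python
from octo.model.components.action_heads import (
    DiffusionActionHead,
    L1ActionHead,
    MSEActionHead,
)
from octo.model.octo_model import OctoModel
from octo.utils.spec import ModuleSpec

model = OctoModel.load_pretrained("hf://rail-berkeley/octo-base")
config = model.config  # fine-tuning config, as in the finetuning example

# Replace the pretrained action head before fine-tuning; I repeated the run
# once per head class (L1ActionHead / MSEActionHead / DiffusionActionHead).
config["model"]["heads"]["action"] = ModuleSpec.create(
    L1ActionHead,
    readout_key="readout_action",
    action_horizon=4,  # my ACT chunk length (assumption)
    action_dim=14,     # bimanual ACT action dimension
)
```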

I trained the diffusion head for 50,000 steps on ACT, and the results were not satisfactory, far inferior even to the L1 and MSE heads trained for only 5,000 steps. How should the parameters or network structure be adjusted so that the diffusion policy performs at its best? I also noticed that the L1 head produces roughly correct motions after 1,000 training steps, but the details are still poor (it cannot grasp objects correctly); this improved after 5,000 steps of training.

https://github.com/octo-models/octo/assets/107928115/c09f157c-327a-4617-bcd4-2b1fc767a097
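For the diffusion question above, these are the knobs I have been looking at, continuing the config sketch earlier in this post. The field names and defaults are what I read in octo/model/components/action_heads.py and may not match every version:

```python
# Continuing the config sketch above: diffusion-specific hyperparameters.
config["model"]["heads"]["action"] = ModuleSpec.create(
    DiffusionActionHead,
    readout_key="readout_action",
    action_horizon=4,
    action_dim=14,
    diffusion_steps=20,   # number of denoising steps when sampling actions
    num_blocks=3,         # depth of the score-network MLP
    hidden_dim=256,       # width of the score-network MLP
    dropout_rate=0.0,
    use_layer_norm=True,
)
```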

Does this mean the fine-tuning is overfitting? The training cost required to reach a usable policy also seems relatively high.