Just a quick note on `gym_real_world`. At Pollen Robotics, we tried to add a new task id input and:
- We had to make significant changes within the gym environment to make it work, raising the question: is our current gym setup extensible enough for users to add features?
- It was also very hard to debug. For example, we forgot to change the observation space, and instead of getting an error within the step method, we simply did not find the input task id within the observations (see the sketch after this list). This raises the question: is the gym env easily debuggable?
- It's hard to write a small model-testing script. The intertwining of the model and the environment makes it very hard to swap a model between environments, or to use the model in a context where you're not creating an environment at all, e.g. a 100-LoC script with just the model and a for loop.
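To illustrate the debuggability point, here is a hypothetical minimal env (the class name, shapes, and the pattern of building observations from the declared space are assumptions for illustration, not the actual `gym_real_world` code) showing how a forgotten observation-space entry can make a new key vanish silently:

```python
import gymnasium as gym
import numpy as np

# Hypothetical minimal env reproducing the failure mode: the observation is
# built from the declared observation_space, so a key that was never added
# to the space silently disappears instead of raising inside step().
class RealWorldEnv(gym.Env):
    def __init__(self):
        self.observation_space = gym.spaces.Dict({
            # "task_id" was forgotten here...
            "qpos": gym.spaces.Box(-np.inf, np.inf, shape=(7,)),
        })
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(7,))
        self._state = {"qpos": np.zeros(7, dtype=np.float32), "task_id": 0}

    def step(self, action):
        # ...so only the declared keys end up in the returned observation.
        obs = {k: self._state[k] for k in self.observation_space.spaces}
        return obs, 0.0, False, False, {}

env = RealWorldEnv()
obs, *_ = env.step(env.action_space.sample())
assert "task_id" not in obs  # no error anywhere; the key is just missing
```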
I think that the example API should look something like the following:
```python
import time

import cv2

policy = make_policy(hydra_cfg, pretrained_policy_name_or_path)
camera = cv2.VideoCapture(0)
while True:
    start = time.perf_counter()
    observation = {
        "image": camera.read()[1],  # VideoCapture.read() returns (ok, frame)
        "qpos": dynamixel.read([0, 1, 2, 3, 4, 5, 6]),
    }
    observation = preprocess_observation(observation)
    action = policy.select_action(observation)
    # Sleep off the remainder of the control period before actuating.
    inference_time = time.perf_counter() - start
    time.sleep(max(0.0, 1 / fps - inference_time))
    dynamixel.write(action, [0, 1, 2, 3, 4, 5, 6])
```
I think we should let users define and manage their own `gym_env` if they want to, but that should probably not be the default way of using lerobot.
So basically, I'm opening the discussion: can and should we remove the gym environment at inference time for the real world?
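For users who do want a gym interface, a thin wrapper over the same raw I/O could remain possible. A minimal sketch, assuming the hypothetical `camera` and `dynamixel` handles from the loop above:

```python
import gymnasium as gym

class RobotEnv(gym.Env):
    """Optional, user-defined wrapper around raw robot I/O."""

    def __init__(self, camera, dynamixel, motor_ids=(0, 1, 2, 3, 4, 5, 6)):
        self.camera = camera
        self.dynamixel = dynamixel
        self.motor_ids = list(motor_ids)

    def _get_obs(self):
        return {
            "image": self.camera.read()[1],
            "qpos": self.dynamixel.read(self.motor_ids),
        }

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return self._get_obs(), {}

    def step(self, action):
        self.dynamixel.write(action, self.motor_ids)
        return self._get_obs(), 0.0, False, False, {}
```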
Thanks @haixuanTao, maybe more a comment for https://github.com/huggingface/lerobot/pull/246?
This PR is really just a deep dive into the algorithmic differences between our implementation of ACT and the original, in particular on the model side. I'll probably close this PR and open a series of smaller ones addressing some of these differences.
[DRAFT WIP] Work in progress: deep dive into the differences between our ACT implementation for real-world data and the implementation from https://github.com/thomwolf/ACT
The goal is to see if we can find room for improvement in our short-data ACT trainings (reducing jitter under the same conditions as the original ACT code).
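As a minimal sketch of how such differences could be localized (the `lerobot_act` / `original_act` handles and the batch keys are assumptions, not code from either repo), one could run the same dummy batch through both implementations and compare outputs:

```python
import torch

torch.manual_seed(0)
batch = {
    "observation.images.top": torch.rand(1, 3, 480, 640),
    "observation.state": torch.rand(1, 14),
}
with torch.no_grad():
    ours = lerobot_act.select_action(batch)     # assumed handle to our ACT
    theirs = original_act.select_action(batch)  # assumed handle to the original
# A large gap here points to a model-side divergence worth bisecting.
print("max abs diff:", (ours - theirs).abs().max().item())
```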
Currently listed differences: