Multi-Image support for VQ-BeT

bkpcoding commented 1 month ago

Hello, I wanted to ask whether it is possible to run VQ-BeT with multiple cameras for environments that provide several views, like Robomimic. If so, can someone give me pointers on what exactly I need to change? I would be happy to submit a PR once I get it working on my side and the ICLR deadline has passed!

Currently, if I understand correctly, we need to change the VQBeTRgbEncoder. It seems to support multiple camera views, but there is an assert statement that checks that the number of image views is 1. Is there a specific reason for this assert statement, i.e., do I need to change something else as well?

alexander-soare commented 1 month ago

Hi! This shouldn't be too difficult. Check out this older PR that did something similar with Diffusion Policy: https://github.com/huggingface/lerobot/pull/218.

You'll need to manage the plumbing. From an actual NN architecture perspective it's pretty basic: just add each extra image in as another observation token. Also check ACT, as that will be more similar to VQ-BeT in this sense: it treats each image as a separate token.
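
A rough, framework-agnostic sketch of that plumbing might look like the following. It is not the actual lerobot code: the `observation.image` key prefix, the `observation.state` key, the camera names, and both helper functions are assumptions made for illustration, and the encoders are toy stand-ins for the real RGB encoder and state projection.

```python
from typing import Any, Callable, Dict, List, Sequence

# Assumed key convention: every camera stream starts with this prefix,
# e.g. "observation.images.agentview" (hypothetical camera names).
IMAGE_KEY_PREFIX = "observation.image"
STATE_KEY = "observation.state"


def find_image_keys(input_shapes: Dict[str, Sequence[int]]) -> List[str]:
    """Collect all camera keys instead of asserting that exactly one exists."""
    # Sorting keeps the token order deterministic across runs.
    keys = sorted(k for k in input_shapes if k.startswith(IMAGE_KEY_PREFIX))
    if not keys:
        raise ValueError("expected at least one image input")
    return keys


def build_observation_tokens(
    batch: Dict[str, Any],
    image_keys: Sequence[str],
    encode_image: Callable[[Any], Any],
    encode_state: Callable[[Any], Any],
) -> List[Any]:
    """Return one token per camera view, followed by one state token."""
    tokens = [encode_image(batch[key]) for key in image_keys]
    tokens.append(encode_state(batch[STATE_KEY]))
    return tokens


# Toy usage with stand-in encoders; a real policy would call its RGB
# encoder and a learned state projection here.
input_shapes = {
    "observation.images.agentview": (3, 84, 84),
    "observation.images.wrist": (3, 84, 84),
    "observation.state": (9,),
}
batch = {
    "observation.images.agentview": [0.0, 0.5, 1.0],
    "observation.images.wrist": [1.0, 1.0, 1.0],
    "observation.state": [0.1, 0.2],
}
keys = find_image_keys(input_shapes)
tokens = build_observation_tokens(
    batch,
    keys,
    encode_image=lambda pixels: [sum(pixels) / len(pixels)],
    encode_state=lambda state: [sum(state)],
)
print(keys)
print(len(tokens))  # one token per camera plus one state token
```

The point of the sketch is only that the number of observation tokens grows with the number of cameras, so anything downstream that hard-codes the sequence length for a single image would need the same update.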

On another note, we are probably going to do a fairly major refactor to the way policies handle inputs/outputs some time soon.

bkpcoding commented 1 month ago

Thank you so much for pointing out the PR. I think I will do something similar.

huggingface / lerobot

Multi-Image support for VQ-BeT #407