huggingface / lerobot

🤗 LeRobot: Making AI for Robotics more accessible with end-to-end learning

Porting HIL-SERL #504

Open michel-aractingi opened 1 week ago

michel-aractingi commented 1 week ago

HIL-SERL in LeRobot


This page outlines the minimal list of components and tasks that should be implemented in the LeRobot codebase to port HIL-SERL. The official reference implementation is available in JAX here.

We will coordinate on Discord in #port-hil-serl. We will update this page with the IDs of the owners of each component. We encourage several people to work together on each component; you don't need to write extensive code on your own to make a valuable contribution, and any input on a sub-component, however small, is appreciated. Feel free to add extra components to the list if needed; this is only a guide, and we welcome more ideas.

Note: In parallel, we are refactoring the codebase, so you don't need to refactor anything yourself. Do not hesitate to copy files and code elements to arrive at a first working version as fast as possible.


  1. RLPD (Reinforcement Learning with Prior Data)

    • Goal: Develop the base RL algorithm for HIL-SERL. RLPD is an off-policy RL algorithm that leverages offline data.
    • Tasks:
      • [ ] Set up the neural network architectures for the policy and Q-function with LayerNorm in lerobot/lerobot/common/policies/hilserl (a minimal critic sketch is given after this list).
      • [ ] Define the SAC loss functions for fitting the Q-value and policy optimization steps.
      • [ ] Implement random ensemble distillation for the Q-function and double Q-learning.
      • [ ] Compute target policy entropy for the policy update step.
      • [ ] Define the update mechanism to automatically adjust the temperature variable.
      • [ ] Set up the target network for the Q-value predictions and its update mechanism.
    • Useful links:
      • Original implementation of RLPD, HIL-SERL implementation.
      • SAC explanation in Spinning Up, paper.
      • Training script in lerobot/scripts/train.py for offline and online data buffers and dataloader. TD-MPC implementation in LeRobot lerobot/common/policies/tdmpc/.
  2. Human Interventions

    • Goal: Develop the mechanism to add human interventions during online training. HIL-SERL uses a 3D SpaceMouse to control the robot's end-effector; we can use the leader arm for this instead.
    • Tasks:
      • [ ] Define the logic and functions to stop the policy and take over inside the record function, possibly interfaced with keyboard keys that stop the policy and give the user a few seconds to get ready to take over.
      • [ ] Define the necessary functions for the leader to follow the position of the follower and start from the same position at the moment of intervention.
      • [ ] Define the logic to differentiate the data collected from human interventions from the offline and online data, e.g., by adding an extra column to the HF dataset when adding new episodes.
      • [ ] Define the sampling logic proposed in HIL-SERL for each category of data, e.g., adjust the sampler weights by giving 1 to the offline data, 1 to the online data and 2 to the human interventions (see the sampling sketch after this list).
    • Useful links:
      • Alex's TD-MPC real-robot fork; check out some of the scripts he made for real-world training.
      • HG-DAgger paper.
  3. Reward Classifier

    • Goal: Build a reward classifier that returns a sparse reward indicating whether the frame is in a terminal success state or not.
    • Tasks:
      • [ ] Define the logic to label frames of the collected trajectories with SUCCESS or FAILURE. Ideally, if the demonstration is successful, we can label the last few timesteps as success, and vice versa.
      • [ ] Define a reward classifier class that learns to categorize the observations with rewards {-1, 0, 1}. Zero can be used for frames in the middle of an episode, before reaching a terminal state, or we can do it in a binary fashion (see the classifier sketch after this list).
      • [ ] Integrate the reward classifier either in lerobot/scripts/eval.py or in the RLPD code to query the reward every time a new frame is added to the online dataset.
    • Useful links:
      • HIL-SERL paper appendix B.
  4. Other Implementations

    • Several additional mechanisms proposed in HIL-SERL are key for it to work efficiently or to improve the overall performance. Here are a few that can be added to LeRobot.
      • [ ] Pre-process images: The paper uses image cropping to focus on areas of interest, and images are resized to 128x128 (see this PR to be merged that adds a resize function, https://github.com/huggingface/lerobot/pull/459, and create a new PR that adds a cropping function; a cropping/resizing sketch is given after this list).
      • [ ] Augment proprioception with velocity and force feedback: Augment the observation space with joint velocities/torques in ManipulatorRobot; we need to make sure the name linked to the register address is the same for Feetech and Dynamixel motors. For Feetech, the velocity register is Present_Speed and the torque (current) register is Present_Current.
      • [ ] Penalize gripper actions: Add a penalty on the gripper actions during grasping tasks to avoid unnecessary use of the gripper.
      • [ ] Add simulation support: To simplify experimentation, we can try to run HIL-SERL in sim. The main simulation environment for now is gym_lowcostrobot. All of these components can be tested in sim, and further tasks that resemble those in the paper can be added as well.

Note: The paper uses end-effector control and velocity control for dynamic tasks, but our first implementation won't include them.
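
Below are a few minimal sketches for the components above. For component 1, this is a rough PyTorch sketch of a LayerNorm critic ensemble with a clipped double-Q target computed over a random subset of the ensemble (REDQ-style) and a Polyak target-network update. All names, network sizes and the ensemble/subset sizes are placeholders, not the final design; the official JAX implementation linked above remains the reference.

```python
import torch
import torch.nn as nn


def mlp(in_dim: int, hidden_dim: int, out_dim: int) -> nn.Sequential:
    # Small MLP with LayerNorm after each hidden layer, as RLPD recommends for the critic.
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.Tanh(),
        nn.Linear(hidden_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.Tanh(),
        nn.Linear(hidden_dim, out_dim),
    )


class CriticEnsemble(nn.Module):
    # Ensemble of Q-functions; the target is computed over a random subset of members.
    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 256, num_critics: int = 10):
        super().__init__()
        self.q_nets = nn.ModuleList(
            [mlp(obs_dim + action_dim, hidden_dim, 1) for _ in range(num_critics)]
        )

    def forward(self, obs: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        x = torch.cat([obs, action], dim=-1)
        return torch.stack([q(x) for q in self.q_nets], dim=0)  # (num_critics, batch, 1)


def compute_q_target(critic_target, next_obs, next_action, next_log_prob, reward, done,
                     alpha: float, gamma: float = 0.99, num_subsample: int = 2):
    # Clipped double-Q target with the SAC entropy term; reward, done and next_log_prob
    # are expected as (batch, 1) tensors.
    with torch.no_grad():
        all_q = critic_target(next_obs, next_action)
        idx = torch.randperm(all_q.shape[0], device=all_q.device)[:num_subsample]
        min_q = all_q[idx].min(dim=0).values
        next_v = min_q - alpha * next_log_prob
        return reward + gamma * (1.0 - done) * next_v


def soft_update(target: nn.Module, source: nn.Module, tau: float = 0.005) -> None:
    # Polyak averaging for the target network update.
    with torch.no_grad():
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.mul_(1.0 - tau).add_(tau * sp)
```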
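
For the sampling logic in component 2, a minimal sketch using a PyTorch WeightedRandomSampler with the 1/1/2 weights mentioned above. The TensorDatasets are placeholders standing in for the offline, online and intervention buffers (in practice, LeRobot datasets with an extra column tagging intervention episodes).

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# Placeholder buffers; shapes and sizes are arbitrary for illustration.
offline_ds = TensorDataset(torch.randn(1000, 10))
online_ds = TensorDataset(torch.randn(500, 10))
intervention_ds = TensorDataset(torch.randn(100, 10))

# Per-sample weights: intervention frames are sampled twice as often as offline/online ones.
weights = torch.cat([
    torch.full((len(offline_ds),), 1.0),
    torch.full((len(online_ds),), 1.0),
    torch.full((len(intervention_ds),), 2.0),
])

dataset = ConcatDataset([offline_ds, online_ds, intervention_ds])
sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
loader = DataLoader(dataset, batch_size=256, sampler=sampler)
```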
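
For component 3, a minimal sketch of a binary reward classifier over image observations. The tiny CNN encoder is only illustrative (the paper fine-tunes a pretrained vision backbone), and the 128x128 input size and 0.5 threshold are assumptions; at rollout time the thresholded probability gives the sparse reward.

```python
import torch
import torch.nn as nn


class RewardClassifier(nn.Module):
    # Predicts the probability that an image observation is a terminal success frame.
    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, 1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(image)).squeeze(-1)  # logits, shape (batch,)

    @torch.no_grad()
    def reward(self, image: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        # Sparse reward: 1 for predicted success frames, 0 otherwise.
        return (torch.sigmoid(self(image)) > threshold).float()


# Training on frames labeled SUCCESS (1) / FAILURE (0), e.g. the last few timesteps of
# successful demonstrations vs. everything else.
clf = RewardClassifier()
frames = torch.randn(8, 3, 128, 128)
labels = torch.randint(0, 2, (8,)).float()
loss = nn.BCEWithLogitsLoss()(clf(frames), labels)
loss.backward()
```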
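
For the image pre-processing in component 4, a minimal torchvision sketch of cropping a region of interest and resizing to 128x128; the crop coordinates are placeholders that would come from the task config.

```python
import torch
import torchvision.transforms.functional as F

# Example camera frame (C, H, W) in float [0, 1]; resolution is arbitrary here.
frame = torch.rand(3, 480, 640)

# Crop a region of interest (placeholder coordinates), then resize to the 128x128
# resolution used in the paper.
roi = F.crop(frame, top=100, left=200, height=256, width=256)
roi = F.resize(roi, [128, 128], antialias=True)
```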

michel-aractingi commented 2 days ago

Regarding RLPD

We can use the PushT environment to test whether RLPD is working properly. This will also allow us to compare against our baseline RL algorithm, TD-MPC.

PushT has two observation modes: an image state and a privileged vector state with 'keypoints'. Training with the keypoints state is an easier task that is useful to quickly validate that your implementation works. Training with the image state is our end goal.
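
As a rough illustration (the exact settings live in the attached config files below), both observation modes can be instantiated through the gym_pusht package. The environment id and obs_type strings here are assumptions and should be double-checked against the gym-pusht repository.

```python
import gymnasium as gym
import gym_pusht  # noqa: F401  # registers the PushT environments (assumed installed)

# Keypoints / state observation: faster to train, handy for validating the RLPD port.
env_keypoints = gym.make("gym_pusht/PushT-v0", obs_type="environment_state_agent_pos")

# Image observation: the end goal.
env_pixels = gym.make("gym_pusht/PushT-v0", obs_type="pixels_agent_pos")

# Random-policy rollout to sanity-check the environment loop.
obs, info = env_keypoints.reset(seed=0)
for _ in range(10):
    obs, reward, terminated, truncated, info = env_keypoints.step(env_keypoints.action_space.sample())
    if terminated or truncated:
        obs, info = env_keypoints.reset()
```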

You can try training PushT with our TD-MPC to get a better idea. Here are the relevant config files and training commands. Make sure that you enable wandb and set it properly on your system so that you can monitor training and observe the eval runs.

Config files (change to .yaml and add to lerobot/lerobot/configs/policy): tdmpc_pusht_keypoints.txt tdmpc_pusht.txt

Run the training commands in the following files: train_pusht_keypoints.txt train_pusht.txt

For more references on TD-MPC: main paper, FOWM paper, Alexander Soare videos 1 and 2.

jianlanluo commented 2 days ago

Thanks for initiating this! I would actually recommend using Cartesian space control whenever you can do that, as in our experience it simplifies a lot of stuff in the learning process.

But I guess many people following this PR are also interested in using RL for low-cost robots which don't have built-in EE control, so I am also curious how that works in practice.

@jeffreywu13579 @charlesxu0124