real-stanford / scalingup

[CoRL 2023] This repository contains data generation and training code for Scaling Up & Distilling Down
https://www.cs.columbia.edu/~huy/scalingup/

Real Robot Evaluation #14

Closed wangyan-hlab closed 1 year ago

wangyan-hlab commented 1 year ago

Hi Huy, @huy-ha

First, thank you for your kind help so far. I have been pushing the reproduction work forward and now I think I am ready to evaluate the policy on a real robot.

According to your guidance, I found the diffusion policy repo. There are 2 questions about it:

  1. In the policy folder, there are many diffusion policies. Could you tell me which one exactly I should use? _(I guess it's diffusion_unet_hybrid_image_policy?)_ And do I need to edit that policy to fit your scalingup policy, or modify some other code?
  2. In eval_real_robot.py, the checkpoint is loaded like this:
    # load checkpoint
    ckpt_path = input
    payload = torch.load(open(ckpt_path, 'rb'), pickle_module=dill)
    cfg = payload['cfg']
    cls = hydra.utils.get_class(cfg._target_)
    workspace = cls(cfg)
    workspace: BaseWorkspace
    workspace.load_payload(payload, exclude_keys=None, include_keys=None)

    but the checkpoint from my training seems to have a different structure: there isn't a 'cfg' key in it, nor the other keys used in the script (a quick way to inspect this is sketched below). Would you please give more information about how to modify the script to fix this issue?
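
A minimal sketch (not from either repo) for checking what keys a checkpoint actually contains before adapting eval_real_robot.py; the path is a placeholder, and pickle_module=dill is only needed if the checkpoint was saved that way, as in the snippet above:

    import dill
    import torch

    ckpt_path = "path/to/your/checkpoint.ckpt"  # placeholder path
    payload = torch.load(open(ckpt_path, "rb"), pickle_module=dill, map_location="cpu")
    # If this is a PyTorch Lightning checkpoint, expect keys like 'state_dict' and
    # 'hyper_parameters'; the checkpoint that eval_real_robot.py expects carries a 'cfg' entry.
    print(list(payload.keys()))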


Also, is it possible to load a model from the checkpoint and directly predict the action (i.e., eef position + eef uppermat + gripper command) from a REAL observation? I am trying to extract the code that instantiates a diffusion policy model, feed it a fake input, and get some output:

import hydra
import torch
import numpy as np
from scalingup.data.dataset import StateTensor
from scalingup.data.window_dataset import StateSequenceTensor

def create_statetensor():

    device = 'cuda'

    shape = (1, 3)
    end_effector_position = np.random.rand(*shape)
    end_effector_position_tensor = torch.from_numpy(end_effector_position).float().to(device)

    print(end_effector_position)

    shape = (1, 9)
    end_effector_orientation = np.random.rand(*shape)
    end_effector_orientation_tensor = torch.from_numpy(end_effector_orientation).float().to(device)

    gripper_command = np.array([[1]])
    gripper_command_tensor = torch.from_numpy(gripper_command).float().to(device)

    shape = (1, 1, 3)
    input_xyz_pts = np.random.rand(*shape)
    input_xyz_pts_tensor = torch.from_numpy(input_xyz_pts).float().to(device)
    input_rgb_pts = np.random.rand(*shape)
    input_rgb_pts_tensor = torch.from_numpy(input_rgb_pts).float().to(device)

    occupancy_vol = np.array([[0]])
    occupancy_vol_tensor = torch.from_numpy(occupancy_vol).float().to(device)

    time = np.array([[0.25]])
    time_tensor = torch.from_numpy(time).float().to(device)

    # (batch, 3, H, W) uint8 RGB images; the dict keys must match the camera view names used during training
    shape = (1, 3, 160, 240)
    front_view = np.random.randint(0, 255, size=shape)
    front_view_tensor = torch.tensor(front_view, dtype=torch.uint8).to(device)
    topdown_view = np.random.randint(0, 255, size=shape)
    topdown_view_tensor = torch.tensor(topdown_view, dtype=torch.uint8).to(device)
    wrist_view = np.random.randint(0, 255, size=shape)
    wrist_view_tensor = torch.tensor(wrist_view, dtype=torch.uint8).to(device)
    views = {'front': front_view_tensor,
             'top_down': topdown_view_tensor,
             'fr5/robotiq_2f85/d435i/rgb': wrist_view_tensor}

    state_tensor = StateTensor(
        end_effector_position=end_effector_position_tensor,
        end_effector_orientation=end_effector_orientation_tensor,
        gripper_command=gripper_command_tensor,
        input_xyz_pts=input_xyz_pts_tensor,
        input_rgb_pts=input_rgb_pts_tensor,
        occupancy_vol=occupancy_vol_tensor,
        time=time_tensor,
        views=views
    )

    return state_tensor

def create_input_data():
    sequence = []

    state_tensor_1 = create_statetensor()
    state_tensor_2 = create_statetensor()

    sequence.append(state_tensor_1)
    sequence.append(state_tensor_2)

    state_sequence_tensor = StateSequenceTensor(sequence=sequence)

    return state_sequence_tensor

@hydra.main(config_path="config/", config_name="inference", version_base="1.2")
def main(cfg):

    model = hydra.utils.instantiate(cfg.policy)
    print(type(model), model)

    with torch.no_grad():
        ### Trying to get an output from the policy here, but haven't succeeded yet
        input_data = create_input_data()
        # output = model(obs=input_data, task=..., seed=...)
        # print(output)
        output = model.get_stats(input_data)
        print(output)

if __name__ == "__main__":
    main()

But I haven't found a proper way to get an output. Would you please give some suggestions?

Best Regards

Originally posted by @wangyan-hlab in https://github.com/columbia-ai-robotics/scalingup/issues/1#issuecomment-1706263853

huy-ha commented 1 year ago

Hey, happy to see you're making so much progress!

I used their codebase as a starting point, and made my own (hacky) scripts (scalingup_real_scripts.zip).

wangyan-hlab commented 1 year ago

Hi, @huy-ha

Thank you very much for your help! I have successfully evaluated the policy on our real FR5 robot. Although the evaluation keeps failing to finish the bin transport task, I believe this is due to a suboptimal policy and the sim2real gap.

BTW, I find that my policy tends to start the transport motion without actually grasping an object (e.g., if one finger contacts the other without an object in between, the robot still continues to move toward the target bin). I wonder if this is related to the collision detection setup?

Best regards

yellow07200 commented 1 year ago

Hi @wangyan-hlab ,

Happy to see your successful real experiments!

I am still struggling with this. In my real experiments, the output (action sequences) always points in a weird direction. Could you please advise me on the following questions:

  1. Have you done any calibration, or applied a transformation matrix between the front camera and the robot? (A sketch of what such a transform does is shown after this list.)
  2. Do we need to rebuild the simulation environment exactly (e.g., the camera positions)?
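
Not from the repo, just a generic illustration of what "applying the transformation matrix between the camera and the robot" means: a 4x4 extrinsic that maps points from the camera frame into the robot base frame. The matrix below is a placeholder; a real one would come from hand-eye calibration.

    import numpy as np

    # Placeholder camera-to-base extrinsic (rotation + translation); a real
    # T_base_cam would come from hand-eye calibration.
    T_base_cam = np.eye(4)

    def camera_to_base(points_cam: np.ndarray) -> np.ndarray:
        """Transform an (N, 3) point cloud from the camera frame to the robot base frame."""
        homogeneous = np.hstack([points_cam, np.ones((points_cam.shape[0], 1))])
        return (T_base_cam @ homogeneous.T).T[:, :3]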

Appreciate your kind reply and help!

Best regards

huy-ha commented 1 year ago

Hey @wangyan-hlab,

Great to see you've done a first round of evaluations.

yellow07200 commented 1 year ago

Hi @huy-ha,

I didn't set the env/domain_rand_config... I think I need to regenerate the dataset and train the model again. https://github.com/real-stanford/scalingup/blob/3d2f43c213aed8b2c811e635ac8f3ef39bd210c4/scalingup/config/evaluation/single_env.yaml#L5

Thanks for your help!

Best regards

wangyan-hlab commented 1 year ago

Hey @wangyan-hlab,

Great to see you've done a first round of evaluations.

  • Did you use domain randomization? This should help with visual sim2real a lot.
  • You can consider increasing the magnitude of visual augmentations (see the sketch after this list). This should help the vision encoder learn more transferable representations.
  • Did the policy do well in simulation evaluation?
  • In your training data, did the policy observe retrying behavior? You can visualize all videos of a data generation run on Weights & Biases. In my experiments there were plenty of retries, but I just want to eliminate this as a possible cause.
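
For reference, "increasing the magnitude of visual augmentations" usually means something like the following; this is a generic torchvision sketch with made-up magnitudes, not the repo's actual augmentation config:

    import torch
    from torchvision import transforms

    # Stronger photometric + geometric augmentation than a typical default;
    # the exact values here are illustrative only.
    augment = transforms.Compose([
        transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
        transforms.RandomResizedCrop(size=(160, 240), scale=(0.8, 1.0), antialias=True),
    ])

    dummy_image = torch.rand(3, 160, 240)  # (C, H, W) float image in [0, 1]
    augmented = augment(dummy_image)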

Hey @yellow07200 ,

In my experiments, since I used domain randomization over camera poses, I didn't have to calibrate. I just placed the camera in front of the robot where it roughly matched.

the output (action sequences) is always pointing to a weird direction

Did you load the action normalization from the checkpoint? Is it completely off, or is it close to the object but not on the object?
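
For context, diffusion policies typically predict actions normalized to [-1, 1]; the min/max statistics saved at training time have to be loaded and applied before commanding the robot, and forgetting that step could produce exactly this kind of "weird direction" output. A generic sketch (the stats keys and values are hypothetical, not the repo's exact format):

    import numpy as np

    def unnormalize_actions(actions_norm: np.ndarray, stats: dict) -> np.ndarray:
        """Map actions predicted in [-1, 1] back to their original range."""
        amin, amax = stats["min"], stats["max"]  # hypothetical keys in the saved stats
        return (actions_norm + 1.0) / 2.0 * (amax - amin) + amin

    # Example: a (horizon, action_dim) sequence of normalized actions.
    stats = {"min": np.array([-0.5, -0.5, 0.0]), "max": np.array([0.5, 0.5, 0.6])}
    actions = unnormalize_actions(np.zeros((8, 3)), stats)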

Hi, @huy-ha

Thank you for your reply.

I think I will first try to increase the magnitude of visual augmentations and improve the performance of sim evaluation. If the success rate of the sim evaluation rises, I will try the real evaluation again to see what will happen.

Best regards

huy-ha commented 1 year ago

Oh interesting. And this is with your FR5 setup right? Did the policy's accuracy on weights and biases converge yet, and how many datapoints did you use? Did you also try reproducing the transport results with the codebase's UR5, and did that work?

wangyan-hlab commented 1 year ago

Oh interesting. And this is with your FR5 setup right? Did the policy's accuracy on weights and biases converge yet, and how many datapoints did you use? Did you also try reproducing the transport results with the codebase's UR5, and did that work?

Yes, it is with my FR5 setup. I trained for 5,000 steps x 10 epochs and the training loss seems to converge to about 0.003. The terminal outputs "Using 68,182 points from 662 trajectories out of 662 (100.0%)", so 68,182 datapoints should have been used. I haven't reproduced the results on a UR5 yet, but I do have access to a UR5e and may try it if possible.

yellow07200 commented 1 year ago

Hi @huy-ha,

Actually, the phenomenon I described in my last post happened in simulation (sorry for the unclear description), so the evaluation result was poor with an average success rate of 20~30%. The average success rate in data generation was over 70%.

I am facing the same issue when I change my setup to UR10.

Additionally, after I used domain randomization over camera poses to generate a dataset with the same setup as the original code (UR5), the training success rate was also only around 20%. Do you have any idea why this happens? Many thanks.

Best regards

huy-ha commented 1 year ago

@yellow07200 Could you share some code to reproduce the UR10 setup? Also, in my case, domain randomization with the original UR5 setup achieves >80%, so this is unexpected as well. Did you install the conda environment exactly as in the provided yaml file?

@wangyan-hlab That loss seems normal to me, but the behavior is very surprising. Can I reproduce this result with the latest commit from #18?

wangyan-hlab commented 1 year ago

@yellow07200 Could you share some code to reproduce the UR10 setup? Also, in my case, domain randomization with the original UR5 setup achieves >80%, so this is unexpected as well. Did you install the conda environment exactly as in the provided yaml file?

@wangyan-hlab That loss seems normal to me, but the behavior is very surprising. Can I reproduce this result with the latest commit from #18?

@huy-ha Hi, Huy. Yes, I think the result can be reproduced with the latest commit from #18. Please let me know if there's any problem. Thank you very much.

huy-ha commented 1 year ago

Hey @wangyan-hlab ,

Thanks for being patient. Compute was tight due to the CVPR deadline.

The steps I took include:

I trained 5,000 steps x 10 epochs and the training loss seems to converge to about 0.003.

Using the default configuration (10000 steps x 10 epochs), the diffusion loss at epoch 6 is 0.0038. MSE Loss /mean and /best were 0.0348 and 0.00786 respectively. Below I've attached some visualizations.

https://github.com/real-stanford/scalingup/assets/33562579/4d470a22-8c2c-41e2-85e6-ab3d2ad514c3 https://github.com/real-stanford/scalingup/assets/33562579/14166aca-e08b-4ef3-aa00-ea02b34abf0b https://github.com/real-stanford/scalingup/assets/33562579/704dd158-fce4-48f2-9b97-7c75498638f4 https://github.com/real-stanford/scalingup/assets/33562579/ed9d8641-6290-4f92-82dd-7bd7d5dcfd38

Qualitatively, the policy does exhibit retrying behavior. Quantitatively, it's plateauing at a 70% success rate, but I think that's mostly due to the policy running out of time.

In summary, our data should have been identical; I just ran data generation for longer to get more data. Our policy training configurations are identical except for how long I trained. However, the results above were from epoch 6, so that policy didn't train much longer than yours did.

I'm still surprised about your result. I don't think the difference in the amount of data should have resulted in such a big difference. I would have guessed that with 600 trajectories yours would have achieved about 60% or so. I'll run another training with roughly a similar amount of data and let you know if it also performs poorly.

wangyan-hlab commented 1 year ago

Hi Huy @huy-ha,

Thank you very much for reproducing the results on FR5 robot!

Please allow me to briefly summarize the differences between your reproducing setup and mine:

  1. You hid the kinematic chain of the FR5 robot in the prompt
  2. You used the top-down view + the wrist-mounted view instead of the front view + the wrist-mounted view in my config
  3. You generated many more trajectories (about 4x) than I did

I'm really happy to see your excellent reproduction results, but I am also surprised by the differences.

Due to the limits of my hardware and time, I wasn't able to generate that much data for training. But as you say, my success rate was much lower than expected for 600 trajectories.

I really appreciate your help and look forward to your reply.

Good luck to you and your team at CVPR!

Best regards

huy-ha commented 1 year ago

Yep! However, I don't think 1) made any difference, because both your data and mine succeeded around 70% of the time and included retry attempts. 2) and 3) are the significant differences. I'll let you know when I get the results.

huy-ha commented 1 year ago

Hey @wangyan-hlab ,

Just a quick update.

Not surprisingly, top-down camera views are better for this grasping task than wrist-mounted ones, and more data does better.

You used 662 trajs / 68,182 pts but only got 20-30%. I think it can still reach close to 44% if you just leave it training for longer.

Hope these experiment results help!