Real Robot Evaluation - Githubissues

wangyan-hlab commented 1 year ago

Hi Huy, @huy-ha

First, thank you for your kind help so far. I have been pushing the reproduction work forward and now I think I am ready to evaluate the policy on a real robot.

According to your guidance, I found the diffusion policy repo. There are 2 questions about it:

In the policy folder, there are many diffusion policies. I would like to know which one exactly I should use? _(I guess it's diffusion_unet_hybrid_image_policy?)_ And do I need to edit the policy to fit your scalingup policy, or modify some other codes?
In the eval_real_robot.py, it loads a checkpoint like this:
```
# load checkpoint
ckpt_path = input
payload = torch.load(open(ckpt_path, 'rb'), pickle_module=dill)
cfg = payload['cfg']
cls = hydra.utils.get_class(cfg._target_)
workspace = cls(cfg)
workspace: BaseWorkspace
workspace.load_payload(payload, exclude_keys=None, include_keys=None)
```
but the checkpoint from my training seems to have a different structure and there isn't a 'cfg' key there, as well as other keys used in the script. Would you please give more information about how to modify the script to fix the issue?

Also, is it possible to load a model from the checkpoint and directly predict the action (i.e. eef position+eef uppermat+gripper command) with the REAL observation? I am trying to extract the code instantiating a diffusion policy model, provide a fake input, and hope to get some output:

import hydra
import torch
import numpy as np
from scalingup.data.dataset import StateTensor
from scalingup.data.window_dataset import StateSequenceTensor

def create_statetensor():

    device = 'cuda'

    shape = (1, 3)
    end_effector_position = np.random.rand(*shape)
    end_effector_position_tensor = torch.from_numpy(end_effector_position).float().to(device)

    print(end_effector_position)

    shape = (1, 9)
    end_effector_orientation = np.random.rand(*shape)
    end_effector_orientation_tensor = torch.from_numpy(end_effector_orientation).float().to(device)

    gripper_command = np.array([[1]])
    gripper_command_tensor = torch.from_numpy(gripper_command).float().to(device)

    shape = (1, 1, 3)
    input_xyz_pts = np.random.rand(*shape)
    input_xyz_pts_tensor = torch.from_numpy(input_xyz_pts).float().to(device)
    input_rgb_pts = np.random.rand(*shape)
    input_rgb_pts_tensor = torch.from_numpy(input_rgb_pts).float().to(device)

    occupancy_vol = np.array([[0]])
    occupancy_vol_tensor = torch.from_numpy(occupancy_vol).float().to(device)

    time = np.array([[0.25]])
    time_tensor = torch.from_numpy(time).float().to(device)

    shape = (1, 3, 160, 240)
    front_view = np.random.randint(0, 255, size=shape)
    front_view_tensor = torch.tensor(front_view, dtype=torch.uint8).to(device)
    topdown_view = np.random.randint(0, 255, size=shape)
    topdown_view_tensor = torch.tensor(topdown_view, dtype=torch.uint8).to(device)
    wrist_view = np.random.randint(0, 255, size=shape)
    wrist_view_tensor = torch.tensor(wrist_view, dtype=torch.uint8).to(device)
    views = {'front': front_view_tensor, 
            'top_down': topdown_view_tensor,
            'fr5/robotiq_2f85/d435i/rgb': wrist_view_tensor}

    state_tensor = StateTensor(
        end_effector_position=end_effector_position_tensor,
        end_effector_orientation=end_effector_orientation_tensor,
        gripper_command=gripper_command_tensor,
        input_xyz_pts=input_xyz_pts_tensor,
        input_rgb_pts=input_rgb_pts_tensor,
        occupancy_vol=occupancy_vol_tensor,
        time=time_tensor,
        views=views
    )

    return state_tensor

def create_input_data():
    sequence = []

    state_tensor_1 = create_statetensor()
    state_tensor_2 = create_statetensor()

    sequence.append(state_tensor_1)
    sequence.append(state_tensor_2)

    state_sequence_tensor = StateSequenceTensor(sequence=sequence)

    return state_sequence_tensor

@hydra.main(config_path="config/", config_name="inference", version_base="1.2")
def main(cfg):

    model = hydra.utils.instantiate(cfg.policy)
    print(type(model), model)

    with torch.no_grad():
        ###TRYING TO OUTPUT SOMETHING BUT HAVEN'T DONE
        input_data = create_input_data()
        # output = model(obs=input_data, task=..., seed=...)
        # print(output)
        output = model.get_stats(input_data)
        print(output)

if __name__ == "__main__":
    main()

But I haven't find a proper way to output something. Would you please give some suggestions?

Best Regards

Originally posted by @wangyan-hlab in https://github.com/columbia-ai-robotics/scalingup/issues/1#issuecomment-1706263853

huy-ha commented 1 year ago

Hey, happy to see you're making so much progress!

I used their codebase as a starting point, and made my own (hacky) scripts (scalingup_real_scripts.zip).

Main eval script: You'll find the loop in scaling_up_eval_real_robot.py. Most of the structure is identical to that of diffusion policy's, and it uses their multiprocess shared memory ring buffer stack. If I'm not mistaken, you have a FR5 setup, not a UR5. I would start by making a separate class to support your hardware platform (e.g., change out RTDEInterpolationController, etc.)
Set Up: This is the hackiest part of the collection of scripts, because I just wanted to inject my policy and configuration into the diffusion policy codebase. You'll find everything that should go into a command line argument configuration (e.g., which check point to use, what image resolution) in scaling_up_constants.py, and the entire policy/network definition inside scaling_up_policy.py.
Debugging: Preferably, you'd get an idea of what kind of actions your policy will do before it actually runs on the real robot. You can collect some real world policy inputs using scaling_up_generate_test_cases.py, evaluate these policies offline using scaling_up_offline_evaluation.py, and visualize the policy's predicted actions as pointclouds using scaling_up_visualize_actions.py. Lastly, you can also just directly replay an action from simulation (open loop, does not use any learned policy) in the real environment using scaling_up_replay_real_robot.py, just to test that your sim actions are indeed reasonable in your real world setup. This script will be a good starting point to make sure you've supported your FR5 setup correctly in the diffusion policy codebase.

Hopefully these will still give you a good starting point!

wangyan-hlab commented 1 year ago

Hi, @huy-ha

Thank you very much for your help! I have successfully evaluated the policy on our real FR5 robot. Although the evaluation keeps failing to finish a bin transport task, I believe it suffers from a suboptimal policy and the sim2real gap.

BTW, I find that my policy tends to start a transport action without actually grasping a object (e.g., if one finger contacts another without grasping a object, the robot still continue to move to the target bin). I wonder if it is related to the collision detection setup?

Best regards

yellow07200 commented 1 year ago

Hi @wangyan-hlab ,

Happy to see you did the successful real experiments!

I am still struggling with that. In real experiments, the output (action sequences) is always pointing to a weird direction. Can you please advise me on the following questions:

Have you done any calibration or apply the transformation matrix between the front camera and the robot?
Do we need to 100% rebuild the environment of the simulation (position of cameras)?

Appreciate your kind reply and help!

Best regards

huy-ha commented 1 year ago

Hey @wangyan-hlab,

Great to see you've done a first round of evaluations.

Did you use domain randomization? This should help with visual sim2real a lot.
You can consider increasing the magnitude of visual augmentations. This should help the vision encoder learn more transferrable representations.
Did the policy do well in simulation evaluation?
In your training data, did the policy observe retrying behavior? You can visualize all videos of a data generation process on weights and biases. In my experiments, there were plenty of retries, but I just want to eliminate this as a possible cause.

Hey @yellow07200 ,

In my experiments, since I used domain randomization over camera poses, I didn't have to calibrate. I just placed the camera in front of the robot where it roughly matched.

the output (action sequences) is always pointing to a weird direction

Did you load the action normalization in from the checkpoint? Is it completely off or is it close the the object but not on the object?

yellow07200 commented 1 year ago

Hi @huy-ha,

I didn't set the env/domain_rand_config... I think I need to regenerate the dataset and train the model again. https://github.com/real-stanford/scalingup/blob/3d2f43c213aed8b2c811e635ac8f3ef39bd210c4/scalingup/config/evaluation/single_env.yaml#L5

Thanks for your help!

Best regards

wangyan-hlab commented 1 year ago

Hey @wangyan-hlab,

Great to see you've done a first round of evaluations.

Did you use domain randomization? This should help with visual sim2real a lot.

You can consider increasing the magnitude of visual augmentations. This should help the vision encoder learn more transferrable representations.

Did the policy do well in simulation evaluation?

In your training data, did the policy observe retrying behavior? You can visualize all videos of a data generation process on weights and biases. In my experiments, there were plenty of retries, but I just want to eliminate this as a possible cause.

Hey @yellow07200 ,

In my experiments, since I used domain randomization over camera poses, I didn't have to calibrate. I just placed the camera in front of the robot where it roughly matched.

the output (action sequences) is always pointing to a weird direction

Did you load the action normalization in from the checkpoint? Is it completely off or is it close the the object but not on the object?

Hi, @huy-ha

Thank you for your reply.

Yes, I have used the domain randomization in data generation process.
I haven't tried increasing the magnitude of visual augmentations, I will have a try.
Actually, the phenomenon I described in my last post happened in simulation (sorry for the unclear description), so the evaluation result was poor with an average success rate of 20~30%. The average success rate in data generation was over 70%. And in the sim evaluation, the robot usually stopped returning to the right bin after ONE failed try (it got stuck above the left bin until the time is over).
Yes, there were a lot of retries in the training data.

I think I will first try to increase the magnitude of visual augmentations and improve the performance of sim evaluation. If the success rate of the sim evaluation rises, I will try the real evaluation again to see what will happen.

Best regards

huy-ha commented 1 year ago

Oh interesting. And this is with your FR5 setup right? Did the policy's accuracy on weights and biases converge yet, and how many datapoints did you use? Did you also try reproducing the transport results with the codebase's UR5, and did that work?

wangyan-hlab commented 1 year ago

Oh interesting. And this is with your FR5 setup right? Did the policy's accuracy on weights and biases converge yet, and how many datapoints did you use? Did you also try reproducing the transport results with the codebase's UR5, and did that work?

Yes, it is with my FR5 setup. I trained 5,000 steps x 10 epochs and the training loss seems to converge to about 0.003. The terminal output "Using 68,182 points from 662 trajectories out of 662 (100.0%)", so 68182 datapoints were supposed to be used. I haven't reproduced the results on a UR5 yet, but I do have access to a UR5e and maybe try it if possible.

yellow07200 commented 1 year ago

Hi @huy-ha,

Actually, the phenomenon I described in my last post happened in simulation (sorry for the unclear description), so the evaluation result was poor with an average success rate of 20~30%. The average success rate in data generation was over 70%.

I am facing the same issue when I change my setup to UR10.

Additionally, after I used domain randomization over camera poses to generate the dataset using the same setup in the original code (UR5), The training success rate is also only around 20%. Do you have any idea why this happens? Many thanks.

Best regards

huy-ha commented 1 year ago

@yellow07200 Could you share some code to reproduce the UR10 setup? Also, in my case, domain randomization with the original UR5 setup achieves >80%, so this is unexpected as well. Did you install the conda environment exactly as in the provided yaml file?

@wangyan-hlab That loss seems normal to me, but the behavior is very surprising. Can I reproduce this result with the latest commit from #18?

wangyan-hlab commented 1 year ago

@yellow07200 Could you share some code to reproduce the UR10 setup? Also, in my case, domain randomization with the original UR5 setup achieves >80%, so this is unexpected as well. Did you install the conda environment exactly as in the provided yaml file?

@wangyan-hlab That loss seems normal to me, but the behavior is very surprising. Can I reproduce this result with the latest commit from #18?

@huy-ha Hi, huy. Yes, I think the result would be reproduced with the latest commit from #18. Please let me know if there's any problem. Thank you very much.

huy-ha commented 1 year ago

Hey @wangyan-hlab ,

Thanks for being patience. Compute was tight due to CVPR deadline.

The steps I took include:

1. Check the language model output. I started from examples/exploration_task_tree.py and replaced the environment config name with the FR5 bin transport config. Running it showed me that the entire kinematic hierarchy of the robot appeared in the language prompt, which is not useful information for the LLM. In my code before, I hardcoded "UR5" into the LanguageStateEncoder to filter out the robot's kinematic chain. I've made the code more generic to robot name, so now the FR5's kinematic chain doesn't appear in the prompt (https://github.com/real-stanford/scalingup/commit/218a618b8922da5018e824721011dd6727f2f8f9).
1. Policy Observation Space: The default list of cameras used for data generation consisted three cameras, but most tasks usually only need one global workspace view and one local detailed view. I used the top down view and the wrist-mounted view (changed in common.yaml).
1. Generated some data: I got a 68.18% success rate for data generation. The behavior of the data generation policy before and after the change in https://github.com/real-stanford/scalingup/commit/218a618b8922da5018e824721011dd6727f2f8f9 should be identical. Namely, there is retry in the data, and our data generation success rates roughly matched. I generated 2594 trajectories, which amounts to 258457 points.

I trained 5,000 steps x 10 epochs and the training loss seems to converge to about 0.003.

Using the default configuration (10000 steps x 10 epochs), the diffusion loss at epoch 6 is 0.0038. MSE Loss /mean and /best were 0.0348 and 0.00786 respectively. Below I've attached some visualizations.

https://github.com/real-stanford/scalingup/assets/33562579/4d470a22-8c2c-41e2-85e6-ab3d2ad514c3 https://github.com/real-stanford/scalingup/assets/33562579/14166aca-e08b-4ef3-aa00-ea02b34abf0b https://github.com/real-stanford/scalingup/assets/33562579/704dd158-fce4-48f2-9b97-7c75498638f4 https://github.com/real-stanford/scalingup/assets/33562579/ed9d8641-6290-4f92-82dd-7bd7d5dcfd38

Qualitatively, the policy does exhibit retrying behavior. Quantitatively, its plateauing at 70% success rate, but I think its mostly due to the policy running out of time.

In summary, our data should have been identical, but I just ran data generation for longer to get more data. Our policy training configuration is identical except for how long I trained for. However, the results above were from epoch 6, so that policy didn't train much longer than yours did.

I'm still surprised about your result. I don't think the difference in the amount of data should have resulted in such a big difference. I would have guessed that with 600 trajectories yours would have achieved about 60% or so. I'll run another training with roughly a similar amount of data and let you know if it also performs poorly.

wangyan-hlab commented 1 year ago

Hi Huy @huy-ha,

Thank you very much for reproducing the results on FR5 robot!

Please allow me to briefly summarize the differences between your reproducing setup and mine:

You hided the kinematic chain of FR5 robot in the prompt
You used the top-down view + the wrist-mounted view instead of the front view + the wrist-mounted view in my config
You generated much more trajectories (about 4x) than I did

I'm really happy to see your excellent reproducing results but also surprised about the differences.

Due to the limit of my device and time, I wasn't able to generate too much data for training. But as you say, my success rate was much lower than expected with 600 trajectories.

I really appreciate your help and look forward to your reply.

Good luck to you and your team at the CVPR!

Best regards

huy-ha commented 1 year ago

Yep! However, I don't think 1) contributed any difference because the data both you and I had succeeded around 70% of the time and had retry attempts. 2) and 3) are the significant differences. I'll let you know when I get the results.

huy-ha commented 1 year ago

Hey @wangyan-hlab ,

Just a quick update.

The previous run, which used top down cameras + wrist mount with 2594 trajs/258457 pts,actually ended up at 78% success rate.
One run which used top down cameras + wrist mount with 280 trajs/28529 pts, is reaching 50% success rates.
One run which used front cameras + wrist mount with 741 trajs/ 68262 pts, is getting 44% success rates.

Not surprisingly, top down camera views are better this grasping task than wrist mounted ones, and more data does better.

You used 662 trajs/68182pts, but only got 20 - 30%. I think it can still reach close to 44% if you just leave it training for longer.

Hope these experiment results help!

real-stanford / scalingup

Real Robot Evaluation #14