nickgkan / 3d_diffuser_actor

Code for the paper "3D Diffuser Actor: Policy Diffusion with 3D Scene Representations"
https://3d-diffuser-actor.github.io/
MIT License

Question about the predicting strategy #35

Closed AaronChuh closed 1 month ago

AaronChuh commented 1 month ago

Hi, thank you for your nice work. I have a question about the prediction strategy of your model.

Diffusion Policy predicts k steps at a time instead of the whole trajectory, which means that during training they not only sample a trajectory but also randomly sample a slice of length (history + k steps to be predicted).
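
For concreteness, this is roughly what I mean by sampling a slice (a minimal sketch with my own names, not Diffusion Policy's actual code):

```python
import numpy as np

def sample_slice(trajectory, n_history=2, n_pred=16):
    # trajectory: (T, action_dim) array for one demo
    T = len(trajectory)
    window = n_history + n_pred
    # pick a random start so the window stays inside the demo when possible
    start = np.random.randint(0, max(T - window, 0) + 1)
    chunk = trajectory[start:start + window]
    # pad by repeating the last step if the demo is shorter than the window
    if len(chunk) < window:
        pad = np.repeat(chunk[-1:], window - len(chunk), axis=0)
        chunk = np.concatenate([chunk, pad], axis=0)
    history = chunk[:n_history]   # observation/conditioning steps
    actions = chunk[n_history:]   # the k steps to be denoised
    return history, actions
```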

In your paper, I find

'During inference, 3D Diffuser Actor can either predict and execute the full trajectory of actions upto the next keypose (including the keypose), or just predict the next keypose and use a sampling-based motion planner to reach it'

It seems your model can predict either k steps or just 1 step, so I wonder how many steps your model predicts in a complete denoising process. How do you implement this in the code? I've checked your code, but I find that your CALVIN dataset class only returns a full trajectory per iteration. Maybe I missed something important.

Looking forward to your kind reply. Thanks!

nickgkan commented 1 month ago

Hi, we use the keypose temporal abstraction. This means that during training the demo is broken down into a few poses that may have some semantic meaning (e.g., the gripper changes state or stops near an object). We train to predict either only the next keypose, which may be several trajectory steps away, or the next keypose plus the full trajectory up to that keypose. In the latter case, to handle batching, we interpolate a fixed number of steps in that trajectory.
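
Roughly, the interpolation-for-batching idea looks like the sketch below (illustrative only, with per-dimension linear interpolation; the actual code in this repo differs in details such as rotation handling):

```python
import numpy as np

def resample_trajectory(traj, n_steps=20):
    # traj: (T, D) array of end-effector poses along the demo segment
    T, D = traj.shape
    src_t = np.linspace(0.0, 1.0, T)
    tgt_t = np.linspace(0.0, 1.0, n_steps)
    # per-dimension linear interpolation; rotations would need special
    # handling in practice (e.g. slerp), this only shows the batching idea
    return np.stack([np.interp(tgt_t, src_t, traj[:, d]) for d in range(D)],
                    axis=1)
```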

During inference, for RLBench we predict the next keypose and then use a planner to get there. For CALVIN we jointly predict the next keypose and the trajectory to get there. Note that the keypose is usually temporally far in the future. Say an episode can be completed in N steps. If Diffusion Policy predicts and executes k steps (in fact they predict k but execute fewer), then it needs ~N/k predictions in total. On the other hand, our model predicts K keyposes (K is variable; our model is free to predict keyposes until episode completion). Because keyposes are temporally far apart, usually K << N/k.
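
As a concrete illustration with made-up numbers:

```python
# Made-up numbers, purely to illustrate the counting argument above.
N = 200   # environment steps needed to finish the episode
k = 8     # steps executed per Diffusion Policy prediction
K = 5     # keyposes in the episode (task-dependent)

print(N // k)  # ~25 denoising passes for a receding-horizon policy
print(K)       # 5 denoising passes with the keypose abstraction
```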

I hope this helps.

AaronChuh commented 1 month ago

Thanks for your quick reply! And I have another question.

You just mentioned that for RLBench you predict the next keypose. If I'm not mistaken, a keypose is a 7-dimensional (or 10-dimensional) vector, the same as a single step in a trajectory, right? Thus the quantity being diffused is a 7-dimensional (or 10-dimensional) vector. Diffusion Policy and other diffusion-based planners predict k steps at once, because they believe this design enhances temporal consistency and makes training more stable.
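
To be concrete, I picture the keypose vector roughly like this (this layout is just my assumption; your code may order or encode these components differently):

```python
import numpy as np

# My own assumption about the layout -- not necessarily what the repo uses.
# 7-dim keypose:  [x, y, z, qx, qy, qz, qw]        position + quaternion
# 10-dim keypose: [x, y, z, r1, ..., r6, gripper]  position + 6D rotation + gripper state
keypose = np.concatenate([
    np.zeros(3),                           # end-effector position
    np.array([0., 0., 0., 1.]),            # identity quaternion
])
assert keypose.shape == (7,)
```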

In another issue, you mentioned that training on RLBench needs more iterations and computing resources than on CALVIN. Is that because the tasks in RLBench are more difficult, or because training a diffusion model that predicts only one step is less stable? I'm just curious and haven't done any experiments myself. Have you conducted any experiments on this? Could you share your considerations about the different training strategies on RLBench and CALVIN?

Thanks!

twke18 commented 1 month ago

We cannot conclude whether CALVIN is more challenging than RLBench, or vice versa. They have different design choices.

On RLBench, the manipulation tasks are more diverse (e.g. turn on the faucet, place the mug on the rack). However, the testing scene is the same as the training scene, varying only in the appearance and location of the interacting objects. The end-effector trajectories of the demonstrations are clean, as they are generated from oracle scripts. The language instructions also have little diversity.

On the other hand, on CALVIN, the manipulation tasks are simple (e.g. pick up the block, open the drawer). However, the training and testing scenes are different, and the benchmark requires completing a series of complex language instructions. Moreover, the end-effector trajectories of the demonstrations are noisy, as they are generated from human play data.

Our greater need for computing resources on RLBench is due to the larger number of camera views and the higher image resolution.

AaronChuh commented 1 month ago

Got it! Thanks again for your nice work and kind reply!