Slow Diffusion Policy Performance on Transport MH Dataset (Image Input)

shim0114 commented 1 year ago

Dear @cheng-chi,

I would like to bring to your attention a performance issue I've encountered when working with the Transport MH dataset (image input) in robomimic. In particular, the performance of the diffusion policy seems to be significantly slower than expected.

Here are the details of my setup:

Hardware: NVIDIA TITAN RTX
Iteration speed: approximately 1.20 iterations/second
Time per epoch: approximately 40 minutes

The command I use to run the training process is as follows:

python train.py --config-dir configs/image/transport_mh/diffusion_policy_cnn --config-name=config.yaml training.seed=42 training.device=cuda:2 hydra.run.dir=data/outputs/${now:%Y.%m.%d}/${now:%H.%M.%S}_${name}_${task_name} dataloader.batch_size=64 dataloader.num_workers=8

Given these circumstances, I was wondering if there might be some room for optimization or if this is the expected speed considering the complexity of the task.

Also, could you provide details about the hardware you are using and the amount of time it typically takes for the diffusion policy to train on your setup? This could help me understand if what I am experiencing is within the expected range.

Looking forward to your insights.

Best Regards, @shim0114

cheng-chi commented 1 year ago

Hi @shim0114, All sim experiments are conducted on AWS g5 series instances (with specific instance size used depending on avaliability). I roughy rememeber that transport tasks require g5.16xlarge or larger instances. I don't remember the training speed for transport_mh, but it is possible that its training is slower than other tasks due to the increased number of image observations (2 -> 4). Our RobomimicReplayImageDataset uses Jpeg2k codec for in-memory compression of the dataset (using zarr), therefore your CPU could be the performance bottleneck. Could you verify your CPU and GPU ultilization?

cheng-chi commented 1 year ago

@shim0114 Bad news, it took us a month to run this experiment :(

cheng-chi commented 1 year ago

@shim0114 If you are interested getting different metrics, you might want to checkout our experiment logs: https://diffusion-policy.cs.columbia.edu/data/experiments/image/transport_mh/diffusion_policy_cnn/

XPLearner commented 1 month ago

one month for one training exp?

real-stanford / diffusion_policy

Slow Diffusion Policy Performance on Transport MH Dataset (Image Input) #4