Closed shim0114 closed 1 year ago
Hi @shim0114, All sim experiments are conducted on AWS g5 series instances (with the specific instance size depending on availability). I roughly remember that transport tasks require g5.16xlarge or larger instances. I don't remember the training speed for transport_mh, but it is possible that its training is slower than other tasks due to the increased number of image observations (2 -> 4). Our RobomimicReplayImageDataset uses the Jpeg2k codec for in-memory compression of the dataset (using zarr), so your CPU could be the performance bottleneck. Could you verify your CPU and GPU utilization?
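To see why per-sample decompression can bottleneck training, you can time the decode path in isolation. Here is a minimal sketch using stdlib `zlib` as a stand-in for the Jpeg2k codec the dataset actually uses (the payload size and names are illustrative, not the repo's API):

```python
import time
import zlib

# Synthetic payload roughly the size of one 84x84x3 uint8 camera
# observation (~21 KB), standing in for a stored frame.
image_bytes = (bytes(range(256)) * ((84 * 84 * 3) // 256 + 1))[:84 * 84 * 3]

# Compress once, as the replay dataset would hold it in memory.
compressed = zlib.compress(image_bytes, level=1)

# Time repeated decompression: this work happens on the CPU every
# time a sample is fetched, so with 4 camera views per sample it
# multiplies and can starve the GPU.
n = 500
start = time.perf_counter()
for _ in range(n):
    raw = zlib.decompress(compressed)
elapsed = time.perf_counter() - start

assert raw == image_bytes  # round trip is lossless
print(f"{n / elapsed:.0f} decompressions/sec")
```

If the measured decode rate times the number of image observations per sample is close to your dataloader throughput, the CPU decode path is the bottleneck; watching `nvidia-smi` alongside `htop` during training will show the same thing as low GPU utilization with saturated CPU cores.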
@shim0114 Bad news, it took us a month to run this experiment :(
@shim0114 If you are interested in getting different metrics, you might want to check out our experiment logs: https://diffusion-policy.cs.columbia.edu/data/experiments/image/transport_mh/diffusion_policy_cnn/
one month for one training exp?
Dear @cheng-chi,
I would like to bring to your attention a performance issue I've encountered when working with the Transport MH dataset (image input) in robomimic. In particular, training the diffusion policy seems to be significantly slower than expected.
Here are the details of my setup:
The command I use to run the training process is as follows:
Given these circumstances, I was wondering whether there is room for optimization, or whether this is the expected speed given the complexity of the task.
Also, could you share details of the hardware you used and how long the diffusion policy typically takes to train on your setup? This would help me understand whether what I am experiencing is within the expected range.
Looking forward to your insights.
Best Regards, @shim0114