mlzxy / devit

CoRL 2024
https://mlzxy.github.io/devit
MIT License

Details of Hardware used for "Real Robot Experiment" #62

Open retazo0018 opened 1 month ago

retazo0018 commented 1 month ago

Hi @mlzxy ,

Could you please share the hardware specs of the system used for the "Real Robot Experiment" in the paper? Since it produced near-instant results, it would help us know the setup and replicate it for our use case.

Many Thanks,

Best,

mlzxy commented 1 month ago

Hi,

Thanks for reaching out. Honestly, the hardware spec does not matter much as long as you have an RGBD camera 😂, because the hardware experiment is just a pick-and-place task. Let me explain a little more.

First, we place the object in an empty space with a clean background and use SAM to automatically extract the mask (by clicking a few fixed locations). Then we extract prototypes from the given mask and run detection on a new scene. All objects of that class are picked sequentially. The pick-and-place procedure only requires a proper grasp pose, which you can generate from the object point cloud (cropped by the instance segmentation mask) with GPG https://github.com/atenpas/gpg or more advanced tools. The grasp pose does not depend on the robot arm. The prototype extraction and detection parts are the same as in the YCB demo.
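For reference, here is a minimal sketch of the point-cloud cropping step, assuming a pinhole RGBD camera with known intrinsics `fx, fy, cx, cy` (the mask is whatever the detector/SAM produces; the helper name is mine, not from this repo):

```python
import numpy as np
import open3d as o3d  # assumed dependency; any point-cloud library would do

def masked_depth_to_pointcloud(depth_m, mask, fx, fy, cx, cy):
    """Back-project depth pixels inside one instance mask to 3D camera-frame points.

    depth_m: HxW depth image in meters; mask: HxW boolean instance mask from the
    detector / SAM; fx, fy, cx, cy: pinhole intrinsics of the RGBD camera.
    """
    v, u = np.nonzero(mask & (depth_m > 0))  # valid pixels inside the mask
    z = depth_m[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)       # (N, 3) points

# Hypothetical usage: write the cropped cloud to disk and hand it to GPG
# (or another grasp planner) to get a grasp pose.
# points = masked_depth_to_pointcloud(depth, mask, fx, fy, cx, cy)
# pcd = o3d.geometry.PointCloud()
# pcd.points = o3d.utility.Vector3dVector(points)
# o3d.io.write_point_cloud("object.pcd", pcd)
```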

Unfortunately I don’t have plans to organize and release that part of the code though.

Best, Xinyu


retazo0018 commented 1 month ago

Thanks @mlzxy for your reply,

I tried the ViT-L open-vocabulary model on my custom dataset and it produced good results at around 0.8 FPS on a Jetson Orin device.

I'm looking for 5-10 FPS for my application. Is it possible to reach that rate? If not, could you point me in the right direction and I will experiment a bit on my own.

Thanks,

Best,

mlzxy commented 1 month ago

To improve speed, I suggest the following:

  1. Use a smaller TOP_K. This greatly reduces memory cost and also improves inference speed (no retraining required).
  2. Use fp16 via torch.amp.autocast. I only apply it to the backbone at the moment, but it should be possible to apply it to the whole network (no retraining required); see the sketch after this list.
  3. Retrain the model with my refactored implementation (as noted in the README) and use ViT-small at the same time. The refactored implementation gives better performance overall, with a slightly higher memory cost during training.
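
For option 2, a rough sketch of wrapping the backbone forward pass in autocast (`backbone` and `image_tensor` are placeholder names, not this repo's actual identifiers):

```python
import torch

# Sketch: run the (frozen) ViT backbone under fp16 autocast for faster inference.
# `backbone` and `image_tensor` are placeholders, not DE-ViT's actual names.
with torch.no_grad():
    with torch.amp.autocast("cuda", dtype=torch.float16):
        features = backbone(image_tensor)  # backbone forward pass in fp16
features = features.float()                # cast back to fp32 before the detection head if needed
```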

Best, Xinyu
