qizekun / ShapeLLM

[ECCV 2024] ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
https://qizekun.github.io/shapellm/
Apache License 2.0

Does ShapeLLM only support point cloud input? #10

Closed cocoshe closed 1 month ago

cocoshe commented 1 month ago

Thanks for your great work~ I have some difficulties reading the source code.

Q1: The original version of ReCon supports multimodal input (pts, img, text), but ReCon++ doesn't seem to support multi-view images as input?

The original RECON: https://github.com/qizekun/ReCon/blob/main/models/ReCon.py#L275-L322

Q2: Where can I find the Hungarian algorithm code for multi-view images in the source code?

qizekun commented 1 month ago

Hi,

A1: Both ReConV1 and ReConV2 use multimodal input (pts, img, text); ReConV1 uses a single-view image while ReConV2 uses multi-view images: https://github.com/qizekun/ShapeLLM/blob/main/ReConV2/datasets/OpenShape.py.
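For intuition, a multi-view training sample looks roughly like the sketch below; the field names and shapes are only illustrative, not the exact keys produced by OpenShape.py.

```python
import torch

# Illustrative only: field names and shapes are assumptions, not the exact
# keys returned by ReConV2/datasets/OpenShape.py.
sample = {
    "pts": torch.randn(8192, 3),              # xyz point cloud
    "imgs": torch.randn(4, 3, 336, 336),      # several rendered views per object
    "text": "a wooden chair with four legs",  # paired caption
}
```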

A2: The Hungarian algorithm is a bipartite matching algorithm; we directly use the implementation from the scipy library: https://github.com/qizekun/ShapeLLM/blob/main/ReConV2/models/ReCon.py#L285.
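As a rough illustration of this kind of matching with scipy (a minimal sketch, not the exact code in ReCon.py):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

# Minimal sketch: match N learnable image queries to N multi-view image
# features by minimizing a cosine-distance cost matrix.
def hungarian_match(query_feats: np.ndarray, view_feats: np.ndarray):
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    v = view_feats / np.linalg.norm(view_feats, axis=1, keepdims=True)
    cost = 1.0 - q @ v.T                      # (N, N) cosine-distance cost
    row_ind, col_ind = linear_sum_assignment(cost)
    return row_ind, col_ind                   # query row_ind[i] pairs with view col_ind[i]
```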

cocoshe commented 1 month ago

Thank you so much for your reply~

From my point of view, there are two different stages of ReCon++:

First, the 3D point clouds, the (multi-view) images, and the text are used to train ReCon++, which serves as the 3D encoder for the MLLM paradigm, with timm/eva_large_patch14_336.in22k_ft_in1k as the base model. Besides, ReCon++ uses some learnable queries, including image queries and text queries, as assistants to exploit the 2D and text information in the datasets and enhance the 3D representation (i.e., it uses a multi-view 2D loss and a text loss instead of only the MAE 3D point reconstruction loss).
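To make my understanding concrete, I picture the objective roughly like the sketch below; the names and loss forms are my guesses, not the actual implementation:

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the pretraining objective as I understand it; the
# function and argument names are illustrative, not ReCon++'s real API.
def recon_pp_loss(rec_pts, gt_pts, img_queries, img_feats, txt_queries, txt_feats):
    loss_3d = F.mse_loss(rec_pts, gt_pts)  # MAE-style point reconstruction (placeholder metric)
    loss_2d = 1 - F.cosine_similarity(img_queries, img_feats, dim=-1).mean()   # multi-view 2D alignment
    loss_txt = 1 - F.cosine_similarity(txt_queries, txt_feats, dim=-1).mean()  # text alignment
    return loss_3d + loss_2d + loss_txt
```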

BTW, the base ViT eva_large_patch14_336.in22k_ft_in1k doesn't seem to be equipped with CrossBlocks (maybe it only provides the plain Blocks for ReCon++?), so does ReCon++ train the CrossBlocks from scratch?

Then comes the MLLM part: the 3D encoder parameters are frozen and the inference function of the MaskTransformer (the encoder) is called. At this stage only point cloud data is supported, since the inference function only takes a pts input parameter:

https://github.com/qizekun/ShapeLLM/blob/d4f59e7feaf39e1ea1042134a00aacabbdb11392/ReConV2/models/ReCon.py#L139
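Roughly what I have in mind (a toy stand-in for the encoder, only to illustrate the freezing and pts-only inference pattern, not real ShapeLLM code):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the pretrained ReCon++ / MaskTransformer encoder
# (the real one lives in ReConV2/models/ReCon.py); only the freezing and
# pts-only inference pattern is what I'm asking about.
class FakeReConPP(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.proj = nn.Linear(3, dim)        # stand-in for the real transformer

    def inference(self, pts):                # mirrors the pts-only inference entry point
        return self.proj(pts)

encoder = FakeReConPP()
for p in encoder.parameters():
    p.requires_grad = False                  # freeze the 3D encoder for the MLLM stage

pts = torch.randn(1, 8192, 3)                # a batch with one point cloud (xyz)
with torch.no_grad():
    point_tokens = encoder.inference(pts)    # only point clouds go in at this stage
# These tokens would then be projected and fed to the LLM together with text tokens.
```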

However, the 3D encoder still achieves better performance thanks to the additional training data (multi-view 2D data and text data) used when training ReCon++.

Can you tell me if there is any deviation in my understanding? Thanks a lot!

qizekun commented 1 month ago

Hi, yes. We believe that MAE pretraining can achieve local geometric understanding, while cross-modal pretraining with teacher distillation can acquire semantic understanding.