Hi,
A1: Both ReConV1 and ReConV2 use multimodal input (pts, img, text); ReConV1 uses a single-view image, while ReConV2 uses multi-view images: https://github.com/qizekun/ShapeLLM/blob/main/ReConV2/datasets/OpenShape.py.
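For illustration only, a toy sketch of what a (pts, multi-view imgs, text) sample looks like in spirit; the keys, shapes, and class name here are assumptions, not the actual fields in OpenShape.py:

```python
import torch
from torch.utils.data import Dataset

class ToyMultiViewDataset(Dataset):
    """Toy stand-in for a (pts, multi-view imgs, text) sample; keys/shapes are illustrative."""
    def __init__(self, num_items=8, num_views=4, num_points=1024):
        self.num_items, self.num_views, self.num_points = num_items, num_views, num_points

    def __len__(self):
        return self.num_items

    def __getitem__(self, idx):
        return {
            "pts": torch.randn(self.num_points, 3),              # xyz point cloud
            "imgs": torch.randn(self.num_views, 3, 224, 224),    # multi-view renders
            "text": f"rendered object #{idx}",                   # caption
        }
```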
A2: The Hungarian algorithm is a bipartite matching algorithm, and we directly use the function from the SciPy library: https://github.com/qizekun/ShapeLLM/blob/main/ReConV2/models/ReCon.py#L285.
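As a rough illustration of how `scipy.optimize.linear_sum_assignment` can pair learnable image queries with multi-view features (a simplified stand-in with made-up shapes, not the actual code at ReCon.py#L285):

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_queries_to_views(query_feat, view_feat):
    # query_feat: (Q, D) learnable image-query features
    # view_feat:  (V, D) per-view image features (e.g. from the 2D teacher)
    q = torch.nn.functional.normalize(query_feat, dim=-1)
    v = torch.nn.functional.normalize(view_feat, dim=-1)
    cost = -(q @ v.T)                                   # negative cosine similarity as cost
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return rows, cols                                   # query rows[i] is matched to view cols[i]

# toy usage: match 4 queries to 4 rendered views, then apply a per-pair loss
rows, cols = match_queries_to_views(torch.randn(4, 512), torch.randn(4, 512))
```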
Thanks a lot for your reply~
From my point of view, there are two different stages of RECON++:

First, the 3D point clouds, (multi-view) images, and text are used to train RECON++, which is the 3D encoder for the MLLM paradigm, with timm/eva_large_patch14_336.in22k_ft_in1k as the base model. Besides, RECON++ uses some learnable queries, including img queries and text queries, as assistants to utilize the 2D image and text information in the datasets to enhance the 3D information (using the multi-view 2D loss and text loss instead of only the 3D pts reconstruction loss from MAE). A rough sketch of how I picture the combined objective is below.
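Something like this is what I have in mind for the stage-1 objective (a rough sketch; the loss forms, names, and weights are my assumptions, not the actual implementation):

```python
import torch.nn.functional as F

def stage1_loss(rec_pts_loss, img_query_feat, view_teacher_feat,
                text_query_feat, text_teacher_feat, w_img=1.0, w_text=1.0):
    # multi-view 2D loss: (Hungarian-matched) img queries vs. per-view teacher features
    img_loss = 1 - F.cosine_similarity(img_query_feat, view_teacher_feat, dim=-1).mean()
    # text loss: text query vs. text-encoder teacher feature
    text_loss = 1 - F.cosine_similarity(text_query_feat, text_teacher_feat, dim=-1).mean()
    # total = MAE point reconstruction + cross-modal distillation terms
    return rec_pts_loss + w_img * img_loss + w_text * text_loss
```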
BTW, the base ViT eva_large_patch14_336.in22k_ft_in1k doesn't seem to be equipped with CrossBlocks (maybe it only provides the plain Block used by RECON++?), so does RECON++ train the CrossBlocks from scratch? My mental model of a CrossBlock is sketched below.
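For reference, this is the kind of generic cross-attention block I have in mind (an assumption about what CrossBlocks does, not the ReConV2 implementation):

```python
import torch
import torch.nn as nn

class ToyCrossBlock(nn.Module):
    """Generic cross-attention block: queries attend to point tokens."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, queries, tokens):
        # cross-attention: queries as Q, point tokens as K/V
        kv = self.norm_kv(tokens)
        attn_out, _ = self.attn(self.norm_q(queries), kv, kv)
        x = queries + attn_out
        return x + self.mlp(self.norm_mlp(x))

# toy usage: 8 learnable queries attending to 64 point tokens
out = ToyCrossBlock()(torch.randn(2, 8, 768), torch.randn(2, 64, 768))
```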
Then comes the MLLM part: the 3D encoder params are frozen and the inference function of MaskTransformer (the encoder) is called. At this stage, only point cloud data is supported, since the inference function only accepts the pts input param, roughly as sketched below.
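A sketch of my understanding of this stage; the stand-in class below is fake, and only the `inference` method name and `pts` argument come from the code:

```python
import torch
import torch.nn as nn

# Stand-in for the pretrained 3D encoder; in reality this would be the
# MaskTransformer loaded from a ReCon++ checkpoint. Only `inference(pts)`
# mirrors my reading of the code; everything else is a placeholder.
class FakeMaskTransformer(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.proj = nn.Linear(3, dim)

    def inference(self, pts):
        return self.proj(pts)            # pretend these are point tokens

recon_encoder = FakeMaskTransformer()
for p in recon_encoder.parameters():
    p.requires_grad = False              # 3D encoder is frozen in the MLLM stage
recon_encoder.eval()

pts = torch.randn(1, 8192, 3)            # only point clouds are fed at this stage
with torch.no_grad():
    point_tokens = recon_encoder.inference(pts)   # no img / text inputs here
# point_tokens would then be projected and fed to the LLM as 3D visual tokens
```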
However, the 3D encoder achieves better performance because of the additional training data (multi-view 2D data and text data) used when training RECON++.
Can you tell me if there is any deviation in my understanding? Thanks a lot!
Hi, Yes, we believe that MAE pretraining can achieve local geometric understanding, while cross-modal pretraining with teacher distillation can acquire semantic understanding.
Thanks for your great work~ I have some difficulties when reading the source code.
Q1: The original version of RECON supports multimodal input (pts, img, text), but RECON++ doesn't seem to support multi-view images as input?
The original RECON: https://github.com/qizekun/ReCon/blob/main/models/ReCon.py#L275-L322
Q2: And where can I find the Hungarian Algorithm code for multi-view images in the src code?