junyuan-fang / Vision-Language-on-3D-Scene-Understanding

MIT License

Papers to be read #3

Open junyuan-fang opened 1 year ago

junyuan-fang commented 1 year ago

● Recent works that leverage pre-training on large-scale image-text pairs, such as CLIP, show promising performance in classification, segmentation, and depth estimation.
● How to transfer this pre-training knowledge to 3D understanding tasks, such as referring point cloud segmentation, has barely been explored.

CLIP: https://arxiv.org/abs/2104.04687 https://www.youtube.com/watch?v=OZF1t_Hieq8

DenseCLIP: https://arxiv.org/abs/2112.01518

junyuan-fang commented 1 year ago

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision https://www.youtube.com/watch?v=6pzBOQAXUB8