paperswithlove / papers-we-read


DeepSeek-VL: Towards Real-World Vision-Language Understanding #1

Open JihoonJ opened 8 months ago

JihoonJ commented 8 months ago


Links

  1. Arxiv: https://arxiv.org/abs/2403.05525 (pdf)
  2. Code: https://github.com/deepseek-ai/DeepSeek-VL

Summary

  1. An LMM architecture that uses a dual encoder
  2. A well-organized paper covering the architecture, dataset construction, training pipeline, and ablation studies
  3. Two of its topics are closely related to what we discussed in the recent group meeting
    • A brief note on vision encoder selection
    • An ablation study on the dual-encoder configuration and the adaptor structure

Highlights

  1. Dual Encoder
    • Text-aligned encoder for coarse semantic extraction (384, SigLIP-L); a shape check for this branch is sketched below
    • High-resolution encoder that captures detailed visual information (1024, SAM-B)
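A minimal sketch for checking the text-aligned branch's output shape. The checkpoint name is my guess for a SigLIP-L/384 model on the HuggingFace Hub and may differ from what the DeepSeek-VL repo actually wires up:

```python
import torch
from transformers import SiglipVisionModel

# SigLIP-L at 384x384 input, patch size 16 -> 24 x 24 = 576 patch tokens
siglip = SiglipVisionModel.from_pretrained("google/siglip-large-patch16-384")

with torch.no_grad():
    tokens = siglip(torch.randn(1, 3, 384, 384)).last_hidden_state
print(tokens.shape)  # expected: torch.Size([1, 576, 1024])
```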


  2. Dual Encoder Feature Concatenation (see the PyTorch sketch below)
    • 384 --> 24 x 24 x 1024
    • 1024 --> 64 x 64 x 256 --> interpolation --> 96 x 96 x 256 --> 2 conv layers --> 24 x 24 x 1024
    • concatenation: 24 x 24 x 2048
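A minimal PyTorch sketch of the shape bookkeeping above. The kernel sizes, strides, and activation of the two conv layers are my assumptions; the paper only states that two conv layers take 96 x 96 x 256 down to 24 x 24 x 1024:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighResDownsample(nn.Module):
    """Assumed layout: 96x96x256 -> 48x48x512 -> 24x24x1024 via two stride-2 convs."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(512, 1024, kernel_size=3, stride=2, padding=1)

    def forward(self, x):                          # x: (B, 256, 64, 64) from SAM-B
        x = F.interpolate(x, size=(96, 96), mode="bilinear")
        x = F.gelu(self.conv1(x))                  # GELU between convs is an assumption
        return self.conv2(x)                       # (B, 1024, 24, 24)

siglip_feat = torch.randn(1, 1024, 24, 24)         # text-aligned branch, already 24 x 24 x 1024
sam_feat = torch.randn(1, 256, 64, 64)             # high-resolution branch
high = HighResDownsample()(sam_feat)
fused = torch.cat([siglip_feat, high], dim=1)      # (B, 2048, 24, 24), i.e. 24 x 24 x 2048
```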


  3. Vision-Language Adaptor (Hybrid)
    • A 1-layer MLP is applied independently to each encoder's output, the two are concatenated channel-wise, and another 1-layer MLP is applied (sketched below)
    • 384 --> 24 x 24 x 1024 --> MLP1
    • 1024 --> 24 x 24 x 1024 --> MLP2
    • Concat(MLP1, MLP2) --> 24 x 24 x 2048 --> MLP3
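A sketch of the hybrid adaptor as described above; `llm_dim` here is a placeholder, not the embedding width DeepSeek-VL actually projects into:

```python
import torch
import torch.nn as nn

class HybridAdaptor(nn.Module):
    def __init__(self, dim=1024, llm_dim=2048):
        super().__init__()
        self.mlp1 = nn.Linear(dim, dim)            # independent MLP for the SigLIP branch
        self.mlp2 = nn.Linear(dim, dim)            # independent MLP for the SAM branch
        self.mlp3 = nn.Linear(2 * dim, llm_dim)    # shared MLP after concatenation

    def forward(self, siglip_tokens, sam_tokens):
        # both inputs: (B, 24*24, 1024) token sequences
        x = torch.cat([self.mlp1(siglip_tokens), self.mlp2(sam_tokens)], dim=-1)
        return self.mlp3(x)                        # (B, 576, llm_dim)

out = HybridAdaptor()(torch.randn(1, 576, 1024), torch.randn(1, 576, 1024))
print(out.shape)  # expected: torch.Size([1, 576, 2048])
```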


  4. Vision Encoder Selection (conclusion similar to SPHINX-X)

[figures: vision encoder selection results]

runhani commented 8 months ago

Oh! That matches the direction we're going in!

hjeun commented 8 months ago

https://huggingface.co/facebook/sam-vit-base
https://huggingface.co/facebook/sam-vit-large
https://huggingface.co/facebook/sam-vit-huge
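All three checkpoints expose the same ViT-plus-neck vision encoder, which produces the 64 x 64 x 256 image embedding that DeepSeek-VL's high-resolution branch starts from. A quick sketch for checking that, assuming `SamModel.get_image_embeddings` behaves as described in the transformers docs:

```python
import torch
from transformers import SamModel

model = SamModel.from_pretrained("facebook/sam-vit-base")
with torch.no_grad():
    emb = model.get_image_embeddings(torch.randn(1, 3, 1024, 1024))
print(emb.shape)  # expected: torch.Size([1, 256, 64, 64])
```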