How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
#27 · Open · hjeun opened 6 months ago

hjeun commented 6 months ago
https://arxiv.org/abs/2404.16821
Claims to outperform closed-source models
Three key factors behind the performance gains:
Strong Vision Encoder: InternViT-6B
Dynamic High Resolution: supports up to 4K input by tiling into 448x448 tiles while preserving aspect ratio
High-Quality Bilingual Dataset
InternViT-6B (continuous pre-training)
InternViT-6B: 224px, CLIP-style contrastive training
InternViT-6B-448px-V1.2: 448px, trained attached to Nous-Hermes-2-Yi-34B
InternViT-6B-448px-V1.5: 448px with 1 to 12 tiles, trained attached to InternLM2-20B
LLM: InternLM2-20B
Dynamic High-Resolution
Dynamic Aspect Ratio Matching
Aspect Ratio Preserving
Visual tokens range from 256 to 3,328 during training (1 tile up to 12 tiles + thumbnail, i.e. 13 x 256) and reach a maximum of 10,496 at test time (40 tiles + thumbnail = 41 x 256); see the tiling sketch below
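A minimal sketch of the dynamic aspect-ratio matching step, assuming PIL and a 448px tile size; the grid-selection tie-breaking and the thumbnail rule are simplified guesses at the recipe, not the authors' exact code:

```python
from PIL import Image

def dynamic_tiles(img: Image.Image, tile: int = 448, min_num: int = 1, max_num: int = 12):
    """Pick the grid (cols x rows) whose aspect ratio is closest to the image's,
    resize the image to that grid, and cut it into tile x tile crops."""
    w, h = img.size
    ratio = w / h
    # enumerate all grids with min_num <= cols * rows <= max_num
    grids = [(i, j) for i in range(1, max_num + 1) for j in range(1, max_num + 1)
             if min_num <= i * j <= max_num]
    cols, rows = min(grids, key=lambda g: abs(g[0] / g[1] - ratio))
    resized = img.resize((tile * cols, tile * rows))
    tiles = [resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
             for r in range(rows) for c in range(cols)]
    if len(tiles) > 1:  # append a global thumbnail tile for whole-image context
        tiles.append(img.resize((tile, tile)))
    return tiles
```

With max_num=12 plus the thumbnail, and 256 tokens per tile after pixel shuffle, this reproduces the 3,328-token training maximum (13 x 256).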
What is pixel shuffle? Here it is a space-to-depth rearrangement that trades spatial resolution for channel depth, cutting each 448x448 tile's 1,024 visual tokens (patch size 14, so 32x32 tokens) down to 256
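A minimal PyTorch sketch of that space-to-depth rearrangement; the function name and the hidden size of 3200 (InternViT-6B) are assumptions for illustration:

```python
import torch

def pixel_unshuffle_tokens(x: torch.Tensor, r: int = 2) -> torch.Tensor:
    """Merge each r x r block of visual tokens into one token with r^2 times
    the channels: (N, H, W, C) -> (N, H//r, W//r, C*r*r)."""
    n, h, w, c = x.shape
    x = x.view(n, h // r, r, w // r, r, c)        # split H and W into r-blocks
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()  # bring the r x r dims together
    return x.view(n, h // r, w // r, c * r * r)

# A 448x448 tile with patch size 14 gives 32x32 = 1,024 tokens;
# after shuffling with r=2 only 16x16 = 256 tokens remain.
feats = torch.randn(1, 32, 32, 3200)              # assumed InternViT-6B hidden size
print(pixel_unshuffle_tokens(feats).shape)        # torch.Size([1, 16, 16, 12800])
```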
Context Length 4096
Added a large amount of OCR data
Built with PaddleOCR; see the sketch below
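For reference, a minimal sketch of harvesting text annotations with PaddleOCR's standard API (the file name is hypothetical):

```python
from paddleocr import PaddleOCR

# lang="ch" handles both Chinese and English; use_angle_cls corrects rotated text
ocr = PaddleOCR(use_angle_cls=True, lang="ch")
result = ocr.ocr("doc_page.png", cls=True)
for box, (text, score) in result[0]:  # one entry per detected text line
    print(box, text, score)
```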
English datasets translated to Chinese
Using GPT; a sketch follows below
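A minimal sketch of the EN-to-ZH translation step, assuming an OpenAI-compatible client; the model name and prompt are guesses, since the notes only say "GPT":

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def translate_en_to_zh(text: str) -> str:
    """Translate one English sample to Chinese with an LLM."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed; the source does not name the exact model
        messages=[
            {"role": "system",
             "content": "Translate the user's text into Chinese. Preserve formatting and named entities."},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content
```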
An average of ~24 tiles gave the best performance in their ablation