
Idefics3: Building and better understanding vision-language models: insights and future directions #46


runhani commented 2 months ago

Some Links

New Dataset: Docmatix
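
For anyone who wants to peek at the data, here is a minimal sketch for streaming a few samples with the `datasets` library. The hub id `HuggingFaceM4/Docmatix`, the split name, and the field layout are assumptions on my part, so check the dataset card first.

```python
# Minimal sketch: stream a few Docmatix samples for inspection.
# Assumptions: the dataset is hosted at "HuggingFaceM4/Docmatix" and exposes a
# "train" split; a config name may also be required depending on the repo layout.
from datasets import load_dataset

ds = load_dataset("HuggingFaceM4/Docmatix", split="train", streaming=True)

for i, sample in enumerate(ds):
    print(sample.keys())  # e.g. document images plus generated QA pairs
    if i == 2:
        break
```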

More training stages than expected...


So... what actually matters here?

Self-Attention (BLIP-2, ...) vs Cross-Attention (Flamingo, ...)

| Metric | Self-Attention Architecture | Cross-Attention Architecture |
| --- | --- | --- |
| Total Parameters | 8.3B | 10B |
| Newly Initialized Parameters (M) | 740 | 2500 |
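
To make the comparison concrete, here is a minimal PyTorch sketch (not the paper's code) of the two fusion styles: the self-attention variant only adds a small modality projection and lets the LLM's existing self-attention mix image and text tokens in one sequence, while the cross-attention variant inserts new cross-attention blocks into the LLM, which is where the much larger count of newly initialized parameters comes from. All module names, dimensions, and layer counts below are toy values chosen for illustration.

```python
# Toy contrast of the two VLM fusion styles; sizes are illustrative, not Idefics3's.
import torch
import torch.nn as nn

D_VISION, D_TEXT, N_IMG_TOKENS, N_TXT_TOKENS = 64, 128, 16, 32


class SelfAttentionFusion(nn.Module):
    """Project image features into the text embedding space and concatenate them
    with the text tokens; the LLM's ordinary self-attention mixes the modalities.
    Newly initialized parameters: only the small modality projection."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_VISION, D_TEXT)  # modality projection (new params)

    def forward(self, img_feats, txt_embeds):
        img_tokens = self.proj(img_feats)                  # (B, N_img, D_text)
        return torch.cat([img_tokens, txt_embeds], dim=1)  # one sequence for the LLM


class CrossAttentionFusion(nn.Module):
    """Keep image features outside the text sequence and let text tokens attend to
    them through extra cross-attention layers interleaved in the LLM.
    Newly initialized parameters: every inserted cross-attention block (much larger)."""
    def __init__(self, n_new_layers=2):
        super().__init__()
        self.img_proj = nn.Linear(D_VISION, D_TEXT)
        self.xattn = nn.ModuleList([
            nn.MultiheadAttention(D_TEXT, num_heads=4, batch_first=True)
            for _ in range(n_new_layers)
        ])

    def forward(self, img_feats, txt_embeds):
        img_kv = self.img_proj(img_feats)
        h = txt_embeds
        for layer in self.xattn:
            attn_out, _ = layer(h, img_kv, img_kv)  # queries = text, keys/values = image
            h = h + attn_out                        # simplified residual (no Flamingo gating)
        return h


if __name__ == "__main__":
    img = torch.randn(1, N_IMG_TOKENS, D_VISION)
    txt = torch.randn(1, N_TXT_TOKENS, D_TEXT)
    count = lambda m: sum(p.numel() for p in m.parameters())
    print("self-attention fusion, new params:", count(SelfAttentionFusion()))
    print("cross-attention fusion, new params:", count(CrossAttentionFusion()))
    print(SelfAttentionFusion()(img, txt).shape, CrossAttentionFusion()(img, txt).shape)
```

Running it prints the newly initialized parameter count for each toy module, which mirrors, at toy scale, why the cross-attention design adds far more new weights than the self-attention design in the table above.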

So, what's the conclusion??