
Idefics3: Building and better understanding vision-language models: insights and future directions #46


runhani commented 2 months ago

Some Links

New Dataset: Docmatix
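
For anyone who wants to peek at the data, here is a minimal sketch for streaming a few samples with the `datasets` library. The hub id `HuggingFaceM4/Docmatix`, the split name, and the field layout are assumptions on my part, so check the dataset card first.

```python
# Minimal sketch: stream a few Docmatix samples for inspection.
# Assumptions: the dataset is hosted at "HuggingFaceM4/Docmatix" and exposes a
# "train" split; a config name may also be required depending on the repo layout.
from datasets import load_dataset

ds = load_dataset("HuggingFaceM4/Docmatix", split="train", streaming=True)

for i, sample in enumerate(ds):
    print(sample.keys())  # e.g. document images plus generated QA pairs
    if i == 2:
        break
```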

More training stages than expected...


So... what actually matters here?

Self-Attention (BLIP-2, ...) vs Cross-Attention (Flamingo, ...)

| Metric | Self-Attention Architecture | Cross-Attention Architecture |
| --- | --- | --- |
| Total Parameters | 8.3B | 10B |
| Newly Initialized Parameters (M) | 740 | 2500 |
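
To make the comparison concrete, here is a minimal PyTorch sketch (not the paper's code) of the two fusion styles: the self-attention variant only adds a small modality projection and lets the LLM's existing self-attention mix image and text tokens in one sequence, while the cross-attention variant inserts new cross-attention blocks into the LLM, which is where the much larger count of newly initialized parameters comes from. All module names, dimensions, and layer counts below are toy values chosen for illustration.

```python
# Toy contrast of the two VLM fusion styles; sizes are illustrative, not Idefics3's.
import torch
import torch.nn as nn

D_VISION, D_TEXT, N_IMG_TOKENS, N_TXT_TOKENS = 64, 128, 16, 32


class SelfAttentionFusion(nn.Module):
    """Project image features into the text embedding space and concatenate them
    with the text tokens; the LLM's ordinary self-attention mixes the modalities.
    Newly initialized parameters: only the small modality projection."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_VISION, D_TEXT)  # modality projection (new params)

    def forward(self, img_feats, txt_embeds):
        img_tokens = self.proj(img_feats)                  # (B, N_img, D_text)
        return torch.cat([img_tokens, txt_embeds], dim=1)  # one sequence for the LLM


class CrossAttentionFusion(nn.Module):
    """Keep image features outside the text sequence and let text tokens attend to
    them through extra cross-attention layers interleaved in the LLM.
    Newly initialized parameters: every inserted cross-attention block (much larger)."""
    def __init__(self, n_new_layers=2):
        super().__init__()
        self.img_proj = nn.Linear(D_VISION, D_TEXT)
        self.xattn = nn.ModuleList([
            nn.MultiheadAttention(D_TEXT, num_heads=4, batch_first=True)
            for _ in range(n_new_layers)
        ])

    def forward(self, img_feats, txt_embeds):
        img_kv = self.img_proj(img_feats)
        h = txt_embeds
        for layer in self.xattn:
            attn_out, _ = layer(h, img_kv, img_kv)  # queries = text, keys/values = image
            h = h + attn_out                        # simplified residual (no Flamingo gating)
        return h


if __name__ == "__main__":
    img = torch.randn(1, N_IMG_TOKENS, D_VISION)
    txt = torch.randn(1, N_TXT_TOKENS, D_TEXT)
    count = lambda m: sum(p.numel() for p in m.parameters())
    print("self-attention fusion, new params:", count(SelfAttentionFusion()))
    print("cross-attention fusion, new params:", count(CrossAttentionFusion()))
    print(SelfAttentionFusion()(img, txt).shape, CrossAttentionFusion()(img, txt).shape)
```

Running it prints the newly initialized parameter count for each toy module, which mirrors, at toy scale, why the cross-attention design adds far more new weights than the self-attention design in the table above.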

So, what's the conclusion??