long8v / PTIR

Paper Today I Read

๐Ÿ… ์งญ์งค์ด ๋…ผ๋ฌธ ๋ชจ์•„๋†“๊ธฐ (CLIP) #165

Closed long8v closed 1 month ago

long8v commented 3 months ago

๊ฑฐ์˜ scheming๋งŒ ํ–ˆ๋˜ ๋…ผ๋ฌธ ๋ชจ์•„๋†“๋Š” ๊ณณ. notion์— ์ •๋ฆฌ์ค‘์ด์—ˆ์œผ๋‚˜ link๋ฅผ ๊ฑธ๊ธฐ๊ฐ€ ์–ด๋ ค์›Œ์„œ ์˜ฎ๊น€.

long8v commented 3 months ago

SIEVE: MULTIMODAL DATASET PRUNING USING IMAGE CAPTIONING MODELS

https://arxiv.org/pdf/2310.02110.pdf A Meta paper. The approach: filter pairs using a captioning model plus sentence similarity (all-MiniLM-L6).

image image
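
A minimal sketch of how I picture the filter, assuming `sentence-transformers` with `all-MiniLM-L6-v2` for the similarity part; the captioning step is left abstract (the `generated_caption` argument stands in for whatever captioner the paper actually uses), and the threshold is made up:

```python
# Hedged sketch of the SIEVE idea: caption each image with an off-the-shelf
# captioning model, then keep the pair only if the generated caption is
# semantically close to the original alt-text.
from sentence_transformers import SentenceTransformer, util

sim_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def sieve_keep(alt_text: str, generated_caption: str, threshold: float = 0.5) -> bool:
    """Keep the pair only if the generated caption is close to the original alt-text."""
    emb = sim_model.encode([alt_text, generated_caption], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold
```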
long8v commented 3 months ago

Mutual Information Divergence: A Unified Metric for Multimodal Generative Models

image

The idea is to score a pair by the pointwise mutual information divergence between the distribution of CLIP scores for the ground-truth (reference) image and caption and the distribution for the image and caption being evaluated. Honestly I didn't really understand it.
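
For reference, the quantity they build on is ordinary pointwise mutual information; my (possibly wrong) understanding is that MID estimates it with Gaussian fits over CLIP features of the reference pairs and then scores the evaluated pair against that:

```latex
% Pointwise mutual information between an image feature x and a text feature y.
% MID, as I read it, estimates p(x, y), p(x), p(y) with multivariate Gaussians
% fit on CLIP embeddings of the reference pairs.
\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}
```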

long8v commented 3 months ago

CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment

image

CLIP์—์„œ QA ํ˜•ํƒœ๋กœ ์ฃผ๋ฉด ๊ฑฐ์˜ ๋ชปํ•˜๊ณ  ์ด๊ฑธ rewrite ์‹œ์ผœ์„œ caption ํ˜•ํƒœ๋กœ ๋งŒ๋“ค๊ณ  ๊ฐ€์žฅ ์œ ์‚ฌ๋„ ๋†’์€๊ฑธ๋กœ ํ•˜๋‹ˆ๊นŒ ์ž˜ ํ’€๋ ธ๋‹ค๋Š” ๋…ผ๋ฌธ~

image
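
A toy sketch of the scoring step with Hugging Face CLIP; the string template here is a crude stand-in for the paper's actual question-to-statement rewriting, and the model name is just an example:

```python
# Turn (question, candidate answer) into caption-like strings and let CLIP
# pick the most similar one to the image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def answer_vqa(image: Image.Image, question: str, candidates: list[str]) -> str:
    # Crude rewriting; the paper does proper question-to-statement conversion.
    captions = [f"{question.rstrip('?')} {c}" for c in candidates]
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, num_candidates)
    return candidates[logits.argmax(dim=-1).item()]
```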
long8v commented 3 months ago

Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding

Nothing special..! It just plugs candidate words into the masked position and predicts whichever one scores highest.

image
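
A rough sketch of that probing setup, assuming a plain `bert-base-uncased` fill-mask model and single-token candidate words (both assumptions on my part):

```python
# Insert each candidate word at the masked position and pick the one the
# masked language model scores highest.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def pick_word(template: str, candidates: list[str]) -> str:
    """template contains one [MASK], e.g. 'The color of a banana is [MASK].'"""
    inputs = tok(template, return_tensors="pt")
    mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = mlm(**inputs).logits[0, mask_pos]                 # vocabulary scores at [MASK]
    cand_ids = [tok.convert_tokens_to_ids(c) for c in candidates]  # assumes single-token words
    return candidates[logits[cand_ids].argmax().item()]
```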
long8v commented 3 months ago

Parrot Captions Teach CLIP to Spot Text

image image

https://arxiv.org/pdf/2312.14232.pdf

They inspected the text-image pairs in LAION-2B with the highest CLIP scores, and they were all images containing rendered text. Image-text pairs whose image contains text tend to get higher CLIP scores than ones that don't, so the takeaway is to watch out for this bias when filtering with CLIP score.

long8v commented 3 months ago

COYO

image

https://github.com/kakaobrain/coyo-dataset

Turns out they actually never released a paper for this one..

image

They also released a multi-label classification dataset. It's machine-labeled by a model trained on ImageNet-21k, but ImageNet performance comes out comparable to JFT-300M.

image
long8v commented 3 months ago

DEMYSTIFYING CLIP DATA

long8v commented 3 months ago

FILIP

image

→ They describe it as early fusion and argue that this is what makes the model fine-grained.

image image image

The capability that emerges is that the per-token similarities become easy to interpret, as shown below. → Not sure what this actually buys you.. ⇒ explainability?!

image
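
A minimal sketch of the token-level similarity as I understand it from the figures: each image token is matched against its best text token and vice versa, and the maxima are averaged. Variable names and normalization are my own:

```python
# FILIP-style token-level (late-interaction) similarity between one image and one text.
import torch

def filip_similarity(img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
    """img_tokens: (n_img, d), txt_tokens: (n_txt, d), both assumed L2-normalized."""
    sim = img_tokens @ txt_tokens.T            # (n_img, n_txt) token-pair similarities
    i2t = sim.max(dim=1).values.mean()         # each image token's best-matching text token
    t2i = sim.max(dim=0).values.mean()         # each text token's best-matching image token
    return 0.5 * (i2t + t2i)
```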
long8v commented 3 months ago

Improving CLIP Training with Language Rewrites

https://arxiv.org/abs/2305.20088

I really only skimmed this, but it seems they take just the caption (e.g. from CC3M) and have it rewritten? Supposedly you get richer expressions and more diversity?

image

They don't replace the caption outright; they generate several rewrites and sample one of them each time.

image
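
A tiny sketch of that sampling idea with a made-up helper class; the point is just that whenever an image is drawn, one caption is sampled from the original plus its rewrites:

```python
# Keep the original caption and its pre-generated rewrites, sample one per draw.
import random

class RewrittenCaptionSampler:
    def __init__(self, original: str, rewrites: list[str]):
        self.pool = [original] + rewrites      # original caption is never thrown away

    def sample(self) -> str:
        return random.choice(self.pool)

# e.g. RewrittenCaptionSampler("a dog on grass", ["a puppy playing on a lawn"]).sample()
```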

Performance improves quite a bit..

long8v commented 3 months ago

ALIGN

https://arxiv.org/pdf/2102.05918.pdf No special tricks: just scale up the training corpus a lot and train with the CLIP loss, and it works well.

image image
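
For context, the "CLIP loss" here is the usual symmetric InfoNCE over a batch of image/text embeddings; this is a generic sketch, not ALIGN's actual code:

```python
# Symmetric contrastive (InfoNCE) loss: matched image/text pairs sit on the diagonal.
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature   # (B, B) pairwise similarities
    targets = torch.arange(len(logits))          # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```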
long8v commented 3 months ago

Improving fine-grained understanding in image-text pre-training

Another fine-grained CLIP variant.

image image

What's different from FILIP is that, within a positive image-text pair, they cut the similarities with a threshold and then turn them into weights,

image

then take a weighted sum with the vision features to get a language-grouped vision embedding and apply a contrastive loss on top of that!

image
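
A rough sketch of my reading of that grouping step (threshold value, normalization, and names are guesses on my part):

```python
# Inside one positive pair: threshold the text-token/image-patch similarities,
# renormalize them into weights, and pool patches into one "language-grouped"
# vision embedding per text token.
import torch
import torch.nn.functional as F

def language_grouped_vision(txt_tokens: torch.Tensor, img_patches: torch.Tensor,
                            threshold: float = 0.1) -> torch.Tensor:
    """txt_tokens: (n_txt, d), img_patches: (n_img, d); returns (n_txt, d)."""
    sim = F.normalize(txt_tokens, dim=-1) @ F.normalize(img_patches, dim=-1).T  # (n_txt, n_img)
    sim = torch.where(sim > threshold, sim, torch.zeros_like(sim))              # sparsify
    weights = sim / sim.sum(dim=-1, keepdim=True).clamp(min=1e-6)               # per-token weights
    return weights @ img_patches                                                # grouped embeddings
```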

๋งŽ์ด ๋‹ค๋ฆ„! COCO retrieval์ด ๋งŽ์ด ์˜ค๋ฆ„

image

They also report gains on open-vocabulary object detection (measured by plugging it in as the OWL-ViT backbone).

image

That said, they use a huge amount of data haha

image
long8v commented 2 months ago

This thread is getting really long.. CLIP research sure is hot..

Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations (a.k.a. ConCLIP)

image image

CLIP์ด ๋ถ€์ •ํ‘œํ˜„์„ ๋ชปํ•ด์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”๊ฐ€ํ•ด์ค˜์„œ ๋„ฃ์–ด์คฌ๊ณ  Finegrained contrastive๋„ ๋„ฃ์–ด์คฌ๋‹ค๊ณ  ํ•จ

long8v commented 2 months ago

WHEN AND WHY VISION-LANGUAGE MODELS BEHAVE LIKE BAGS-OF-WORDS, AND WHAT TO DO ABOUT IT? (a.k.a NegCLIP)

CLIP์˜ contrastive loss์˜ ์„ฑ๊ฒฉ ์ƒ bag-of-words ์‹์œผ๋กœ ํ•™์Šต๋˜์—ˆ๋‹ค.

image

They propose the ARO benchmark: https://github.com/mertyg/vision-language-models-are-bows. NegCLIP, trained with hard negatives, does much better on shuffled relations and the like.

image
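
A crude sketch of the kind of hard negative involved, as I understand it: perturb word order so the bag of words stays the same but the relations break, then add it as an extra negative in the contrastive batch. The swap function is my own simplification:

```python
# Make a word-order hard negative from the true caption.
import random

def swap_word_hard_negative(caption: str) -> str:
    words = caption.split()
    if len(words) < 2:
        return caption
    i, j = random.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]      # same words, broken relation
    return " ".join(words)

# e.g. "the horse is eating the grass" -> "the grass is eating the horse"-style negatives
```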
long8v commented 2 months ago

Exposing and Mitigating Spurious Correlations for Cross-Modal Retrieval

image

On the spurious cues CLIP latches onto in cross-modal retrieval.

long8v commented 2 months ago

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

image

CatLIP. I looked at it because it seemed to analyze the frequency of ImageNet classes versus the classes present in the training data.

image

Someone already summarized it lol: https://devocean.sk.com/blog/techBoardDetail.do?ID=165861&boardType=techBlog. Basically it's a paper about turning CLIP pretraining into a classifier with no text encoder.
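
A loose sketch of how I understand the recipe: nouns extracted from each caption become multi-hot targets over a fixed vocabulary, and the image encoder is trained with plain multi-label BCE, no text encoder. Everything here (vocabulary handling, names) is illustrative:

```python
# Multi-label classification loss over caption nouns instead of a contrastive loss.
import torch
import torch.nn.functional as F

def catlip_style_loss(image_logits: torch.Tensor, caption_nouns: list[list[str]],
                      vocab: dict[str, int]) -> torch.Tensor:
    """image_logits: (B, |vocab|) from a classification head on the image encoder."""
    targets = torch.zeros_like(image_logits)
    for b, nouns in enumerate(caption_nouns):
        for n in nouns:
            if n in vocab:
                targets[b, vocab[n]] = 1.0       # multi-hot target from caption nouns
    return F.binary_cross_entropy_with_logits(image_logits, targets)
```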

long8v commented 2 months ago

MoDE: CLIP Data Experts via Clustering

image

Feels like CLIP + meta-learning.

image
long8v commented 1 month ago

When are Lemons Purple? The Concept Association Bias of Vision-Language Models

image

For an image containing multiple objects, prompting "A lemon is [MASK]" returns not yellow but the color of the eggplant next to it.

image

์ด ์ด์œ ๋Š” ๊ทธ๋Ÿด๋“ฏํ•œ๋ฐ ใ…‹ใ…‹ lemon์—๋Š” ์ด๋ฏธ ๋…ธ๋ž€์ƒ‰์ด ์žˆ์œผ๋ฏ€๋กœ ๊ฐ€์žฅ ๊ฐ€๊น๊ฒŒ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด์„œ๋Š” ๊ฐ€์ง€์˜ ์ƒ‰์ธ ๋ณด๋ผ์ƒ‰์„ ๋„ฃ๋Š”๊ฒŒ ๊ฐ€์žฅ ์ข‹์€ ๊ฒƒ !

image

Turns out there's a dataset for this: Natural-Color Dataset (NCD).