sghong977 / Daily_AIML

Computer Vision, Deep Learning, plus some dabbling in MLOps and so on. I write up what I newly learn each day.

[Survey, paper review] ViT-Adapter, flash attention, ...... #40

Open sghong977 opened 3 months ago

sghong977 commented 3 months ago

Vision Transformer Adapter for Dense Predictions

Info.

Summary

Questions before reading the paper

sghong977 commented 3 months ago

What's special about the vision adapter?

It is a general-purpose model that draws on multi-modal knowledge, which gives it more flexibility. It is composed of:

image Q. Why is the adapter part kept separate from the main ViT model?

sghong977 commented 3 months ago

Related Works

Transformers

Decoders for ViT

adapter

image

sghong977 commented 3 months ago

Model Structure

This figure alone pretty much explains it.

Q. why training-free? image

Let's think it through using segmentation as an example.

Uni-Perceiver pretraining scheme image


Moving on to the ablation study

1. ViT vs ViT-Adapter feature

Building on the previously known properties shown below, the paper uses a Fourier transform to analyze how ViT-Adapter behaves -> ViT-Adapter has learned more high-frequency content, so it can pick up high-frequency information the way a CNN does. That is roughly the claim.

image This looks like material meant to further justify the adapter's role, in response to the kind of doubts I rambled about above...

2. There is also a comparison of attention methods
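The Fourier-based comparison described above can be illustrated with a small sketch. This is not the paper's analysis code: `high_freq_ratio`, its `radius_frac` cutoff, and the two synthetic maps are assumptions made for illustration, standing in for real ViT and ViT-Adapter feature maps.

```python
import numpy as np

def high_freq_ratio(feature_map: np.ndarray, radius_frac: float = 0.25) -> float:
    """Fraction of spectral amplitude lying outside a centered low-frequency disk.

    Hypothetical metric for illustration; radius_frac is an arbitrary cutoff.
    """
    # 2D FFT, shifted so the zero-frequency bin sits at the center.
    amplitude = np.abs(np.fft.fftshift(np.fft.fft2(feature_map)))
    h, w = feature_map.shape
    yy, xx = np.mgrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)
    low_mask = dist <= radius_frac * min(h, w)
    total = amplitude.sum()
    return float(amplitude[~low_mask].sum() / total) if total > 0 else 0.0

# Synthetic stand-ins for feature maps (assumption: real ones would come from the models).
n = 32
yy, xx = np.mgrid[:n, :n]
smooth = np.cos(2 * np.pi * xx / n)        # one cycle across the map: low frequency
checker = ((xx + yy) % 2).astype(float)    # flips every pixel: highest frequency

ratio_smooth = high_freq_ratio(smooth)
ratio_checker = high_freq_ratio(checker)
print(ratio_smooth, ratio_checker)
```

A map dominated by fine detail gives a larger ratio than a smooth one, which is the kind of difference the paper's claim about high-frequency information refers to.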

image


I did not bring over the ablation that removes each component. With more parameters it is bound to get better anyway...

This one compares the parameter counts of the adapter and the ViT. image

sghong977 commented 3 months ago

Anyway, I am going to use this for segmentation, and since the paper says it used BEiT and Mask2Former, I need to look at what those are too. image

1. Mask2Former

This paper looks like it would be fun to read closely, but I have no time, so I only skimmed it for now. Later...

image

image

2. recent multi-modal pre-training BEiTv2

Beit v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366, 2022

3. Uni-Perceiver

I only skimmed this very roughly, so here is what I am wondering about:

  1. Video, image, and text all go into a single input. Do they share the same meaning, or are completely unrelated items fed in at random?
  2. It is said to train with cosine similarity, but what exactly does that mean? What is the relationship between x and y? What kind of pairs is it trained on? It computes a joint probability distribution and maximizes the log-likelihood.

Oh, with this it makes sense. image
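As a rough illustration of "compute a joint probability from cosine similarity and maximize the log-likelihood", here is a minimal sketch. It assumes the probability of a pair is proportional to exp(cos(x, y) / tau) and is normalized over a small candidate set; the function name `cosine_nll`, the temperature `tau=0.1`, and the toy embeddings are all made up for illustration and are not taken from the Uni-Perceiver code.

```python
import numpy as np

def cosine_nll(x_emb: np.ndarray, y_embs: np.ndarray, target_idx: int, tau: float = 0.1) -> float:
    """Negative log-likelihood of the target candidate under a softmax over cosine similarities."""
    # Normalize so that dot products become cosine similarities.
    x = x_emb / np.linalg.norm(x_emb)
    ys = y_embs / np.linalg.norm(y_embs, axis=1, keepdims=True)
    logits = ys @ x / tau
    # Log-softmax with the usual max-subtraction for numerical stability.
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[target_idx])

# Toy embeddings (assumption): candidate 0 points the same way as x, candidate 2 the opposite way.
x = np.array([1.0, 0.0])
candidates = np.array([[2.0, 0.1], [0.0, 1.0], [-1.0, 0.0]])

loss_match = cosine_nll(x, candidates, target_idx=0)
loss_mismatch = cosine_nll(x, candidates, target_idx=2)
print(loss_match, loss_mismatch)
```

Minimizing this loss pulls the input and its paired target together in the shared space, so the pairs would have to be inputs and targets that belong together rather than random combinations.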

sghong977 commented 3 months ago

Conclusion

Segmentation these days is no joke. To avoid an overly large and complex model, I deliberately skipped the top SOTA model and picked one that is not mixed with a text model yet still ranks near SOTA, and even so I could see that an enormous amount of work is packed into building this single component technology.

First, the encoder side usually aims to be well-generalized and general-purpose, so people want to use whatever packs in the latest techniques. This paper uses the ViT architecture as is. If you design yet another task-specific architecture for vision, it becomes hard to reuse backbones that someone else trained at huge scale and released. In practice, plain ViTs do not do well out of the box on vision tasks where local priors matter, such as detection and segmentation, but the paper shows that simply attaching an adapter lets them learn high-frequency content the way CNNs do. As for which recent ViT backbones were brought in, BEiTv2 and Uni-Perceiver are examples. BEiT takes the masked image modeling (MIM) training scheme, the same family as Masked Autoencoder, and refines it with extra ideas; it seems to be a paper about how to train a backbone. That paper itself is not multi-modal. Uni-Perceiver proposes a way to train a ViT so that all modalities end up in a single representation space. In any case, the method seems to make effective use of a variety of backbones such as DeiT, AugReg, BEiT, Uni-Perceiver, and BEiTv2.

The adapter has relatively few parameters, so it is simply attached fresh and trained when fine-tuning on the downstream task. Adapters are reportedly common in NLP, and this brings them to vision. The adapter structure is simple, so I will just skip it...

The decoder is task-specific. As an aside, when fine-tuning SAM it is reportedly also recommended to leave the encoder as is and train only the decoder. So the paper reuses decoders that are common in existing segmentation models. UperNet is a somewhat old paper, but Swin+UperNet performed decently when Swin Transformer came out, which is probably why it was used, and they also attached the more recent Mask2Former. The masked attention idea was interesting.

Anyway, it looks like a paper that makes good use of existing pieces and builds a solid argument on top of them.
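For reference, the masked attention idea in Mask2Former restricts each query's cross-attention to the region its previous mask prediction marked as foreground. A minimal sketch, assuming standard scaled dot-product attention and at least one foreground position per query; `masked_attention`, the shapes, and the random inputs are illustrative and not the paper's implementation.

```python
import numpy as np

def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray, fg_mask: np.ndarray):
    """Cross-attention where query i may only attend to positions with fg_mask[i, j] == True.

    Assumes every row of fg_mask has at least one True entry.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Background positions get -inf, so their softmax weight is exactly zero.
    scores = np.where(fg_mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy setup (assumption): 2 queries, 5 image positions, feature size 4, value size 3.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))
k = rng.normal(size=(5, 4))
v = rng.normal(size=(5, 3))
fg_mask = np.array([[True, True, False, False, False],
                    [False, False, False, True, True]])

out, weights = masked_attention(q, k, v, fg_mask)
print(weights.round(3))
```

Each row of the printed weights is non-zero only inside that query's foreground region.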

image

sghong977 commented 3 months ago

Also, the Uni-Perceiver v2 paper seems to be out already. CVPR23. "Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks"

The original paper felt focused on how to train image-text-video well, without touching vision tasks separately... so its tasks were things like retrieval, I think, whereas here the tasks have become far more varied. image

The current SOTA is also a generalized model. Unlike the earlier models, it even takes audio. Looking at it now, I clearly could not read recent papers like this without knowing the earlier line of work.... There are traces that this paper was rejected from ICLR 2024.

At this point I am curious about the reason for the rejection...... it has already been cited a lot. 1) the model architecture is the same as prior work such as VLMO which does not bring new findings or insights; 2) the paper highlights the method can generalize to unlimited modalities but only evaluates on three modalities. The rebuttal did not address these concerns well. Therefore, the AC recommends rejection.

sghong977 commented 3 months ago

Ah, I only meant to keep this simple... I am fine-tuning ViT-Adapter right now, so I just wanted a quick look, and what is all this. One thing after another got dragged in.

sghong977 commented 3 months ago

There is something called InternViT.

I bring it up because ViT-Adapter is supported there as well. Of course, speed matters to me right now, so I probably will not go that far....

sghong977 commented 3 months ago

But the paper only ever says that the pre-training-free adapter is trained during fine-tuning, and that the ViT backbone can be adapted without changing its architecture <- since that is all it says, I do not think it means the ViT backbone does not need training or can be frozen. This part seemed doubtful, so I checked both the paper and the code... for example, in ViT-Adapter's BEiT backbone code, this is the only place where requires_grad=False is set. The default should be True... image When I ask ChatGPT, it answers as if the backbone were training-free, so it remains puzzling.
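One way to settle this kind of doubt is to list which parameters are actually frozen in the built model. A minimal sketch, assuming PyTorch; the toy model below is NOT the ViT-Adapter code, and the names `backbone`, `adapter`, and `summarize_trainable` are hypothetical.

```python
import torch.nn as nn

def summarize_trainable(model: nn.Module):
    """Return (names of frozen parameters, trainable parameter count, total parameter count)."""
    frozen = [name for name, p in model.named_parameters() if not p.requires_grad]
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return frozen, trainable, total

# Toy stand-in for a backbone plus adapter (hypothetical names and sizes).
model = nn.ModuleDict({"backbone": nn.Linear(8, 8), "adapter": nn.Linear(8, 2)})

# nn.Parameter defaults to requires_grad=True, so nothing is frozen yet.
frozen_before, trainable_before, total = summarize_trainable(model)

# Freezing has to be done explicitly, for example by the training script or config.
for p in model["backbone"].parameters():
    p.requires_grad = False
frozen_after, trainable_after, _ = summarize_trainable(model)

print(frozen_before, trainable_before, total)
print(frozen_after, trainable_after)
```

Running the same summary on the real model after it is constructed would show whether the backbone weights are being updated during fine-tuning.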