sghong977 commented 3 months ago

Vision Transformer Adapter for Dense Predictions

Info.

Summary

plain ViT
- tends to work poorly on dense prediction because it lacks image-specific inductive bias and has only weak prior assumptions
- for a general-purpose model, the transformer structure is essential for masked data modeling and multi-modal pre-training
- but vision-specific models are stronger than plain transformers... -> an adapter can be a solution
adapter: the plain ViT can be pre-trained on large-scale multi-modal data
- the adapter itself needs no pre-training; it is attached when fine-tuning on downstream tasks
transformers for each of the various vision tasks
- such as instance, semantic, panoptic segmentation, visual grounding, detection...
achieves SOTA without using an external dataset
- 😮 isn't that performance thanks to a strong backbone already pre-trained on a multi-modal dataset?
- => The authors address this point: they compare models under a fair pre-training strategy

Questions before reading the paper

is the "adapter" concept the same as NLP's? https://intelligentcm.tistory.com/340
- Yes! The authors cite the NLP adapter paper in the introduction section.
- e.g., object detection on COCO val2017.
I saw on the GitHub repo that flash attention is applied, and this keyword shows up a lot these days. What is it?
- explanation link: after reading it, it is simply a technique for computing attention more efficiently. I asked ChatGPT and Bard, and (obviously) the result is the same as regular attention.
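To convince myself of the "same result" part, here is a minimal NumPy sketch. It only illustrates the tiling plus running-softmax idea (process keys/values block by block, rescale the partial result whenever the running max changes). It is not the actual FlashAttention kernel, and the function names and block size are made up for illustration.

```python
import numpy as np

def naive_attention(q, k, v):
    # Standard softmax attention: builds the full (n_q, n_k) score matrix.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block=4):
    # Same math, but keys/values are visited in blocks, so only an
    # (n_q, block) slice of the scores exists at any time.
    n_q, d = q.shape
    out = np.zeros((n_q, v.shape[-1]))
    row_max = np.full((n_q, 1), -np.inf)   # running max of scores per query
    row_sum = np.zeros((n_q, 1))           # running softmax denominator
    for start in range(0, k.shape[0], block):
        k_blk = k[start:start + block]
        v_blk = v[start:start + block]
        s = q @ k_blk.T / np.sqrt(d)
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        rescale = np.exp(row_max - new_max)  # 0 on the first block
        p = np.exp(s - new_max)
        row_sum = row_sum * rescale + p.sum(axis=-1, keepdims=True)
        out = out * rescale + p @ v_blk
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))
k = rng.normal(size=(10, 8))
v = rng.normal(size=(10, 3))
print(np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v)))
```

The two functions agree up to floating-point error; the gain is in memory traffic, not in the output.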

sghong977 commented 3 months ago

What's special about the vision adapter?

The plain ViT stays a general-purpose model that can draw on multi-modal knowledge, which gives more flexibility. The adapter is composed of:

(1) a spatial prior module for capturing the local semantics (spatial prior) from input images
(2) a spatial feature injector for incorporating spatial prior into the ViT
(3) a multi-scale feature extractor to reconstruct the multi-scale features required by dense prediction tasks.

Q. Why is the adapter kept separate from the main ViT model?

to inject the image prior without redesigning the architecture of ViT.
They can supplement the missing local information and reorganize fine-grained multi-scale features for dense prediction tasks.

sghong977 commented 3 months ago

Related Works

Transformers

ViT: released in 2020
PVT, Swin: add CNN's inductive bias by adopting a pyramid structure
Conformer: the first dual network to combine CNN with transformer
MAE, BEiT: enable self-supervised learning through masked image modeling (MIM)

Decoders for ViT

The architecture for dense prediction commonly follows an encoder-decoder pattern, in which the encoder generates rich features and the decoder aggregates and translates them to the final predictions
SETR: ViT backbone (encoder), CNN decoder for semantic segmentation
Segmenter: similar but it equips a transformer-based decoder
DPT: using ViT + CNN decoder for monocular depth estimation

adapter

adapters are widely used in NLP
with the advent of CLIP, many CLIP-based adapters were presented to transfer knowledge to zero-shot or few-shot downstream tasks
ViTDet: employs upsampling & downsampling modules to adapt the plain ViT for obj detection

sghong977 commented 3 months ago

Model Structure

This alone is pretty much enough to explain it

starts from the spatial prior module, which holds multi-scale information about the input image
- of course, these features have to be tokenized for the injector/extractor operations that follow. Each scale is flattened and the results are concatenated.
repeat inject/extract -> finally, a refined feature pyramid is obtained from the adapter!
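The flatten & concat step can be sketched as a token count. This assumes the spatial prior module outputs feature maps at 1/8, 1/16 and 1/32 of the input resolution (my reading of the paper) and that the ViT uses 16x16 patches; the function names are hypothetical.

```python
def spatial_prior_token_count(height, width, strides=(8, 16, 32)):
    # Each scale is flattened to (H/s * W/s) tokens, then all scales are concatenated.
    per_scale = [(height // s) * (width // s) for s in strides]
    return per_scale, sum(per_scale)

def vit_token_count(height, width, patch=16):
    # Plain ViT: a single scale of patch tokens (class token not counted).
    return (height // patch) * (width // patch)

per_scale, total = spatial_prior_token_count(512, 512)
print(per_scale, total)            # [4096, 1024, 256] 5376
print(vit_token_count(512, 512))   # 1024
```

So for a 512x512 input the adapter side carries about 5x more tokens than the ViT side, which is where the multi-scale detail lives.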

Q. why training-free?

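Let's think it through with segmentation as the example.

Say we want to use ViT-T, S and B as the backbone of a segmentation model. Then the ImageNet-1k pretrained weights released by DeiT (the ViT knowledge distillation paper) are loaded as-is.
The adapter part seems to be randomly initialized every time? (not sure) But it is a transformer structure too, so is nothing loaded for it?
The decoder is likewise a part that has to be plugged in task-specifically
Hmm?
- The intro says the goal of this paper is to build a strong general-purpose encoder... does that mean this is what gets stored in the ViT?
- What the experiments actually mention about multi-modal is this: performance went up a lot when the ViT pretrained weights came from this method. "Uni-perceiver: Pre-training unified architecture for generic perception for zero-shot and few-shot tasks."
- How far does this paper's contribution really go..? Presenting what was done there as if it were an advantage of ViT-Adapter is a bit much
- And they did not even train this themselves, they just took the weights. So is adding the adapter the whole contribution? I need to find out what the adapter concept originally is in NLP
- Ah, so it is just: instead of fine-tuning the ViT alone, attach an adapter and then fine-tune, and performance goes up. But then didn't performance go up simply because the parameter count increased?
- Something I'm curious about: what if you load the multi-modal weights into the ViT as in that paper and then train without the adapter? Granted, the other models are not vanilla ViT structures, so they cannot reuse the weights.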

Uni-Perceiver pre-training method

Let's move on to the ablation study

1. ViT vs ViT-Adapter feature

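Building on the previously known properties below, they analyze how ViT-Adapter behaves using the Fourier transform -> ViT-Adapter has learned more high-frequency content, so it can capture high-frequency information like a CNN. That is roughly the claim.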

ViT presents the characteristics of learning low-frequency global signals
CNN tends to extract high-frequency information (e.g., local edges and textures)
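A rough way to see what "more high-frequency" means on a single feature map is below. This is a sketch under my own assumptions, not the paper's exact procedure: the cutoff of 0.25 cycles/pixel and the two synthetic maps are made up for illustration.

```python
import numpy as np

def high_freq_ratio(feature_map, cutoff=0.25):
    # Share of spectral energy whose radial frequency exceeds `cutoff`
    # (in cycles per pixel; 0.5 is the Nyquist limit along one axis).
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))
    energy = np.abs(spectrum) ** 2
    h, w = feature_map.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    return energy[radius > cutoff].sum() / energy.sum()

y, x = np.mgrid[0:32, 0:32]
smooth = np.sin(2 * np.pi * x / 32)            # slow global variation
checker = ((x + y) % 2).astype(float)          # pixel-level edges/texture

print(high_freq_ratio(smooth))    # close to 0
print(high_freq_ratio(checker))   # much larger
```

Running the same measurement on real ViT vs ViT-Adapter features (averaged over channels and images) would be the actual comparison; I have not done that here.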

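This looks like material meant to further prove the adapter's role, precisely because of the doubts I rambled about above...

2. There is also a comparison of attention methods

I did not bring over the ablation that removes each component. With more parameters it obviously gets better, so...

This is the comparison of the parameter counts of the adapter and the ViT.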
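For reference when reading such a comparison, the backbone side can be computed by hand. The sketch below counts the parameters of a standard ViT encoder (patch embedding, class token, position embedding, blocks, final norm, no classification head), assuming 16x16 patches and 224x224 pre-training resolution. The adapter's own count is not reproduced here since I only have the paper's numbers for it.

```python
def vit_param_count(dim, depth, mlp_ratio=4, patch=16, img=224, in_chans=3):
    tokens = (img // patch) ** 2 + 1                    # patch tokens + class token
    patch_embed = in_chans * patch * patch * dim + dim  # conv weight + bias
    cls_and_pos = dim + tokens * dim                    # class token + position embedding
    attn = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)    # qkv + output projection
    hidden = mlp_ratio * dim
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)    # two linear layers
    norms = 2 * 2 * dim                                 # two LayerNorms per block
    final_norm = 2 * dim
    return patch_embed + cls_and_pos + depth * (attn + mlp + norms) + final_norm

print(vit_param_count(dim=384, depth=12))   # ViT-S/16: 21665664
print(vit_param_count(dim=768, depth=12))   # ViT-B/16: 85798656
```

Whatever the adapter adds should be judged against these baselines, which is exactly the "is it just more parameters?" question.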

sghong977 commented 3 months ago

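Anyway, I am going to use this for segmentation, and since they used BEiT and Mask2Former I need to look at what those are too.

1. Mask2Former

This paper looks like it would be fun to read closely, but there is no time.. so I skimmed it for now. Later..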

Masked-attention Mask Transformer for Universal Image Segmentation (CVPR 2022)
https://github.com/facebookresearch/Mask2Former
Task: segmentation => so, as expected, the contribution is in the later stages rather than the backbone
contributions
- instance, semantic, panoptic segmentation in one model
- uses masked attention, which extracts localized features by constraining cross-attention to within the predicted mask regions
- the slow convergence of Transformer-based models is due to global context in the cross-attention layer, as it takes many training epochs for cross-attention to learn to attend to localized object regions
- the assumption is that a query does not really need every global location and that looking only at the mask region is enough (it only attends within the foreground region)
- Then how is that foreground mask obtained? How is the region that goes in for a query decided? -> Obviously it has to be based on objects -> they made the queries learnable. With that, object-like mask queries seem to come out even before they pass through the decoder.
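The masked attention bullet can be sketched in a few lines of NumPy. This is a simplified single-head version under my own assumptions: the foreground mask is passed in directly (in the paper it comes from the previous layer's mask prediction), there are no learned projections, and a query with an empty mask falls back to full attention so the softmax stays defined.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, fg_mask):
    # queries: (Q, d), keys/values: (N, d), fg_mask: (Q, N) bool,
    # True where a query is allowed to attend (its predicted foreground).
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    empty = ~fg_mask.any(axis=-1, keepdims=True)   # queries with no foreground
    allowed = fg_mask | empty                      # fall back to attending everywhere
    scores = np.where(allowed, scores, -np.inf)    # masked positions get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))
k = rng.normal(size=(6, 4))
v = rng.normal(size=(6, 4))
mask = np.zeros((2, 6), dtype=bool)
mask[0, :3] = True          # query 0 only sees locations 0..2; query 1 has an empty mask
out, w = masked_cross_attention(q, k, v, mask)
print(w.round(3))
```

Query 0 puts all of its attention on its three foreground locations, while query 1 behaves like ordinary cross-attention.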

2. recent multi-modal pre-training BEiTv2

BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers. arXiv preprint arXiv:2208.06366, 2022

https://velog.io/@_chominseo/%EB%85%BC%EB%AC%B8-%EB%A6%AC%EB%B7%B0-BEIT-v2-Masked-Image-Modeling-with-Vector-Quantized-Visual-Tokenizers
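This paper itself is not multi-modal. It seems to be something similar to Masked Autoencoder that proposes a better way of training and so produces a better backbone
So when the Uni-Perceiver paper came out, did it do multi-modal training in the BEiTv2 way?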

3. Uni-Perceiver

https://arxiv.org/pdf/2112.01522

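I only skimmed it very roughly, so here is what I am wondering

Video, image and text all go into a single input. Do they carry the same meaning, or are completely unrelated items fed in at random?
It says training uses cosine similarity, but what exactly.. what is the relation between x and y? What kind of pairs is it trained on? It computes a joint probability distribution and maximizes the log likelihood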
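My current understanding of the cosine similarity part, written as a sketch: the input x and each candidate target y are encoded into the same space, the probability of a pair is proportional to exp(cos(f(x), f(y)) / τ), and training maximizes the log probability of the correct target. The features below are toy vectors, and the temperature 0.07 is an assumed value, not taken from the paper.

```python
import numpy as np

def candidate_log_probs(input_feat, candidate_feats, temperature=0.07):
    # Cosine similarity between one input and every candidate target,
    # turned into log-probabilities with a temperature-scaled softmax.
    x = input_feat / np.linalg.norm(input_feat)
    y = candidate_feats / np.linalg.norm(candidate_feats, axis=-1, keepdims=True)
    logits = y @ x / temperature
    logits = logits - logits.max()
    return logits - np.log(np.exp(logits).sum())

# Toy example: the "image" feature and three candidate "text" features.
image_feat = np.array([1.0, 0.0])
text_feats = np.array([[2.0, 0.1],    # matching caption (nearly same direction)
                       [0.0, 1.0],    # unrelated
                       [-1.0, 0.0]])  # opposite
log_p = candidate_log_probs(image_feat, text_feats)
loss = -log_p[0]                      # negative log likelihood of the correct pair
print(log_p.argmax(), loss)
```

Under this reading, "what pairs" just means (input, correct target) pairs from each task, with the other candidates acting as negatives.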

Oh, this makes it understandable.

sghong977 commented 3 months ago

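Conclusion

Segmentation these days is no joke. To avoid an overly large and complex model, instead of using the SOTA I deliberately picked a model that is not mixed with a text model yet still sits around SOTA level, and I could see that a huge amount of work is packed into building a single component technology.

First, the encoder side usually aims for a well-generalized, general-purpose model, so one wants to use whatever packs in the latest techniques. This paper uses the ViT structure as-is. If you design yet another task-specific structure for vision, it becomes hard to reuse a backbone that someone else trained at very large scale and released. In practice, plain ViTs do not do well on vision tasks where local priors matter, such as detection and segmentation, but the paper shows that simply attaching an adapter lets them learn high-frequency information the way CNNs do. As for which recent ViT backbones were brought in, BEiTv2 and Uni-Perceiver are examples. BEiT builds on the MIM training scheme of Masked Autoencoder with some extra ideas, and seems to be a paper about how to train a backbone. That paper itself is not multi-modal. Uni-Perceiver proposes a way to train a ViT so that all modalities live in a single representation space. Anyway, it seems a variety of backbones such as DeiT, AugReg, BEiT, Uni-Perceiver and BEiTv2 can be used effectively.

The adapter has relatively few parameters, so it is meant to be newly attached and trained when fine-tuning on a downstream task. It is said to be widely used in NLP and was brought over to vision. The adapter structure is simple, so I will just skip it...

The decoder is task-specific. As an aside, when fine-tuning SAM it is also reportedly recommended to leave the encoder as it is and train only the decoder. So they use what segmentation models commonly use. UperNet is a slightly old paper, but swin+upernet performed reasonably well when Swin Transformer came out, which seems to be why it was used, and they also tried attaching the recent Mask2Former. The masked attention idea was fascinating.

Anyway.. it looks like a paper that makes good use of existing pieces and builds its argument well.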

sghong977 commented 3 months ago

And it seems the Uni-Perceiver v2 paper is already out. CVPR23. "Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks"

The original paper did not touch vision tasks separately and felt focused on how to train image-text-video well... so the tasks were things like retrieval, but here they have become far more diverse.

The current SOTA is also a generalized model. Unlike the previous model, it even takes audio. Looking at it now, I clearly could not read recent papers like this without knowing the preceding trend.... This paper shows signs of having been rejected from ICLR 2024.

At this point I am curious about the reason for rejection...... It is already cited a lot, too. 1) the model architecture is the same as prior work such as VLMO which does not bring new findings or insights; 2) the paper highlights the method can generalize to unlimited modalities but only evaluates on three modalities. The rebuttal did not address these concerns well. Therefore, the AC recommends rejection.

sghong977 commented 3 months ago

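Ah, I just meant to keep it simple... I am fine-tuning ViT-Adapter right now and only wanted a light look, and what is this. One thing after another got dragged out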

sghong977 commented 3 months ago

There is something called InternViT.

InternVL scales up the ViT to 6B parameters and aligns it with LLM.
VLP again... foundation models again...
https://github.com/OpenGVLab/InternVL-MMDetSeg

The reason I brought this up is that ViT-Adapter is supported here as well. Of course, speed matters to me right now, so I probably will not go this far....

sghong977 commented 3 months ago

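But the paper only ever says that the pre-training-free adapter is trained during fine-tuning, and that the ViT backbone can be adapted without changing its architecture. Judging from that wording, it does not seem to be saying that the ViT backbone does not need training or can be frozen. This part bothered me, so I checked the paper and the code... For example, in ViT-Adapter's BEiT backbone code there is only this one place where requires_grad=False is set. The default should be True.. When I ask ChatGPT it answers as if the backbone were training-free, which is puzzling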

sghong977 / Daily_AIML

[Survey, Paper Review] ViT-Adapter, flash attention, ...... #40
