InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding - Githubissues

paperswithlove / papers-we-read

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding #10

Open runhani opened 6 months ago

runhani commented 6 months ago

https://arxiv.org/abs/2403.15377 https://github.com/OpenGVLab/InternVideo2/

What does it take to build a model that understands video?

In the end, it needs data to train on!
To generate the captions:

Video Captioner
Audio Captioner
Speech Captioner

To make the model smart, start with easy labels and move gradually to harder ones (following a curriculum).
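The easy-to-hard ordering could be sketched roughly as below. This is only an illustration of the curriculum idea, not the paper's training code: the `Sample` type, the `difficulty` score, and `curriculum_stages` are all hypothetical names made up for this note.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    clip_id: str
    label: str
    difficulty: float  # hypothetical score (assumption): lower means easier


def curriculum_stages(samples: list[Sample], num_stages: int = 3) -> list[list[Sample]]:
    """Split samples into stages ordered from easiest to hardest."""
    if num_stages <= 0:
        raise ValueError("num_stages must be positive")
    ordered = sorted(samples, key=lambda s: s.difficulty)
    if not ordered:
        return []
    # Ceiling division so every sample lands in some stage.
    stage_size = -(-len(ordered) // num_stages)
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


if __name__ == "__main__":
    data = [
        Sample("c1", "a dog runs", 0.2),
        Sample("c2", "a man explains why the engine stalls", 0.9),
        Sample("c3", "a cat sleeps", 0.1),
        Sample("c4", "two people argue about a map", 0.6),
        Sample("c5", "a car drives", 0.3),
    ]
    for stage_no, stage in enumerate(curriculum_stages(data), start=1):
        print(stage_no, [s.clip_id for s in stage])
```

Training would then walk through the stages in order, so the model sees the easy labels first.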

Video Encoder

It looks at only 8 frames from the entire video and extracts 14x14 patches from each frame.
Does that even make sense? It still performs well like this?
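To get a feel for how small this input is, here is a rough sketch of the arithmetic. Assumptions that are not in the note above: frames are picked uniformly (at segment midpoints), "14x14" is read as the patch size in pixels, and frames are resized to 224x224. `sample_frame_indices` and `tokens_per_clip` are hypothetical helper names, not functions from the InternVideo2 repo.

```python
def sample_frame_indices(total_frames: int, num_frames: int = 8) -> list[int]:
    """Pick num_frames indices spread evenly over the video (segment midpoints)."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    segment = total_frames / num_frames
    # Clamp to the last valid index; short videos simply repeat frames.
    return [min(total_frames - 1, int(segment * (i + 0.5))) for i in range(num_frames)]


def tokens_per_clip(image_size: int = 224, patch_size: int = 14, num_frames: int = 8) -> int:
    """Count patch tokens for one clip (image_size=224 is an assumption)."""
    patches_per_side = image_size // patch_size
    return num_frames * patches_per_side * patches_per_side


if __name__ == "__main__":
    # A 10-second clip at 30 fps has 300 frames, but only 8 are used.
    print(sample_frame_indices(300))
    print(tokens_per_clip())
```

Under these assumptions a whole clip becomes 8 x 16 x 16 = 2048 patch tokens, no matter how long the video is. If "14x14" instead means a 14x14 grid of patches per frame, the count would be 8 x 196 = 1568.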

So is it actually usable?

Conclusion

[x] Is video input still premature? Is it a future that is almost here? Is it something to prepare for in advance? That is the question...