paperswithlove / papers-we-read

MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens #20

Open runhani opened 5 months ago

runhani commented 5 months ago

Overview

  1. A video-input MLLM from KAUST (King Abdullah University of Science and Technology, Saudi Arabia), the group known for MiniGPT-4

Architecture

  1. Vision features come from EVA-CLIP (frozen)
  2. Video frames and their subtitles are fed in together, and the question or instruction text is appended at the end of the sequence that goes into the LLM
<s>
    <INST>
        <Img><FrameFeature_1><Sub><Subtitletext_1>... <Img> <FrameFeature_N><Sub><Subtitletext_N><Instruction>
    </INST>
</s>
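
To make the interleaving concrete, here is a minimal Python sketch of how the text side of this template could be assembled. The function name `build_interleaved_prompt` and the sample subtitles are hypothetical and not taken from the MiniGPT4-Video code. The `<FrameFeature_i>` markers are kept as plain-text placeholders; in the real model those positions would hold visual embeddings derived from the frozen EVA-CLIP encoder, which this sketch does not attempt to reproduce.

```python
from typing import Sequence


def build_interleaved_prompt(subtitles: Sequence[str], instruction: str) -> str:
    """Assemble the template text for N frames (hypothetical helper).

    Each frame contributes <Img><FrameFeature_i><Sub> followed by its subtitle.
    <FrameFeature_i> is only a placeholder for where visual features would go.
    """
    segments = []
    for i, subtitle in enumerate(subtitles, start=1):
        segments.append(f"<Img><FrameFeature_{i}><Sub>{subtitle}")
    # The question or instruction comes last, after all frame/subtitle pairs.
    body = "".join(segments) + instruction
    return f"<s><INST>{body}</INST></s>"


# Made-up subtitles and question, for illustration only.
prompt = build_interleaved_prompt(
    ["Hello there.", "Nice to meet you."],
    "What happens in the video?",
)
print(prompt)
```

Placing the instruction after every frame/subtitle pair mirrors the ordering in the template above: the LLM reads all visual and subtitle context before it reads the question.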