mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

hjeun commented 8 months ago

0. Summary

Part of the mPLUG series; targets text-rich images (Document, Webpage, Table, Chart, Natural Image)
Adaptive Crop (UReader) + Multimodality-Adaptive Module (mPLUG-Owl2) + H-Reducer (Proposed)
H-Reducer uses a 1x4 Conv. to produce features suited to text and to reduce the number of visual tokens
Datasets are built for the five domains above (from public datasets)
A structure (annotation) is defined for each domain so that training is efficient

1. mPLUG Series (from Alibaba)

2. Target Domain

3. Keypoints

4. Architecture

5. Unified Structure Learning

Parsing rules are defined for the Document, Table, Chart, Text Recognition, and Text Grounding tasks
Document Parsing: uses the pdfplumber library. Lists the text only. Similar to SynthDoG.
Table Parsing: Markdown format; special tokens ('', '') are added to represent cells spanning multiple rows or columns
Chart Parsing: Markdown format
Text Recognition: coordinates in the 0~999 range. Defined at word, phrase, line, and block levels.
Text Grounding: coordinates in the 0~999 range. Defined at word, phrase, line, and block levels.
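
The 0~999 coordinate range used for Text Recognition and Text Grounding can be illustrated with a small sketch. This is only an assumption about how pixel boxes might be mapped into that range: the function name `normalize_bbox`, the `(x1, y1, x2, y2)` box layout, and the rounding and clamping rule are hypothetical and not taken from the paper.

```python
def normalize_bbox(bbox, image_width, image_height, bins=1000):
    """Map a pixel-space (x1, y1, x2, y2) box to integers in [0, bins - 1].

    Hypothetical helper: the rounding and clamping rule is an assumption,
    not the paper's documented procedure.
    """
    x1, y1, x2, y2 = bbox

    def scale(value, size):
        # Scale into [0, bins) and clamp so that a point on the right or
        # bottom edge of the image maps to bins - 1 instead of bins.
        return min(bins - 1, max(0, int(value * bins / size)))

    return (
        scale(x1, image_width),
        scale(y1, image_height),
        scale(x2, image_width),
        scale(y2, image_height),
    )


# A box covering the lower-right part of a 448x448 image.
print(normalize_bbox((112, 224, 336, 448), 448, 448))
```

Normalizing this way makes the coordinates independent of the input resolution, so the same target format works for word, phrase, line, and block level boxes.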

DocStruct4M
H-Reducer
- A layer built on a 1x4 Conv. that takes into account the way text is laid out in documents
Multimodality-Adaptive Module (mPLUG-Owl2)
- Performs self-attention separately for image and text
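
As a rough illustration of the 1x4 merging idea: a convolution whose kernel and stride are both 1x4 is equivalent to concatenating every 4 horizontally adjacent features and applying one shared linear projection. The sketch below shows only the grouping step on nested lists. The learned projection is left out, and the function name `merge_horizontal` is hypothetical.

```python
def merge_horizontal(grid, group=4):
    """Concatenate every `group` horizontally adjacent feature vectors per row.

    `grid` is a list of rows; each row is a list of feature vectors (lists).
    The learned projection that a real conv layer would apply is omitted.
    """
    merged = []
    for row in grid:
        if len(row) % group != 0:
            raise ValueError("row width must be divisible by group")
        merged_row = []
        for start in range(0, len(row), group):
            combined = []
            for feature in row[start:start + group]:
                combined.extend(feature)
            merged_row.append(combined)
        merged.append(merged_row)
    return merged


# Toy 2x8 grid of 1-dimensional features; each feature holds its column index.
grid = [[[col] for col in range(8)] for _ in range(2)]
merged = merge_horizontal(grid)
print(len(merged), len(merged[0]))  # rows are kept, columns shrink 4x
print(merged[0])
```

Only the width shrinks, so features that sit side by side on the same text line are fused while rows stay separate.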

JihoonJ commented 8 months ago

Haha, I came in to write this up myself and found it was already written!! Sharing a few additional points.

Models
- Visual Encoder: ViT/L-14, 448x448, output 1024 sequence
- Adapter: H-Reducer, 1-layer cnn + MLP, output 256 sequence
  - cnn: 1x4 kernel, 1x4 stride
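
The sequence lengths above follow from simple grid arithmetic, sketched here. The helper names `vit_grid` and `reduced_grid` are hypothetical; the numbers (448x448 input, patch size 14, 1x4 kernel and stride) come from the list above.

```python
def vit_grid(image_size, patch_size):
    """Patch grid (rows, cols) for a square image split into square patches."""
    side = image_size // patch_size
    return side, side


def reduced_grid(grid, kernel):
    """Grid after a conv whose stride equals its kernel (non-overlapping)."""
    rows, cols = grid
    kernel_h, kernel_w = kernel
    return rows // kernel_h, cols // kernel_w


grid = vit_grid(448, 14)              # 32 x 32 patches
h_reducer = reduced_grid(grid, (1, 4))  # width shrinks 4x
square = reduced_grid(grid, (2, 2))     # the 2x2 conv used in the ablation

print(grid, grid[0] * grid[1])
print(h_reducer, h_reducer[0] * h_reducer[1])
print(square, square[0] * square[1])
```

Both the 1x4 and the 2x2 setting reduce 1024 tokens to 256, so the ablation compares the shape of the merge rather than the token budget.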
Ablation
- How does H-Reducer compare with a 2x2 conv? --> Better on VQA, but not by a wide margin
- How does H-Reducer compare with a 2x2 conv? --> A fairly clear improvement on OCR

JihoonJ commented 8 months ago

I shared this with Hyunjun a while ago, but I am currently training a C-Adapter variant that applies the same effect as H-Reducer's 1x4 kernel. I'll share the results once they are out!
