TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document


TextMonkey is a large multimodal model tailored for text-centric tasks like document question answering and scene text analysis.
It introduces techniques such as Shifted Window Attention and a Token Resampler to improve performance.
The model excels in various benchmarks, showcasing improvements in accuracy and interpretability.
Its ability to understand and interact with text-heavy content sets it apart from existing models.
Readers can explore TextMonkey's capabilities for document understanding and consider its potential applications in tasks requiring text analysis and comprehension.

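Processing starts with a sliding window module that divides the input image into non-overlapping patches.
- Each patch is 448x448 pixels; these patches are further subdivided into smaller 14x14-pixel patches, and each small patch is treated as a token.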
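
The window and patch sizes above imply a fixed token count per window: 448 / 14 = 32 patches per side, so 32 x 32 = 1024 patch tokens per window before any resampling. The sketch below illustrates the non-overlapping split with NumPy. The function names are hypothetical, and it assumes the image height and width are already multiples of 448; it is not the model's actual preprocessing code.

```python
import numpy as np

WINDOW = 448  # window size stated in the notes
PATCH = 14    # patch size stated in the notes


def split_into_windows(image: np.ndarray, window: int = WINDOW) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping (window, window, C) tiles.

    Assumption: H and W are exact multiples of `window`.
    Tiles are returned in row-major order.
    """
    h, w, c = image.shape
    if h % window or w % window:
        raise ValueError("image height and width must be multiples of the window size")
    rows, cols = h // window, w // window
    # (rows, window, cols, window, C) -> (rows, cols, window, window, C)
    tiles = image.reshape(rows, window, cols, window, c).swapaxes(1, 2)
    return tiles.reshape(rows * cols, window, window, c)


def tokens_per_window(window: int = WINDOW, patch: int = PATCH) -> int:
    """Number of patch tokens in one window."""
    return (window // patch) ** 2


image = np.zeros((896, 1344, 3), dtype=np.uint8)
windows = split_into_windows(image)
print(windows.shape)        # (6, 448, 448, 3)
print(tokens_per_window())  # 1024
```
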
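Transformer blocks initialized from the pre-trained CLIP model process the tokens of each window patch separately.
Shifted Window Attention is used to connect the different patches.
For a hierarchical representation, the input image is also resized to 448x448 and passed through the CLIP model to extract a global feature.
This global feature, together with the features of the sub-images, is processed by a shared image resampler and mapped into the language domain.
A Token Resampler then compresses the token length, minimizing redundancy in the language space before the result is produced.
- To identify important tokens and remove redundant ones, the maximum cosine similarity (CMX) between each image token and all other tokens is computed.
- 1 - CMX is then calculated, and only the top r tokens by this score are selected and fed to the language model.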
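
The selection step in the two bullets above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `select_distinct_tokens` is hypothetical, it covers only the similarity-based selection (not any attention that follows), and it assumes at least two tokens.

```python
import numpy as np


def select_distinct_tokens(tokens: np.ndarray, r: int) -> np.ndarray:
    """Return the indices (in original order) of the r tokens with the
    highest 1 - CMX score, where CMX is each token's maximum cosine
    similarity to any other token.

    tokens: (N, D) array of image tokens, N >= 2.
    """
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sim = unit @ unit.T                          # (N, N) cosine similarities
    np.fill_diagonal(sim, -np.inf)               # ignore self-similarity
    cmx = sim.max(axis=1)
    importance = 1.0 - cmx                       # high score = few close neighbours
    top = np.argsort(-importance, kind="stable")[:r]
    return np.sort(top)                          # keep the original token order


# Tokens 0 and 1 are near-duplicates; tokens 2 and 3 are distinct.
tokens = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [-1.0, 0.0]])
print(select_distinct_tokens(tokens, r=2))  # [2 3]
```

Sorting the selected indices keeps the surviving tokens in their original sequence order, which is an assumption of this sketch rather than something the notes specify.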

To assess the redundancy of image features, we measure the similarity of image tokens already mapped to the language space. We randomly select 20 ordered features after the image resampler and compare pairwise similarities using cosine similarity, as shown in Fig. 3.

Figure 4: Quantitative analysis on specific redundant tokens. Using the maximum cosine similarity between each token and other tokens as a criterion for identifying redundant tokens, we plotted the threshold on the x-axis and the number of redundant tokens at different resolutions on the y-axis.
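
The quantity plotted in Figure 4 can be reproduced in outline as below. This is a hedged sketch: the function name `count_redundant` is hypothetical, and treating a token as redundant when its maximum cosine similarity is strictly greater than the threshold is an assumption, since the caption does not state the exact comparison.

```python
import numpy as np


def count_redundant(tokens: np.ndarray, thresholds) -> dict:
    """For each threshold, count tokens whose maximum cosine similarity
    to any other token exceeds that threshold.

    tokens: (N, D) array, N >= 2.
    """
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a token is not its own neighbour
    cmx = sim.max(axis=1)
    return {float(t): int((cmx > t).sum()) for t in thresholds}


# Toy data: the first two tokens are near-duplicates of each other.
toy = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [-1.0, 0.0]])
print(count_redundant(toy, [0.005, 0.5, 0.99999]))
# {0.005: 3, 0.5: 2, 0.99999: 0}
```
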

However, how can we identify important tokens and eliminate redundant ones? We have observed that certain tokens are highly unique and lack closely similar counterparts, such as the fourth token in Fig. 3.
This suggests that this token is distinct. We hypothesize that these tokens carry crucial and distinctive information, which is further validated in subsequent experiments. Therefore, we utilize similarity as a metric to identify significant tokens.
Based on the reduction of the token count, our module can also significantly improve the performance compared to random queries.

In the “Trans” mode, text is considered correct if the answer contains this word. Conversely, the “Pos” mode requires the consideration of positional information in accordance with the Mango
