Kosmos-2.5 - Image-to-markdown generation for images outside the sample-set provided is almost entirely garbled - output markdown is completely unusable.

Describe the bug

Model I am using: Kosmos-2.5

The problem arises when using:

[x] the official example scripts: the supplied inference.py script, run with the exact custom libraries and dependencies it requires. The results are the same whether the script runs bare-metal or behind a simple Flask API in a containerized environment: https://github.com/abgulati/kosmos-2_5-containerized

Description: Image-to-markdown generation for images outside the sample-set provided is almost entirely garbled - output markdown is completely unusable.

Elaborating in the examples below:

Example 1 - Using the sample in.png example image provided with the model:

Sample-1-in

On running the inference.py script with --do_md for image-to-markdown generation:

1-in-png-response

Isolating the results:

2-in-png-response-isolated

Cleaning the results:

3-in-png-response-cleaned

Perfect markdown output as rendered via https://markdownlivepreview.com/:

4-in-png-markdown-preview

This confirms the model is working correctly!

Example 2 - Table from a Boeing manual:

Sample-2-Boeing

Output of inference.py script with --do_md for image-to-markdown generation:

5-Boeing-Backgrounder-Response

Copying, cleaning and generating a markdown preview of the results - completely garbled & unusable output:

6-Boeing-Backgrounder-Markdown-Preview

Example 3 - Table of network connectors from my notes for the CompTIA Network+ exam:

Sample-3-Connectors

Output of inference.py script with --do_md for image-to-markdown generation:

7-Connectors-Reponse

Copying, cleaning and generating a markdown preview of the results - completely garbled & unusable output:

8-Connectors-Markdown-Preview

Example 4 - Table of common ports and services from my notes for the CompTIA Network+ & Security+ exams:

Sample-4-Ports

Output of inference.py script with --do_md for image-to-markdown generation:

9-Ports-Response

Copying, cleaning and generating a markdown preview of the results - completely garbled & unusable output:

10-Ports-Markdown-Response

As demonstrated by these examples, markdown-generation for images outside the sample (training?) set is completely garbled and unusable. The first example establishes the model itself is working correctly.

Further, --do_ocr works perfectly and outputs high-accuracy, high-quality data.
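To put a rough number on "garbled" rather than judging by eye, one option is to check whether the generated markdown at least forms a structurally consistent pipe table. The sketch below is a heuristic written for this report, not part of Kosmos-2.5. The function names are hypothetical, it assumes rows start with `|`, and it ignores escaped pipes inside cells.

```python
def table_column_counts(markdown: str) -> list[int]:
    """Return the number of cells in each pipe-table row found in `markdown`."""
    counts = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped.startswith("|"):
            continue  # not treated as a table row by this simple heuristic
        inner = stripped[1:]  # drop the leading pipe
        if inner.endswith("|"):
            inner = inner[:-1]  # drop the trailing pipe, if present
        counts.append(len(inner.split("|")))
    return counts


def looks_like_consistent_table(markdown: str) -> bool:
    """True when at least two table rows exist and all have the same width."""
    counts = table_column_counts(markdown)
    return len(counts) >= 2 and len(set(counts)) == 1
```

A table whose header, separator and body rows all have the same number of cells passes. Output where the row widths drift, as in the previews above, fails. This says nothing about whether the cell contents are correct, only whether the table structure survived.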

To Reproduce

Steps to reproduce the behavior:

Run the model for markdown generation: python3 inference.py --do_md --image path/image.png --ckpt ckpt.pt
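Since --do_ocr gives good results on the same images, it may help to run both modes back to back on each image and compare the outputs. The following is a minimal sketch of that. It assumes the image flag is spelled `--image` and the checkpoint flag `--ckpt` (check the argument parser in inference.py), and that it is run from the directory containing inference.py. The helper names `build_command` and `run_both_modes` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_command(image: Path, ckpt: Path, mode: str) -> list[str]:
    """Build the inference.py command line for one mode ("md" or "ocr")."""
    if mode not in ("md", "ocr"):
        raise ValueError(f"unknown mode: {mode!r}")
    return [
        sys.executable, "inference.py",
        f"--do_{mode}",
        "--image", str(image),  # assumed flag spelling
        "--ckpt", str(ckpt),    # assumed flag spelling
    ]


def run_both_modes(image: Path, ckpt: Path) -> dict[str, str]:
    """Run markdown and OCR inference on one image and collect stdout."""
    outputs = {}
    for mode in ("md", "ocr"):
        result = subprocess.run(
            build_command(image, ckpt, mode),
            capture_output=True, text=True, check=False,
        )
        outputs[mode] = result.stdout
    return outputs
```

For example, `run_both_modes(Path("path/image.png"), Path("ckpt.pt"))` returns a dictionary with the raw stdout of each mode, which can then be saved side by side for each test image.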

Expected behavior

Reasonably accurate markdown generation

Platform: WSL Ubuntu 22.04
Python version: v3.10.12
PyTorch version (GPU?): 2.5.0.dev20240705+cu124 for RTX 3090
Detailed system specs:

Intel Core i9 13900KF
Nvidia RTX 3090FE
32GB DDR5 5600MT/s (16x2)
Windows 11 - OS Build 22631.3737
CUDA 12.4

Flash-Attention-2 (v2.5.9.post1)
tiktoken 0.7.0
tqdm 4.66.4
omegaconf 2.0.6 (hydra-core 1.0.7)
boto3 1.34.140
iopath 0.1.10
fairscale 0.4.0
scipy 1.10.0
triton 2.3.1
https://github.com/facebookresearch/xformers.git@04de99bb28aa6de8d48fab3cdbbc9e3874c994b8
https://github.com/Dod-o/kosmos2.5_tools.git@fairseq
https://github.com/Dod-o/kosmos2.5_tools.git@infinibatch
https://github.com/Dod-o/kosmos2.5_tools.git@torchscale
https://github.com/Dod-o/kosmos2.5_tools.git@transformers
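
For anyone trying to match this environment, the installed versions can be collected programmatically instead of copied by hand. This is a small sketch using the standard library's `importlib.metadata`. The distribution names in `PACKAGES` are my assumption of how the packages above are registered with pip (for example `flash-attn` and `hydra-core`), so adjust them if a lookup reports "not installed".

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed pip distribution names for the dependencies listed above.
PACKAGES = [
    "torch", "flash-attn", "tiktoken", "tqdm", "omegaconf", "hydra-core",
    "boto3", "iopath", "fairscale", "scipy", "triton", "xformers",
]


def collect_versions(packages: list[str]) -> dict[str, str]:
    """Map each distribution name to its installed version, if any."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
    return report
```

Calling `collect_versions(PACKAGES)` and printing each name and version gives a list in the same shape as the one above, which makes it easier to spot a version mismatch between two machines.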

microsoft/unilm, issue #1602