pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

pymupdf4llm worse than pymupdf on multi-column case. pymupdf4llm merges columns along sentences. #113

Closed (pseudotensor closed 1 month ago)

pseudotensor commented 2 months ago

Tabasco_Ingredients_Products_Guide.pdf

pymupdf4llm blends the 3 columns:


Crushed pepper blend of A coarse, pulp-like A pungent blend of aged red
seed and skin with visible consistency with less sweet, peppers fermented with salt and
particulate identity. fermented and vinegar notes. mixed with distilled vinegar for a

tomato paste-like consistency.

Scoville: 61000–71000 Scoville: 55000–75000 Scoville: 25000–40000 Moisture Level: 50.0–55.0% Moisture Level: 45.0–60.0% Moisture Level: 70.0–80.0%

PRODUCT PRODUCT PRODUCT
APPLICATIONS APPLICATIONS APPLICATIONS

-  Pickled items -  Baked goods -  Soup & sauce bases

-  Ground & processed -  Liquid beverages -  Stewed items
meats

**DRY FORMULATIONS**

**CRUSHED RED PEPPER** **DRY RED FLAVORING**

A more flavorful substitute for traditional crushed red pepper Crushed pepper powder prepared from aged pepper
with strong spicy notes from aged pepper seeds and skins. mash—screened, dried and milled, for 10 times the heat of

TABASCO[®] Original Red Sauce.

Scoville: 60000–130000 Scoville: 73500–101500 Moisture: < 10.0% Moisture: 10.0% maximum

PRODUCT APPLICATIONS PRODUCT APPLICATIONS

-  Oils & extracts -  Breadings

-  Spice blends -  Meat seasonings

**ORIGINAL RED SPRAY DRY FLAVORING** **CHIPOTLE SPRAY DRY FLAVORING**

Fine particles with a smooth, refined consistency that Fine particles with a smooth, refined consistency that
delivers flavor before heat without adding moisture. delivers flavor before heat without adding moisture.

Scoville: 2500–7500 Scoville: 2500–7500 Moisture: 10.0% maximum Moisture: 10.0% maximum

PRODUCT APPLICATIONS PRODUCT APPLICATIONS

-  Dairy -  Dairy

-  Seasonings & rubs -  Seasonings & rubs

pymupdf does fine in that sense:

Crushed pepper blend of 
seed and skin with visible 
particulate identity.
PRODUCT 
APPLICATIONS 
• Pickled items
• 
Ground & processed 
meats
Scoville: 61000–71000
Moisture Level: 50.0–55.0% 
CRUSHED RED PEPPER
A more flavorful substitute for traditional crushed red pepper 
with strong spicy notes from aged pepper seeds and skins.
PRODUCT APPLICATIONS 
• Oils & extracts
• Spice blends
Scoville: 60000–130000
Moisture: < 10.0%
DRY RED FLAVORING
Crushed pepper powder prepared from aged pepper 
mash—screened, dried and milled, for 10 times the heat of 
TABASCO® Original Red Sauce.
PRODUCT APPLICATIONS 
• Breadings 
• Meat seasonings
Scoville: 73500–101500
Moisture: 10.0% maximum
ORIGINAL RED SPRAY DRY FLAVORING
Fine particles with a smooth, refined consistency that 
delivers flavor before heat without adding moisture.
PRODUCT APPLICATIONS 
• Dairy
• Seasonings & rubs
Scoville: 2500–7500
Moisture: 10.0% maximum
PEPPER PASTE
A pungent blend of aged red 
peppers fermented with salt and 
mixed with distilled vinegar for a 
tomato paste-like consistency.
PRODUCT 
APPLICATIONS 
• Soup & sauce bases 
• Stewed items
Scoville: 25000–40000
Moisture Level: 70.0–80.0%
GROUND WET SEED
A coarse, pulp-like 
consistency with less sweet, 
fermented and vinegar notes.
PRODUCT 
APPLICATIONS 
• Baked goods
• Liquid beverages 
Scoville: 55000–75000
Moisture Level: 45.0–60.0%
DRY FORMULATIONS
INTERMEDIATE MOISTURE FORMULATIONS
CHIPOTLE SPRAY DRY FLAVORING
Fine particles with a smooth, refined consistency that 
delivers flavor before heat without adding moisture.
PRODUCT APPLICATIONS 
• Dairy
• Seasonings & rubs
Scoville: 2500–7500
Moisture: 10.0% maximum

JorjMcKie commented 2 months ago

In contrast to PyMuPDF text extraction (which ignores all non-text content), PyMuPDF4LLM tries to make sense of all page content: images, vector graphics, tables and text. It works its way around non-text elements, identifies any tables and merges them with non-table text. All of that is done without first OCRing the page and running an a priori layout analysis on the result (as many other packages do). If a document like yours contains a mixture of all these object types, some of them additionally serving as background, chances are high that this logic gets confused.

You have to decide what you want first and then choose the adequate way of extracting.

pseudotensor commented 2 months ago

Ok, but pymupdf does fine, and even if I start a copy mouse drag over the text, it's clear which parts of the text belong together, and pymupdf4llm violates this.
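
To make the "parts of text that belong together" idea concrete, here is a minimal sketch of reading text blocks column by column. It is not how either library orders text internally. It assumes blocks shaped like the first five fields of the tuples returned by PyMuPDF's `page.get_text("blocks")`, that is `(x0, y0, x1, y1, text)`. The coordinates below are made up for illustration, and `group_into_columns` and `column_reading_order` are hypothetical helper names.

```python
from typing import Dict, List, Tuple

# (x0, y0, x1, y1, text): the first five fields of a PyMuPDF "blocks" tuple
Block = Tuple[float, float, float, float, str]


def group_into_columns(blocks: List[Block]) -> List[List[Block]]:
    """Group blocks whose horizontal extents overlap into one column.

    Columns are returned left to right, blocks inside a column top to bottom.
    """
    columns: List[Dict] = []
    for block in sorted(blocks, key=lambda b: b[0]):
        x0, _, x1, _, _ = block
        for col in columns:
            # horizontal overlap with an existing column -> same column
            if x0 < col["x1"] and x1 > col["x0"]:
                col["x0"] = min(col["x0"], x0)
                col["x1"] = max(col["x1"], x1)
                col["blocks"].append(block)
                break
        else:
            columns.append({"x0": x0, "x1": x1, "blocks": [block]})
    columns.sort(key=lambda c: c["x0"])
    return [sorted(c["blocks"], key=lambda b: b[1]) for c in columns]


def column_reading_order(blocks: List[Block]) -> str:
    """Join block texts column by column instead of row by row."""
    return "\n".join(b[4] for col in group_into_columns(blocks) for b in col)


# Hypothetical coordinates; the texts are taken from the PDF discussed above.
blocks: List[Block] = [
    (40, 100, 180, 140, "Crushed pepper blend of seed and skin with visible particulate identity."),
    (220, 100, 360, 140, "A coarse, pulp-like consistency with less sweet, fermented and vinegar notes."),
    (400, 100, 560, 150, "A pungent blend of aged red peppers fermented with salt and mixed with distilled vinegar for a tomato paste-like consistency."),
    (40, 160, 180, 175, "Scoville: 61000–71000"),
    (220, 160, 360, 175, "Scoville: 55000–75000"),
    (400, 160, 560, 175, "Scoville: 25000–40000"),
]

print(column_reading_order(blocks))
```

With these sample blocks, each description is followed by its own Scoville line rather than by the neighbouring column's description. The sketch has an obvious limitation: a full-width block (for example a section heading spanning the page) overlaps every column and would merge them, so such blocks would need separate handling.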

JorjMcKie commented 2 months ago

An example may help you better understand. The following script first removes all images from every page and then extracts what remains (text and vector graphics). It also disables all header detection logic, since headers play no role in this example. This should look better.

from pathlib import Path
import pymupdf
import pymupdf4llm

# first remove all images from all pages
doc = pymupdf.open("input.pdf")
for page in doc:
    page.add_redact_annot(page.rect)
    page.apply_redactions(
        images=pymupdf.PDF_REDACT_IMAGE_REMOVE,
        graphics=pymupdf.PDF_REDACT_LINE_ART_NONE,
        text=pymupdf.PDF_REDACT_TEXT_NONE,
    )
# extract markdown from cleaned file
md = pymupdf4llm.to_markdown(doc, hdr_info=False)
Path(doc.name.replace(".pdf", ".md")).write_bytes(md.encode())

pseudotensor commented 2 months ago

It's better for that particular issue, but there is still a problem: the "DRY RED FLAVORING" section is somehow chopped up and dispersed among the others.

**PROCESSOR’S BLEND**

Crushed pepper blend of
seed and skin with visible
particulate identity.

Scoville: 61000–71000 Moisture Level: 50.0–55.0%

PRODUCT
APPLICATIONS

-  Pickled items

-  Ground & processed
meats

**GROUND WET SEED**

A coarse, pulp-like
consistency with less sweet,
fermented and vinegar notes.

Scoville: 55000–75000 Moisture Level: 45.0–60.0%

PRODUCT
APPLICATIONS

-  Baked goods

-  Liquid beverages

**PEPPER PASTE**

A pungent blend of aged red
peppers fermented with salt and
mixed with distilled vinegar for a
tomato paste-like consistency.

Scoville: 25000–40000 Moisture Level: 70.0–80.0%

PRODUCT
APPLICATIONS

-  Soup & sauce bases

-  Stewed items

**DRY FORMULATIONS**

**DRY RED FLAVORING**

**CRUSHED RED PEPPER**

A more flavorful substitute for traditional crushed red pepper
with strong spicy notes from aged pepper seeds and skins.

Scoville: 60000–130000 Moisture: < 10.0%

PRODUCT APPLICATIONS

-  Oils & extracts

-  Spice blends

**ORIGINAL RED SPRAY DRY FLAVORING**

Fine particles with a smooth, refined consistency that
delivers flavor before heat without adding moisture.

Scoville: 2500–7500 Moisture: 10.0% maximum

PRODUCT APPLICATIONS

-  Dairy

-  Seasonings & rubs

Crushed pepper powder prepared from aged pepper
mash—screened, dried and milled, for 10 times the heat of
TABASCO[®] Original Red Sauce.

Scoville: 73500–101500 Moisture: 10.0% maximum

PRODUCT APPLICATIONS

-  Breadings

-  Meat seasonings

**CHIPOTLE SPRAY DRY FLAVORING**

Fine particles with a smooth, refined consistency that
delivers flavor before heat without adding moisture.

Scoville: 2500–7500 Moisture: 10.0% maximum

PRODUCT APPLICATIONS

-  Dairy

-  Seasonings & rubs
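
The displaced heading above shows up as a run of bold lines with no body text between them ("DRY FORMULATIONS", "DRY RED FLAVORING", "CRUSHED RED PEPPER"). As a rough diagnostic, and only as a sketch, one could scan the Markdown for such runs. The helper name `heading_runs` is hypothetical, and the check assumes headings are emitted as standalone `**...**` lines, as in the output above.

```python
import re
from typing import List

# A line consisting only of bold text, e.g. "**DRY RED FLAVORING**"
BOLD_LINE = re.compile(r"^\*\*(.+?)\*\*$")


def heading_runs(markdown: str, min_run: int = 2) -> List[List[str]]:
    """Return runs of consecutive bold-only lines (blank lines are ignored)."""
    runs: List[List[str]] = []
    current: List[str] = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines do not break a run
        match = BOLD_LINE.match(stripped)
        if match:
            current.append(match.group(1))
        else:
            if len(current) >= min_run:
                runs.append(current)
            current = []
    if len(current) >= min_run:
        runs.append(current)
    return runs


# Excerpt of the Markdown output quoted above
sample = """**DRY FORMULATIONS**

**DRY RED FLAVORING**

**CRUSHED RED PEPPER**

A more flavorful substitute for traditional crushed red pepper
with strong spicy notes from aged pepper seeds and skins.

**ORIGINAL RED SPRAY DRY FLAVORING**

Fine particles with a smooth, refined consistency that
delivers flavor before heat without adding moisture.
"""

print(heading_runs(sample))
```

A run of two can be legitimate (a section heading directly followed by a product heading), so runs of three or more are the more useful signal that a heading has been separated from its body.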