Closed pseudotensor closed 1 month ago
In contrast to plain PyMuPDF text extraction (which ignores everything that is not text), PyMuPDF4LLM tries to make sense of all page content: images, vector graphics, tables and text. It works its way around non-text elements, identifies any tables and merges them with the non-table text. All of this is done without first OCRing the page and running an a-priori layout analysis on the result (as many other packages do). If a document like yours contains a mixture of all these object types, some of them also serving as background, chances are high that this logic gets confused.
You have to decide first what you want and then choose the appropriate way of extracting.
OK, but pymupdf does fine, and even when I start a copy mouse drag over the text, it's clear which parts of the text belong together, and pymupdf4llm violates this.
An example may help you understand this better. The following script first removes all images from every page and then extracts what remains (text and vector graphics). It also disables all header detection logic, because headers play no role in this example. The result should look better.
from pathlib import Path
import pymupdf
import pymupdf4llm

# first remove all images from all pages
doc = pymupdf.open("input.pdf")
for page in doc:
    # cover the whole page with one redaction, but only let it remove images
    page.add_redact_annot(page.rect)
    page.apply_redactions(
        images=pymupdf.PDF_REDACT_IMAGE_REMOVE,
        graphics=pymupdf.PDF_REDACT_LINE_ART_NONE,
        text=pymupdf.PDF_REDACT_TEXT_NONE,
    )

# extract markdown from cleaned file
md = pymupdf4llm.to_markdown(doc, hdr_info=False)
Path(doc.name.replace(".pdf", ".md")).write_bytes(md.encode())
It's better for that particular issue, but there is still a problem: the "DRY RED FLAVORING" section is somehow chopped up and dispersed among the others.
**PROCESSOR’S BLEND**
Crushed pepper blend of
seed and skin with visible
particulate identity.
Scoville: 61000–71000 Moisture Level: 50.0–55.0%
PRODUCT
APPLICATIONS
- Pickled items
- Ground & processed
meats
**GROUND WET SEED**
A coarse, pulp-like
consistency with less sweet,
fermented and vinegar notes.
Scoville: 55000–75000 Moisture Level: 45.0–60.0%
PRODUCT
APPLICATIONS
- Baked goods
- Liquid beverages
**PEPPER PASTE**
A pungent blend of aged red
peppers fermented with salt and
mixed with distilled vinegar for a
tomato paste-like consistency.
Scoville: 25000–40000 Moisture Level: 70.0–80.0%
PRODUCT
APPLICATIONS
- Soup & sauce bases
- Stewed items
**DRY FORMULATIONS**
**DRY RED FLAVORING**
**CRUSHED RED PEPPER**
A more flavorful substitute for traditional crushed red pepper
with strong spicy notes from aged pepper seeds and skins.
Scoville: 60000–130000 Moisture: < 10.0%
PRODUCT APPLICATIONS
- Oils & extracts
- Spice blends
**ORIGINAL RED SPRAY DRY FLAVORING**
Fine particles with a smooth, refined consistency that
delivers flavor before heat without adding moisture.
Scoville: 2500–7500 Moisture: 10.0% maximum
PRODUCT APPLICATIONS
- Dairy
- Seasonings & rubs
Crushed pepper powder prepared from aged pepper
mash—screened, dried and milled, for 10 times the heat of
TABASCO[®] Original Red Sauce.
Scoville: 73500–101500 Moisture: 10.0% maximum
PRODUCT APPLICATIONS
- Breadings
- Meat seasonings
**CHIPOTLE SPRAY DRY FLAVORING**
Fine particles with a smooth, refined consistency that
delivers flavor before heat without adding moisture.
Scoville: 2500–7500 Moisture: 10.0% maximum
PRODUCT APPLICATIONS
- Dairy
- Seasonings & rubs
Tabasco_Ingredients_Products_Guide.pdf
pymupdf4llm blends the 3 columns:
Scoville: 61000–71000 Scoville: 55000–75000 Scoville: 25000–40000 Moisture Level: 50.0–55.0% Moisture Level: 45.0–60.0% Moisture Level: 70.0–80.0%
Scoville: 60000–130000 Scoville: 73500–101500 Moisture: < 10.0% Moisture: 10.0% maximum
Scoville: 2500–7500 Scoville: 2500–7500 Moisture: 10.0% maximum Moisture: 10.0% maximum
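To make the symptom concrete, here is a minimal pure-Python sketch of the two reading orders. It is not how PyMuPDF4LLM or PyMuPDF work internally. The block tuples `(x0, y0, x1, y1, text)`, their coordinates, and the helper names `row_major` and `column_major` are all made up for illustration. Sorting by vertical position first gives the blended lines above, while grouping by left edge first keeps each column together.

```python
# Hypothetical text blocks shaped like (x0, y0, x1, y1, text).
# Coordinates are invented; only the three-column arrangement matters.
blocks = [
    (40, 100, 180, 112, "Scoville: 61000–71000"),
    (220, 100, 360, 112, "Scoville: 55000–75000"),
    (400, 100, 540, 112, "Scoville: 25000–40000"),
    (40, 116, 180, 128, "Moisture Level: 50.0–55.0%"),
    (220, 116, 360, 128, "Moisture Level: 45.0–60.0%"),
    (400, 116, 540, 128, "Moisture Level: 70.0–80.0%"),
]


def row_major(blocks):
    """Read line by line across the full page width (columns get blended)."""
    return [b[4] for b in sorted(blocks, key=lambda b: (b[1], b[0]))]


def column_major(blocks, tolerance=20):
    """Group blocks by similar left edge, then read each group top to bottom."""
    columns = []  # list of (left edge of the column, blocks in that column)
    for b in sorted(blocks, key=lambda b: b[0]):
        if columns and abs(b[0] - columns[-1][0]) <= tolerance:
            columns[-1][1].append(b)
        else:
            columns.append((b[0], [b]))
    return [b[4] for _, col in columns for b in sorted(col, key=lambda b: b[1])]


print(" ".join(row_major(blocks)))     # all Scoville values, then all moisture values
print(" ".join(column_major(blocks)))  # Scoville and moisture paired per product
```

The `tolerance` value is an arbitrary assumption. A real page would need a more robust column detection than comparing left edges.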
pymupdf does fine in that sense: