pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

0.0.17 seems to output no text #150

Closed: dentro-innovation closed this issue 1 month ago

dentro-innovation commented 1 month ago

This is my output with 0.0.17:

-----

-----

-----

-----

-----

-----

-----

-----

-----

-----

-----

-----

-----

-----

-----

-----

-----

-----

-----

-----

-----

This is my output with 0.0.16 (just first page):

# N4-MPXH
#### MANUAL DE INSTALADOR

 Manuales  Residencias

### DESCRIPCIÓN GENERAL

Central de alarma de 4 zonas, ampliable a 32 (8 particiones de 4 zonas), con zócalos plug-in integrados para la conexión de equipos inalámbricos

y de comunicación. Códigos personales para 30 usuarios (expandible a 240) y registrador de eventos.

Características

-  4 zonas MPXH y cableadas, más dos zonas de pánico y sabotaje.

-  Protección ante eventos de robo, asalto, incendio, pánico, sabotaje y

emergencias médicas

-  Compatible con toda la línea MPXH y sensores convencionales

-  Operable desde toda la línea de teclados de X-28 Alarmas y transmisores Imprimir

remotos para residencias

-  Zócalos para módulos plug-in

-  Capacidad para almacenar hasta 30 usuarios y códigos personales

-  Compatible con la app Mi Alarma X-28 mediante los comunicadores de la línea

WIFICOM, WIFICEL y COM30-MPXH

### IDENTIFICACIÓN DE LAS PARTES

1 Tapa de la central

2 Tornillos de fijación de la tapa

-----

I'm using Python 3.12.2 on Ubuntu.
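The symptom above (page separators with nothing between them) can be checked mechanically. The following is a minimal sketch, assuming each page chunk in the Markdown ends with a `-----` line, as in the outputs quoted in this thread. The helper names `split_pages`, `has_text`, and `pages_without_text` are made up for illustration and are not part of pymupdf4llm.

```python
import re

# Matches a line that is only a Markdown image reference, e.g. ![](input.pdf-0-0.png)
IMAGE_LINE = re.compile(r"^!\[[^\]]*\]\([^)]*\)$")


def split_pages(markdown: str, separator: str = "-----") -> list[str]:
    """Split Markdown into page chunks, assuming each page ends with a separator line."""
    pages, current = [], []
    for line in markdown.splitlines():
        if line.strip() == separator:
            pages.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Keep trailing content only if something follows the last separator.
    if any(line.strip() for line in current):
        pages.append("\n".join(current))
    return pages


def has_text(page: str) -> bool:
    """True if the chunk has a non-blank line that is not just an image reference."""
    return any(
        line.strip() and not IMAGE_LINE.match(line.strip())
        for line in page.splitlines()
    )


def pages_without_text(markdown: str) -> list[int]:
    """Return 0-based indices of page chunks that contain no text."""
    return [i for i, page in enumerate(split_pages(markdown)) if not has_text(page)]


sample = "# Title\n\ntext\n\n-----\n\n-----\n\n![](input.pdf-2-0.png)\n\n-----\n"
print(pages_without_text(sample))
```

Running a check like this on the output of two versions makes a regression visible without reading the file by eye. Image-only pages are counted as having no text.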

dentro-innovation commented 1 month ago

Same behavior with Python 3.10.12: no text in the output file with 0.0.17, but the expected output with 0.0.16.

However, when setting write_images=True in 0.0.17, the images are referenced, and the image extraction is far better than with 0.0.16, which extracts whole pages as images.

Output with 0.0.17 and write_images=True:

![](input.pdf-0-0.png)

-----

![](input.pdf-1-0.png)

-----

![](input.pdf-2-0.png)

-----

![](input.pdf-3-0.png)

![](input.pdf-3-1.png)

-----

![](input.pdf-4-0.png)

![](input.pdf-4-1.png)

-----

![](input.pdf-5-0.png)

![](input.pdf-5-1.png)

-----

![](input.pdf-6-0.png)

![](input.pdf-6-1.png)

![](input.pdf-6-2.png)

-----

-----

-----

-----

-----

-----

-----

-----

-----

-----

-----

-----

![](input.pdf-18-0.png)

-----

![](input.pdf-19-0.png)

![](input.pdf-19-1.png)

-----

-----

JorjMcKie commented 1 month ago

Where is the reproducing file please?

dentro-innovation commented 1 month ago

My bad, I forgot to attach it.

By the way, the behavior is the same on localhost and on an Ubuntu server that I just tried.

PDF in question:

input.pdf

I also have more PDFs from the same manual provider that don't work properly, if you want to test with more.

JorjMcKie commented 1 month ago

We are now deliberately ignoring text with a font size smaller than 3. Do you need such text?

dentro-innovation commented 1 month ago

Oh I see.

Well, yes, I'd need it for this use case. I got this PDF by "printing" it from this website: https://manuales.x-28.com/m/N4-MPXH/1/instalador.html Maybe such small fonts occur often when printing websites this way?

Perhaps the user could decide the minimum font size from which pymupdf4llm exports text?
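To illustrate the idea of a user-chosen font size threshold, here is a minimal sketch. It is not how pymupdf4llm implements its filter. It assumes the nested layout that PyMuPDF's `page.get_text("dict")` returns (blocks, then lines, then spans, each span carrying `size` and `text`). The function name `text_above_size` and the parameter `min_size` are hypothetical.

```python
def text_above_size(page_dict: dict, min_size: float = 3.0) -> tuple[str, int]:
    """Return (text, skipped_span_count) for spans with font size >= min_size.

    page_dict is assumed to follow the structure of PyMuPDF's
    page.get_text("dict"): blocks -> lines -> spans.
    """
    kept_lines = []
    skipped = 0
    for block in page_dict.get("blocks", []):
        if block.get("type", 0) != 0:  # type 0 is a text block; skip image blocks
            continue
        for line in block.get("lines", []):
            parts = []
            for span in line.get("spans", []):
                if span["size"] < min_size:
                    skipped += 1  # too small: ignored, but counted for diagnosis
                else:
                    parts.append(span["text"])
            if parts:
                kept_lines.append("".join(parts))
    return "\n".join(kept_lines), skipped


# Hand-built stand-in for one page, so the sketch runs without a PDF.
page = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [
            {"size": 2.5, "text": "tiny"},
            {"size": 11.0, "text": "normal"},
        ]}]},
        {"type": 1},  # image block, no lines
    ]
}
print(text_above_size(page))
print(text_above_size(page, min_size=0))
```

With PyMuPDF installed, passing `page.get_text("dict")` for each page and looking at the skipped count would show whether a document's text really falls below the threshold.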

JorjMcKie commented 1 month ago

Before we jump to conclusions: here is a script that does print text. Maybe the margins value plays the major role:

import pathlib

import pymupdf
import pymupdf4llm

doc = pymupdf.open("input.pdf")
md = pymupdf4llm.to_markdown(
    doc,
    margins=0,
)
pathlib.Path(doc.name + ".md").write_bytes(md.encode())

dentro-innovation commented 1 month ago

Before I confirm anything: I have no idea how PyMuPDF works under the hood, apart from running the basic commands.

This script seems to work marvelously! It does add a bit too much whitespace in front of a sentence, but it works on 0.0.17!
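If the extra leading whitespace is a problem downstream, it can be removed after conversion. The following is a minimal post-processing sketch, not a pymupdf4llm feature; the function name `strip_leading_whitespace` is made up for illustration. It leaves fenced code blocks untouched, but note that it also flattens the indentation of nested lists.

```python
def strip_leading_whitespace(markdown: str) -> str:
    """Remove leading whitespace from every line outside fenced code blocks."""
    out = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # a fence line opens or closes a code block
            out.append(line)
        elif in_fence:
            out.append(line)  # keep code indentation as-is
        else:
            out.append(line.lstrip())
    return "\n".join(out)


sample_md = "   Hello\n```\n    code\n```\n  - item"
print(strip_leading_whitespace(sample_md))
```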