pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

A title with various font sizes #159

Closed Fianax closed 20 hours ago

Fianax commented 1 day ago

Hello again. I wanted to tell you about a new behavior with titles.

In pdf4llm==0.0.7, when a title contained two different font sizes, as in this PDF:

prueba_indice_distinto_tamaño.pdf

the result was this:

# 1 **hello**

I thought that result was correct, but pdf4llm==0.0.9 gives the following (result attached as Markdown):

prueba_indice_distinto_tamaño_new_markdown.md

That also makes sense, I think, but this form breaks the title because it splits it into two parts. I also imagine that a longer title with more parts in different sizes would be split into even more pieces.

Is the way version 0.0.9 splits the title a good result?

JorjMcKie commented 22 hours ago

Sorry, but there are limits to what header recognition can detect and convert. What we have is a simple font size analysis, not an engine with semantic analysis in the background.

You can either completely omit header recognition via hdr_info=False, or provide your own callback function which returns a string with multiple "#" characters upon being provided a text span and its Page object.

The same argument holds true for header / footer identification: it is not there at all. All you can do is exclude rectangular stripes representing margins that you have to specify yourself. There is no semantic analysis available.
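
To make the margin idea concrete, here is a minimal sketch in plain Python. It is not pymupdf4llm's implementation: it only shrinks a page rectangle by four margins and keeps the spans whose bounding box lies fully inside. The helper names `content_rect` and `keep_span`, the (left, top, right, bottom) margin order, and the sample coordinates are all assumptions made for illustration.

```python
def content_rect(page_width, page_height, margins):
    """Return the (x0, y0, x1, y1) area left after cutting off the margins.

    margins is assumed to be (left, top, right, bottom) in points.
    """
    left, top, right, bottom = margins
    return (left, top, page_width - right, page_height - bottom)


def keep_span(bbox, rect):
    """True if the span's bounding box lies completely inside rect."""
    x0, y0, x1, y1 = bbox
    rx0, ry0, rx1, ry1 = rect
    return x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1


# Hypothetical A4-sized page with 50-point stripes cut at top and bottom.
rect = content_rect(595, 842, (0, 50, 0, 50))

# Hypothetical spans: a running header, body text, and a page number.
spans = [
    {"text": "Page header", "bbox": (72, 20, 200, 32)},
    {"text": "Body text", "bbox": (72, 100, 300, 112)},
    {"text": "Page 3", "bbox": (280, 810, 320, 822)},
]
kept = [s["text"] for s in spans if keep_span(s["bbox"], rect)]
```

Only the body span survives here. To my knowledge `to_markdown` accepts a `margins` argument for this purpose, but check the signature of your installed version before relying on it.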

Fianax commented 21 hours ago

Thank you.

But what I don't understand is what exactly is passed to hdr_info to change the title detection logic. I know it is a method that has to return something for the to_markdown method to take into account.

Do you have any very simple example of how it could be?

JorjMcKie commented 21 hours ago

You must define either a function ("callable") or an object with a method called get_hdr_info. This callable is invoked with 2 parameters, a text span and its Page object. It must return either "" ("this is body text / no header"), or "#...# ". The number of "#" in that string determines the header level - corresponding to HTML's header tags h1, ..., h6.

For the documentation of a text span see here.
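
A minimal sketch of such a callable, assuming the span is the usual PyMuPDF span dictionary with a `"size"` key holding the font size. The function name `size_based_headers` and the thresholds 20, 16 and 13 are made up for illustration and would need tuning for a real document.

```python
def size_based_headers(span, page=None):
    """Map a span's font size to a Markdown header prefix.

    Returns "" for body text, otherwise "#...# " with a trailing space.
    The size thresholds below are illustrative assumptions.
    """
    size = span["size"]
    if size >= 20:
        return "# "
    if size >= 16:
        return "## "
    if size >= 13:
        return "### "
    return ""
```

The `page` argument is unused here. It is kept in the signature because the callable is invoked with both the span and its Page object.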

JorjMcKie commented 21 hours ago

Example:

import pymupdf4llm

def my_headers(span, page=None):
    # Returning "" marks every span as body text, so no headers are emitted.
    return ""

md = pymupdf4llm.to_markdown(..., hdr_info=my_headers, ...)
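
The object form could look like the sketch below, which also touches on the original question: every span larger than an assumed body size gets the same header level, whatever its exact size. The class name `FlatHeaders`, its parameters, and the 12-point default are assumptions. Whether the converter then joins differently sized spans into a single header line is not something this thread confirms.

```python
class FlatHeaders:
    """Header logic that ignores size differences among large spans."""

    def __init__(self, body_limit=12.0, level=1):
        # Spans with a font size above body_limit count as headers.
        self.body_limit = body_limit
        # Clamp to 1..6, matching HTML's h1 to h6.
        level = max(1, min(level, 6))
        self.prefix = "#" * level + " "

    def get_hdr_info(self, span, page=None):
        """Return the fixed header prefix, or "" for body text."""
        return self.prefix if span["size"] > self.body_limit else ""
```

An instance would be passed in place of the function, for example `hdr_info=FlatHeaders(body_limit=12.0)`.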

Fianax commented 20 hours ago

Thank you very much. I'll close the issue.