Closed Fianax closed 20 hours ago
Sorry, but there are limits to what header recognition can detect and convert. What we have is a simple font size analysis, not an engine with semantic analysis in the background.
You can either omit header recognition completely via hdr_info=False, or provide your own callback function, which is given a text span and its Page object and returns a string with multiple "#" characters.
The same argument holds true for header / footer identification: it is not there at all. All you can do is exclude rectangular stripes representing margins, which you have to specify yourself. There is no semantic analysis available.
Thank you.
But what I don't understand is what exactly has to be passed as hdr_info to modify the index calculation logic. I know it is a method that has to return something for the to_markdown method to take into account.
Do you have any very simple example of how it could be?
You must define either a function ("callable") or an object with a method called get_hdr_info.
This callable is invoked with 2 parameters: a text span and its Page object.
It must return either "" ("this is body text / no header") or "#...# ". The number of "#" characters in that string determines the header level, corresponding to HTML's header tags h1, ..., h6.
For the documentation of a text span see here.
Example:
import pymupdf4llm

def my_headers(span, page=None):
    # Return "" for body text, or a prefix such as "## " for a level-2 header.
    return ""

md = pymupdf4llm.to_markdown(..., hdr_info=my_headers, ...)
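The example above treats every span as body text. As a rough sketch of a callback that actually assigns header levels, the following maps font sizes to "#" prefixes. It assumes the span is a dictionary with a "size" key (as in PyMuPDF's text span dictionaries); the helper name make_hdr_info, the size table, and the tolerance are hypothetical values chosen for illustration and would need to be adapted to the document at hand.

```python
def make_hdr_info(size_to_level, tolerance=0.5):
    """Build a header callback from a {font_size: header_level} table.

    Hypothetical helper: the sizes and the tolerance are illustrative
    assumptions, not values taken from the library.
    """
    def hdr_info(span, page=None):
        # Assumption: the span is a dict carrying its font size under "size".
        size = span.get("size", 0.0)
        for header_size, level in size_to_level.items():
            if abs(size - header_size) <= tolerance:
                # Header prefix: `level` "#" characters plus a trailing space.
                return "#" * level + " "
        # Anything else is body text.
        return ""
    return hdr_info


# Example table: 24pt -> h1, 18pt -> h2, 14pt -> h3 (made-up sizes).
my_headers = make_hdr_info({24.0: 1, 18.0: 2, 14.0: 3})
```

The resulting my_headers could then be passed as hdr_info=my_headers in the same way as in the example above.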
Thank you very much. I'll close the issue.
Hello again. I wanted to tell you about this new index behavior.
In version pdf4llm==0.0.7, when an index contained two different font sizes, as in this PDF:
prueba_indice_distinto_tamaño.pdf
the result was this:
I thought that result was correct, but version pdf4llm==0.0.9 gives the following (result attached as markdown):
prueba_indice_distinto_tamaño_new_markdown.md
That also makes sense, I think, but this form breaks the index because it splits it into two parts. I also imagine that if the title were longer and had more parts with different sizes, it would be split into even more parts.
Is the index division of version 0.0.9 a good result?