pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Issues with bullet points in PDFs #81

Closed Jaimish00 closed 1 week ago

Jaimish00 commented 2 months ago

Hello there,

First of all thanks for this amazing library 🙌

I am facing some issues with the bullet points in the generated markdown. I have tried several different kinds of bullet points to test if the markdown contains the bullet points and indented bullet points.

For example, I just created a Document file with some bullet points, which you can see below

[image: screenshot of the test document with bullet points]

I then exported this doc as a PDF and ran to_markdown on it, which produced this output:

Hello There\n\nI am testing the bullet points here\n\n○ Just to see if markdown is generated properly\n\n■ And is it able to keep the formatting intact\n\n\n-----\n\n

I have made a few observations from this output:

  1. The indentation is lost: the items are separated only by newlines (\n), with no tabs (\t) to mark the nesting level.
  2. The first-level bullet points are not rendered at all. As you can see in the first and second lines, no - is prepended to the text. Looking at the codebase, I saw that characters are compared against this bullet list: https://github.com/pymupdf/RAG/blob/8c0f5009f3d121a9679445b7b551318d77dd967c/pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py#L43-L51

Digging a bit deeper into the code, I found that sometimes the bullet characters do not even appear in the extracted text, so there is nothing to check against this bullet list: https://github.com/pymupdf/RAG/blob/8c0f5009f3d121a9679445b7b551318d77dd967c/pymupdf4llm/pymupdf4llm/helpers/pymupdf_rag.py#L505-L506
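As a possible stopgap for the second-level glyphs that do survive extraction (○ and ■ in the output above), here is a minimal post-processing sketch that rewrites a leading bullet glyph as a markdown "- " marker. The glyph set BULLET_GLYPHS and the function normalize_bullets are hypothetical names for illustration and are not part of pymupdf4llm. It cannot restore bullets that never reach the extracted text, and it cannot recover lost indentation.

```python
import re

# Hypothetical glyph set for illustration; this is NOT the library's own list.
BULLET_GLYPHS = "•●○■□▪▫◦‣⁃∙"

# Optional leading whitespace, one bullet glyph, then at least one space.
_BULLET_RE = re.compile(rf"^(\s*)[{re.escape(BULLET_GLYPHS)}]\s+")


def normalize_bullets(markdown: str) -> str:
    """Replace a leading bullet glyph on each line with a markdown '- ' marker.

    Existing leading whitespace is preserved; lines without a leading
    glyph are returned unchanged.
    """
    fixed = [_BULLET_RE.sub(r"\1- ", line) for line in markdown.split("\n")]
    return "\n".join(fixed)


sample = "Hello There\n\n○ Just to see\n\n■ And is it"
print(normalize_bullets(sample))
```

Running this on the string returned by to_markdown leaves "Hello There" untouched, since no glyph was extracted for the first-level items.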

Can anyone help me with this?

JorjMcKie commented 2 months ago

Thanks for the feedback!

Please always provide an example PDF page for problem reproduction. In your specific situation you might want to suggest additional bullet point characters to add to that list.

Jaimish00 commented 2 months ago

LLM - Bullet Points test.pdf

Sure, I've attached it. I just created this simple doc using Google Docs to try.

Moreover, I have been using this package for more complex cases that include parsing different kinds of PDFs, such as documentation and wiki pages. Those may contain other types of bullet points, so the current short bullet list might not be sufficient.
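To find out which extra bullet characters would be worth suggesting, one option is to count the symbols that start lines in already extracted text. The sketch below uses only the standard library; leading_symbol_counts is a hypothetical helper, and the rule "symbol or punctuation followed by whitespace" is an assumption that will also pick up ordinary dashes and similar markers.

```python
import unicodedata
from collections import Counter


def leading_symbol_counts(text: str) -> Counter:
    """Count line-initial characters that look like bullet markers.

    A character counts when its Unicode general category is a symbol (S*)
    or punctuation (P*) and it is immediately followed by whitespace.
    """
    counts = Counter()
    for line in text.splitlines():
        stripped = line.lstrip()
        if len(stripped) < 2:
            continue
        first, second = stripped[0], stripped[1]
        if unicodedata.category(first)[0] in "SP" and second.isspace():
            counts[first] += 1
    return counts


extracted = "○ one\n○ two\n■ three\nplain line\n- dash"
for glyph, n in leading_symbol_counts(extracted).most_common():
    print(f"U+{ord(glyph):04X} {glyph!r}: {n}")
```

The frequent glyphs in the result are candidates to report alongside an example PDF page.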

Jaimish00 commented 1 month ago

Hey @JorjMcKie

Any updates on this?

JorjMcKie commented 1 month ago

Yes - there is no current support for multi-level bullet points. This will not be implemented any time soon either.

The more basic issue (of not recognized bullets) is still under investigation. I am currently out of town, so bear with me for at least another week or maybe two.

JorjMcKie commented 1 week ago

Fixed in v0.0.17.