pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0
5.08k stars 489 forks

ADD strikethrough to Markdown and HTML export #3810

Open arisjr opened 2 weeks ago

arisjr commented 2 weeks ago

Is your feature request related to a problem? Please describe.

Yes. I'm building a RAG pipeline over a group of Brazilian laws, and I think the problem applies to the whole RAG/LLM community. (I'm new to RAG.)

Laws, legislation publications, and other documents that need to keep a change history normally don't simply erase text; they strike it through, as in the examples below:

https://www.planalto.gov.br/ccivil_03/_ato2004-2006/2006/decreto/d5948.htm https://www.justice.gov/oip/freedom-information-act-5-usc-552

These are HTML examples, but PDFs of these documents follow the same practice.

~~This is a markdown example that should not be counted.~~

When document parsers and loaders such as pymupdf4llm (when generating Markdown) and langchain's PyMuPDFLoader extract the text, they treat all of it the same. For RAG applications, I think that including strikethrough text in the data may lead the AI to false assumptions and give the analyst wrong results.

Describe the solution you'd like

I would like strikethrough to be added as a text type in Markdown and HTML export, so that the document loader can ignore strikethrough text if the programmer chooses.

Describe alternatives you've considered

First, add strikethrough to the text exports (Markdown and HTML) of the PyMuPDF Python libraries. Second, and also important, add the ability for PyMuPDF loaders to ignore strikethrough text if the programmer chooses to do so.

I think the second part (the loaders) belongs to other projects, such as langchain.
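
To illustrate the loader side of this request, here is a minimal sketch. It assumes a Markdown export that already marks struck text with GitHub-style `~~...~~`, which is what this issue asks for and not something the exporters are known to emit. The name `drop_struck_text` is hypothetical.

```python
import re

# Assumption: struck text arrives as ~~text~~ on a single line.
# Leading blanks before the span are consumed so no double space remains.
STRIKE_RE = re.compile(r"[ \t]*~~(?=\S)(.+?)(?<=\S)~~")


def drop_struck_text(markdown: str) -> str:
    """Return the Markdown with every ~~struck~~ span removed."""
    return STRIKE_RE.sub("", markdown)


sample = "Art. 1 applies. ~~Art. 2 is revoked.~~ Art. 3 applies."
print(drop_struck_text(sample))
```

A loader could run such a filter only when the programmer asks for it, leaving the default output unchanged.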

Additional context

None

jamie-lemon commented 2 weeks ago

@arisjr This seems like a reasonable feature request to me!

JorjMcKie commented 2 weeks ago

This is very complex to implement. I am afraid we won't be able to come up with a solution in the foreseeable future. The trivial case, where strikethrough annotations have been used, is unfortunately the least common one. Usually we instead see horizontal lines drawn (i.e. vector graphics). While PyMuPDF certainly can extract vector graphics, the hard part is to sub-select the relevant ones from all the rest, which may be gridlines of table cells, harmless underlines of other text and the like, which we definitely want to keep even when some text slightly overlaps them. So extracted text never carries a property like "I am strikethrough text". Instead, text pieces must be matched with horizontal line pieces.
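
A rough sketch of what "matching text pieces with horizontal line pieces" could mean geometrically. It works on plain `(x0, y0, x1, y1)` tuples, the same layout as the first four entries of a word from `page.get_text("words")`. The function name, the band width and the 1.1 underline cutoff are assumptions chosen for illustration, not values from PyMuPDF.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1); y grows downward


def classify_line(word: Box, line: Box, band: float = 0.25) -> str:
    """Guess what a horizontal line means for one word box."""
    wx0, wy0, wx1, wy1 = word
    lx0, ly0, lx1, ly1 = line
    height = wy1 - wy0
    # No horizontal overlap (or a degenerate word box): not related.
    if height <= 0 or lx1 <= wx0 or lx0 >= wx1:
        return "unrelated"
    # Relative vertical position of the line centre: 0 = top, 1 = bottom.
    rel = ((ly0 + ly1) / 2 - wy0) / height
    if abs(rel - 0.5) <= band:
        return "strikethrough"
    if 0.5 + band < rel <= 1.1:
        return "underline"
    return "unrelated"  # e.g. a table gridline above or far below the word


word = (10, 100, 60, 112)
print(classify_line(word, (8, 105.5, 62, 106.5)))
```

The sketch shows why underlines and gridlines are the hard part: only the vertical position separates them from a strikethrough, so the thresholds decide everything.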

arisjr commented 2 weeks ago

Thinking of a way to do it: perhaps with a flag, detect_strikethrough, that when enabled:

This linear algebra may add a lot of processing to the parsing (I don't know yet), but if the programmer chooses it, there must be a reason. Also, to keep it focused, it should only handle horizontal strikethrough. And there are many algorithmic improvements that can be made: for example, if a char is not within the line's range, don't even test it.

I don't know much about PDF parsing, and I haven't studied linear algebra in a long time, but maybe it's feasible. What do you think, @JorjMcKie?

I also don't know whether the project has made other attempts at this, but after reading your message I saw that there is a recurring need on the internet for this solution/feature.
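
The "don't even test it" idea can be sketched with a sorted index, so each line only visits words in its vertical neighbourhood. This is one possible pre-filter under stated assumptions: words are `(x0, y0, x1, y1, text)` tuples, lines are `(x0, y0, x1, y1)` tuples, and the quarter-height tolerance is a guess. `struck_word_indices` is a hypothetical name.

```python
from bisect import bisect_left, bisect_right


def struck_word_indices(words, lines):
    """Return sorted indices of words crossed near their middle by a line."""
    # Sort word indices by vertical centre so bisect can narrow the search.
    order = sorted(range(len(words)), key=lambda i: (words[i][1] + words[i][3]) / 2)
    centres = [(words[i][1] + words[i][3]) / 2 for i in order]
    max_half = max(((w[3] - w[1]) / 2 for w in words), default=0.0)
    hits = set()
    for lx0, ly0, lx1, ly1 in lines:
        if (ly1 - ly0) >= (lx1 - lx0):
            continue  # only horizontal lines are considered
        line_y = (ly0 + ly1) / 2
        lo = bisect_left(centres, line_y - max_half)
        hi = bisect_right(centres, line_y + max_half)
        for i in order[lo:hi]:  # words outside this slice are never tested
            x0, y0, x1, y1 = words[i][:4]
            in_x = lx0 < x1 and lx1 > x0
            in_y = abs(line_y - (y0 + y1) / 2) <= (y1 - y0) / 4
            if in_x and in_y:
                hits.add(i)
    return sorted(hits)


words = [
    (10, 100, 60, 112, "kept"),
    (70, 100, 120, 112, "struck"),
    (10, 120, 60, 132, "other"),
]
print(struck_word_indices(words, [(68, 105.5, 122, 106.5)]))
```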

JorjMcKie commented 2 weeks ago

@arisjr yes, thanks for your thoughts. The basic approach must obviously be along the lines of your sketch. I tried a few things yesterday. PyMuPDF does support algebraic operations for its geometry objects (points, rectangles, quads and matrices). But this is only a nice ingredient, "coding sugar", not a contribution to the solution.

The problem is how to differentiate vector graphics intended as strikethrough from others, as I think I mentioned. Strikethrough lines are not line vector graphics but thin rectangles, like all the "lines" generated by Office-to-PDF export software (MS Word, LibreOffice, etc.).
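
As a sketch of the first filtering step, the snippet below keeps only filled, thin, wide rectangles from a list of path dictionaries. The keys `"type"`, `"fill"` and `"rect"` mirror those returned by `page.get_drawings()`, but the input here is plain dicts with tuple rectangles, and the thresholds `max_height` and `min_aspect` are assumptions that would need tuning per document.

```python
def thin_horizontal_fills(paths, max_height=2.0, min_aspect=5.0):
    """Return bounding boxes of fill-only paths that look like drawn 'lines'."""
    candidates = []
    for path in paths:
        # Keep fill-only paths; stroked outlines are skipped in this sketch.
        if path.get("type") != "f" or path.get("fill") is None:
            continue
        x0, y0, x1, y1 = path["rect"]
        width, height = x1 - x0, y1 - y0
        if 0 < height <= max_height and width >= min_aspect * height:
            candidates.append((x0, y0, x1, y1))
    return candidates


paths = [
    {"type": "f", "fill": (0, 0, 0), "rect": (68, 105.5, 122, 106.5)},  # thin bar
    {"type": "s", "fill": None, "rect": (0, 0, 500, 700)},              # stroked frame
    {"type": "f", "fill": (1, 1, 0), "rect": (10, 100, 60, 112)},       # highlight
]
print(thin_horizontal_fills(paths))
```

This only narrows the candidates. Table gridlines and underlines exported as thin rectangles pass the same test and still have to be separated by their position relative to the text.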

arisjr commented 2 weeks ago

It needs to start with something and can evolve from there. (MS Word and LibreOffice are very good starting points indeed!)

Maybe if this horizontal rectangle

Then it is a strikethrough.

The remaining horizontal rectangles should be something else, such as highlights, text boxes, or even an attempt to redact the text, and we should not bother with them at this moment.

I don't know whether the data for this comparison with the font height exists in the PDF structure or is available somewhere; I'm just brainstorming here. Sorry.
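
On the font-height question: the span dictionaries returned by `page.get_text("dict")` carry a `"size"` (font size) and a `"bbox"`, so a comparison is possible in principle. The sketch below walks a dictionary of that shape and checks a rectangle's thickness against the font size. The ratio bounds and both function names are assumptions, not criteria from this thread.

```python
def iter_spans(page_dict):
    """Yield every text span from a get_text("dict")-shaped dictionary."""
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            yield from line.get("spans", [])


def plausible_strike_thickness(span, rect, min_ratio=0.02, max_ratio=0.15):
    """True if the rectangle is thin relative to the span's font size."""
    size = span["size"]
    if size <= 0:
        return False
    thickness = rect[3] - rect[1]
    return min_ratio <= thickness / size <= max_ratio


# Hand-made stand-in for page.get_text("dict") output.
page_dict = {
    "blocks": [
        {"lines": [{"spans": [{"size": 12.0, "bbox": (70, 100, 120, 112), "text": "struck"}]}]},
        {"type": 1},  # an image block, skipped by iter_spans
    ]
}
span = next(iter_spans(page_dict))
print(plausible_strike_thickness(span, (68, 105.6, 122, 106.4)))
```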

arisjr commented 2 weeks ago

@JorjMcKie take a look at the code snippet I've made.

snipet-check-filterST.zip

I used code you wrote some time ago (on Stack Overflow) to show how to get rectangles and lines in a document.

You can correct it (if I made a mistake) or add more logic to it, such as checking the color of the rectangle (whether it's the same as the font's); I didn't know how to do that part.

For now it checks for and filters strikethrough text on a page. I tested it with a document made by "Creator Tool: Acrobat PDFMaker 21 para Word". It filters almost 100% of the strikethrough. Of course, it's only an idea.

But I also tested with a PDF printed from Mozilla Firefox on Linux, and it didn't find any lines or rectangles on the strikethrough... What could the strikethrough be?
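
For the color check, one possible approach is sketched here. In PyMuPDF a text span's `"color"` is an sRGB integer, while a drawing's `"fill"` is a tuple of floats between 0 and 1, so one of them has to be converted before comparing. The tolerance value and the function names are assumptions.

```python
def srgb_int_to_floats(value):
    """Convert an sRGB integer such as 0xFF0000 to an (r, g, b) float triple."""
    red = (value >> 16) & 0xFF
    green = (value >> 8) & 0xFF
    blue = value & 0xFF
    return (red / 255, green / 255, blue / 255)


def colors_close(span_color, fill, tol=0.1):
    """True if a span color and a drawing fill differ by at most tol per channel."""
    if fill is None or len(fill) != 3:
        return False
    return all(abs(a - b) <= tol for a, b in zip(srgb_int_to_floats(span_color), fill))


print(colors_close(0x000000, (0.05, 0.05, 0.05)))
```

A tolerance is used instead of strict equality because the two colors may be close without being identical.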

Office of Information Policy _ The Freedom of Information Act, 5 U.S.C. § 552.pdf

Regards

JorjMcKie commented 2 weeks ago

I am currently testing an algorithm that successfully matches horizontal "lines" with overlapping words. I say "lines" because they apparently are always borderless rectangles with some fill color. The fill color does not necessarily (exactly) match the strike-out text color. After applying appropriate redactions for strike-out words, the above example looks like this: [image]

Just to prevent unfounded hopes: there is no way to shift the "surviving" text into the vacated areas on the page. It will inevitably stay where it is.
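
The redaction step could look roughly like the sketch below. It is not the algorithm being tested above, only an illustration using `Page.add_redact_annot()` and `Page.apply_redactions()`. The helper `redact_words` is hypothetical and takes any page-like object, so it needs no import. The file names in the usage comment are placeholders.

```python
def redact_words(page, word_boxes):
    """Add one redaction per struck word box, then apply them all at once.

    `page` is expected to behave like a PyMuPDF Page; `word_boxes` are
    (x0, y0, x1, y1) rectangles of words already identified as struck.
    Returns the number of redactions added.
    """
    count = 0
    for box in word_boxes:
        page.add_redact_annot(box)
        count += 1
    if count:
        # Removes the text under the redaction rectangles; the remaining
        # text keeps its position, it is not reflowed.
        page.apply_redactions()
    return count


# Usage with PyMuPDF (placeholder file names):
#   import pymupdf
#   doc = pymupdf.open("input.pdf")
#   redact_words(doc[0], struck_boxes)
#   doc.save("output.pdf")
```

Text extracted after this step no longer contains the struck words, which is what a RAG loader needs, even though the page layout keeps the gaps.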

Another comment regarding HTML output: There is no way to achieve strike-out output! To confirm, please discuss this in the MuPDF Discord channel.