houfu / redlines

Show the differences between two strings/text as a compact text, in markdown/HTML, in the terminal and more.

https://houfu.github.io/redlines/

MIT License


Read two PDFs. Compare. Redline. #1

Open houfu opened 2 years ago

houfu commented 2 years ago

What I want to do

Given two pdfs, read the text found on them, and produce a redline.

How I might be able to do this.

Using a PDF library like pdfminer, produce a list of paragraphs from each PDF and compare them. Then produce a new PDF from the source and mark it up with the changes.
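A minimal sketch of the "list of paragraphs, then compare" step. It assumes pdfminer.six for extraction (its `extract_text` helper) and uses the standard library's `difflib.SequenceMatcher` as a stand-in for the comparison. The function names `split_paragraphs`, `compare_paragraphs`, and `extract_text_from_pdf` are hypothetical and not part of redlines. Writing the changes back into a new PDF is not covered here.

```python
import difflib
import re


def split_paragraphs(text: str) -> list[str]:
    """Split extracted text on blank lines and collapse whitespace inside each paragraph."""
    blocks = re.split(r"\n\s*\n", text)
    return [" ".join(block.split()) for block in blocks if block.strip()]


def compare_paragraphs(source: list[str], test: list[str]):
    """Return (tag, source_paragraphs, test_paragraphs) for every run that changed.

    tag is one of 'replace', 'delete', or 'insert', as reported by difflib.
    """
    matcher = difflib.SequenceMatcher(a=source, b=test, autojunk=False)
    return [
        (tag, source[i1:i2], test[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def extract_text_from_pdf(path: str) -> str:
    """Read the text layer of a PDF (assumes pdfminer.six is installed)."""
    # Imported lazily so the comparison helpers work without pdfminer.
    from pdfminer.high_level import extract_text

    return extract_text(path)
```

Usage would look something like `compare_paragraphs(split_paragraphs(extract_text_from_pdf("old.pdf")), split_paragraphs(extract_text_from_pdf("new.pdf")))`, where the file names are placeholders. Paragraphs that come back as `replace` could then be handed to a word-level redline.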

Limitations

OCR is probably a future feature.
Layout changes might be a future feature.

HRNPH commented 12 months ago

If you lay out a clear pipeline showing where this should go in the code, I can contribute that feature: mining and extracting text via OCR.

houfu commented 11 months ago

In my mind this is probably a very important and big feature. What's the minimum feature set? Read and extract only the text (without formatting and pagination) and compare? 🤔
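One possible shape for that minimum feature set, as a rough sketch only: pull the plain text out of both PDFs, discard page breaks and layout whitespace, and pass the two strings to `Redlines`. It assumes pdfminer.six (`extract_text`) for extraction and that page boundaries show up as form feed characters in its output. `normalize_extracted_text` and `redline_pdfs` are hypothetical names, not existing redlines functions.

```python
import re


def normalize_extracted_text(text: str) -> str:
    """Drop pagination and layout whitespace so only the words are compared."""
    # Assumption: the extractor marks page boundaries with form feeds.
    text = text.replace("\f", "\n")
    paragraphs = re.split(r"\n\s*\n", text)
    # Line breaks inside a paragraph are layout, so collapse them to spaces.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


def redline_pdfs(source_path: str, test_path: str) -> str:
    """Extract text from two text-based PDFs and return a markdown redline."""
    # Imported lazily: both packages are optional for the normalizer above.
    from pdfminer.high_level import extract_text
    from redlines import Redlines

    source = normalize_extracted_text(extract_text(source_path))
    test = normalize_extracted_text(extract_text(test_path))
    return Redlines(source, test).output_markdown
```

Keeping the normalizer separate from the extraction call means it can be tested on plain strings, and a different extractor could be swapped in later without touching the comparison.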

As for pipelines, that probably needs a bit of refactoring.

HRNPH commented 11 months ago

I think we should only handle text-based PDFs, using an existing PDF extractor, and not image PDFs (see https://www.javatpoint.com/python-libraries-for-pdf-extraction). Using OCR would be a waste of time, since the text would still need to be cleaned afterwards. Let's leave that extraction to other tools.

houfu commented 10 months ago

@HRNPH The latest commit (#28) provides an example pipeline for files. Are you still interested in taking a stab at PDF files? Let me know your thoughts (including which PDF library you are thinking of using)!

houfu commented 9 months ago

Now open to others to try before I do it myself lol.