Open houfu opened 2 years ago
If you define a clear pipeline showing where this should go in the code, I can contribute that feature: mining and extracting text via OCR.
In my mind this is probably a very important and big feature. What's the minimum feature set? Read and extract only the text (without formatting and pagination) and compare? 🤔
The pipelines may need a bit of refactoring for this.
I think we should only do this for text-based PDFs, using a PDF extractor, and not for image PDFs (see https://www.javatpoint.com/python-libraries-for-pdf-extraction). If we use OCR it'll be a waste of time, since the text still needs to be cleaned afterwards. Let's leave that extraction to other tools.
@HRNPH The latest commit (#28) provides an example pipeline for files. Are you still interested in taking a stab at PDF files? Let me know your thoughts (including which PDF library you are thinking of using)!
Now open to others to try before I do it myself lol.
What I want to do
Given two pdfs, read the text found on them, and produce a redline.
How I might be able to do this.
Using a PDF library like pdfminer, produce a list of paragraphs for each file and compare the two lists. Then produce a new PDF of the source and mark it up with the changes.
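A rough sketch of the "list of paragraphs and compare" step might look like the following. It is only an illustration, not the project's pipeline: the function names (`split_paragraphs`, `pdf_paragraphs`, `redline`) are hypothetical, splitting paragraphs on blank lines is an assumption about how the extracted text is laid out, and the comparison uses the standard library's `difflib` rather than whatever the project uses internally. Only `pdf_paragraphs` needs pdfminer.six installed, and writing the marked-up PDF is not covered.

```python
import difflib
import re


def split_paragraphs(text):
    """Split extracted text on blank lines and collapse inner whitespace.

    Assumption: paragraphs are separated by at least one blank line.
    """
    blocks = re.split(r"\n\s*\n", text)
    return [" ".join(block.split()) for block in blocks if block.strip()]


def pdf_paragraphs(path):
    """Read a text-based PDF with pdfminer.six and return its paragraphs."""
    # Imported lazily so the comparison code works without pdfminer.six.
    from pdfminer.high_level import extract_text

    return split_paragraphs(extract_text(path))


def redline(old_paras, new_paras):
    """Return (tag, paragraph) pairs, where tag is 'equal', 'delete' or 'insert'."""
    matcher = difflib.SequenceMatcher(a=old_paras, b=new_paras, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.extend(("equal", p) for p in old_paras[i1:i2])
        else:
            # 'replace', 'delete' and 'insert' all reduce to removing the
            # old slice and adding the new slice (either may be empty).
            result.extend(("delete", p) for p in old_paras[i1:i2])
            result.extend(("insert", p) for p in new_paras[j1:j2])
    return result


if __name__ == "__main__":
    old = split_paragraphs("Intro.\n\nThe fee is 100 dollars.\n\nEnd.")
    new = split_paragraphs("Intro.\n\nThe fee is 200 dollars.\n\nEnd.")
    for tag, para in redline(old, new):
        marker = {"equal": " ", "delete": "-", "insert": "+"}[tag]
        print(marker, para)
```

With real files, the two lists would come from `pdf_paragraphs("old.pdf")` and `pdf_paragraphs("new.pdf")` (placeholder file names). Comparing whole paragraphs keeps the sketch short; a finer-grained redline would diff the words inside each changed paragraph as well.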
Limitations