the-paperless-project / paperless

Scan, index, and archive all of your paper documents
GNU General Public License v3.0

[Feature] - Templates for OCR (Zonal OCR) using KULL #701

Closed swissbyte closed 3 years ago

swissbyte commented 3 years ago

Hi

I already saw your pinned post about automation. Looks great! I need exactly that: a solution which automatically sorts and tags my PDFs according to their content.

Now I have a feature request: many of the expensive solutions offer Zonal OCR. They implement something called OCR templates. A template is nothing more than a file which defines several boxes in which OCR searches for text. One way to select such zones is this project:

https://jsoma.github.io/kull https://github.com/jsoma/kull

I have also recorded a short GIF (templateSelection).

Now the trick: there are multiple zones defined.
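As a rough sketch of what such a template with multiple zones could look like in code: the `Zone` class, the zone names, and the coordinates below are all hypothetical, and this is not the file format Kull exports. It assumes each box is stored as fractions of the page size so that one template works at any scan resolution.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Zone:
    """One named box on the page (hypothetical structure, for illustration)."""
    name: str
    # Relative coordinates between 0.0 and 1.0, measured from the top-left corner.
    left: float
    top: float
    right: float
    bottom: float

    def to_pixels(self, page_width: int, page_height: int) -> Tuple[int, int, int, int]:
        """Convert the relative box into a (left, top, right, bottom) pixel box."""
        return (
            round(self.left * page_width),
            round(self.top * page_height),
            round(self.right * page_width),
            round(self.bottom * page_height),
        )


# A made-up template for one invoice layout; the values are placeholders.
INVOICE_TEMPLATE = [
    Zone("sender", 0.05, 0.05, 0.50, 0.15),
    Zone("invoice_number", 0.60, 0.20, 0.95, 0.25),
    Zone("total", 0.60, 0.85, 0.95, 0.90),
]
```

Each pixel box could then be cropped out of the scanned page image and passed to the OCR engine on its own, giving one text snippet per zone.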

If we could run a RegEx or any other StrComp function on each zone individually, we would have an extremely powerful detection engine.

AND:

If we could also use some of the content of these fields as metadata, we would have one of the most powerful intelligent classification engines out there...
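To illustrate both ideas together, here is a minimal sketch in Python. It assumes the OCR step has already produced one text string per zone. The `RULES` dictionary, the `classify` function, the zone names, and the patterns are hypothetical and not part of paperless. A document matches the template only if every zone matches its pattern, and the named groups of the matches become the metadata.

```python
import re
from typing import Dict, Optional

# Hypothetical rules: one regular expression per zone name.
# Named groups mark the parts of the zone text to keep as metadata.
RULES = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|#)\s*(?P<invoice_number>\d+)", re.IGNORECASE
    ),
    "total": re.compile(
        r"Total\s*:?\s*(?P<total>\d+(?:\.\d{2})?)", re.IGNORECASE
    ),
}


def classify(zone_texts: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Return extracted metadata if every zone matches its rule, else None."""
    metadata: Dict[str, str] = {}
    for zone_name, pattern in RULES.items():
        match = pattern.search(zone_texts.get(zone_name, ""))
        if match is None:
            # One zone failed, so the document does not fit this template.
            return None
        metadata.update(match.groupdict())
    return metadata
```

For example, `classify({"invoice_number": "Invoice No. 12345", "total": "Total: 99.50"})` would return the invoice number and the total as strings, which could then be mapped to tags or a correspondent.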

What do you think? I have not looked through the code yet, but if possible, I would like to help implement this feature.

Thank you very much.