the-paperless-project / paperless

Scan, index, and archive all of your paper documents
GNU General Public License v3.0

Trim whitespace characters when getting text from PDF. #712

Closed grembo closed 3 years ago

grembo commented 3 years ago

pdftotext[0] returns one entry per page, and for an empty page that entry is just a newline plus whitespace. Joining these entries with newlines produces whitespace-only text that grows with every page.

Once a document has enough pages, this whitespace exceeds the 50-character limit for skipping OCR (unless PAPERLESS_OCR_ALWAYS is enabled). As a result, larger documents are no longer OCRed and end up consisting of just a few lines of whitespace.

By stripping the result of pdftotext, text consisting only of such whitespace is reduced to an empty string, so OCR can still happen. As a side effect, the text retrieved from pdftotext is also a bit cleaner.
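To illustrate the effect, here is a minimal sketch that does not call pdftotext at all. It simulates the per-page output of a scanned PDF as a list of whitespace-only strings. The names `join_pages`, `ocr_is_skipped`, and `OCR_SKIP_THRESHOLD`, the exact page content, and the `>` comparison are assumptions made for illustration, not the actual paperless code.

```python
OCR_SKIP_THRESHOLD = 50  # characters; the exact comparison used is an assumption


def join_pages(pages, strip=False):
    """Join per-page text with newlines, optionally stripping the result."""
    text = "\n".join(pages)
    return text.strip() if strip else text


def ocr_is_skipped(text):
    """Assumed check: OCR is skipped once the extracted text exceeds the limit."""
    return len(text) > OCR_SKIP_THRESHOLD


# Hypothetical extractor output for scanned PDFs without a text layer:
# one whitespace-only string per page.
short_doc = [" \n"] * 3
long_doc = [" \n"] * 20

for name, pages in (("short", short_doc), ("long", long_doc)):
    raw = join_pages(pages)
    stripped = join_pages(pages, strip=True)
    print(name, len(raw), ocr_is_skipped(raw), len(stripped), ocr_is_skipped(stripped))
# short 8 False 0 False
# long 59 True 0 False
```

In this simulation the 20-page document crosses the threshold purely through accumulated whitespace, while the stripped result is empty and would still be sent to OCR.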

I also considered trimming each page and leaving out empty ones, but simply stripping the result seemed less intrusive.
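For comparison, the per-page alternative could look roughly like the sketch below. `join_nonempty_pages` is a hypothetical helper and the sample pages are made up; this is not code from the PR.

```python
def join_nonempty_pages(pages):
    """Strip each page and drop pages that are empty afterwards."""
    stripped = (page.strip() for page in pages)
    return "\n".join(page for page in stripped if page)


# Made-up sample: two blank pages mixed with two pages of text.
pages = [" \n", "Invoice 42\n", " \n", "  Total: 10 EUR \n"]
print(repr(join_nonempty_pages(pages)))
# 'Invoice 42\nTotal: 10 EUR'
```

This changes the text of every page rather than only the ends of the joined result, which is why it is the more intrusive option.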

[0] Tested with pdftotext 2.1.4.