pdftotext[0] returns an entry for every page, even an empty one (a
newline plus whitespace). Joining these entries with newlines produces
text that grows with each page.

Once a document has enough pages, this whitespace exceeds the
50-character threshold for skipping OCR (unless PAPERLESS_OCR_ALWAYS is
enabled). As a result, larger documents are no longer OCRed and their
content consists of just a few lines of whitespace.

Stripping the result of pdftotext collapses whitespace-only text to an
empty string, so OCR can still happen. As a side effect, the text
retrieved from pdftotext is a bit nicer too.

I also considered trimming each page and leaving empty ones out, but
simply stripping the result seemed less intrusive.
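
A minimal sketch of the effect described above, not the actual parser
code. The page list stands in for pdftotext output, and the function
names, the exact whitespace per page, and the OCR_SKIP_THRESHOLD constant
are assumptions made for illustration:

```python
# Character count above which OCR is skipped (value taken from the text above).
OCR_SKIP_THRESHOLD = 50


def join_pages(pages):
    """Old behaviour: join page texts with newlines, keep all whitespace."""
    return "\n".join(pages)


def join_pages_stripped(pages):
    """New behaviour: join, then strip leading and trailing whitespace."""
    return "\n".join(pages).strip()


def join_nonempty_pages(pages):
    """Alternative that was considered: trim each page, drop empty ones."""
    trimmed = (page.strip() for page in pages)
    return "\n".join(page for page in trimmed if page)


# Hypothetical stand-in for a 30-page scan without a text layer:
# every page yields only a newline plus some whitespace.
empty_pages = ["\n  "] * 30

raw = join_pages(empty_pages)
stripped = join_pages_stripped(empty_pages)

print(len(raw) > OCR_SKIP_THRESHOLD)       # True: OCR would be skipped
print(len(stripped) > OCR_SKIP_THRESHOLD)  # False: OCR still runs
```

Stripping once after the join leaves the whitespace between pages that
do contain text untouched, which is why it is the less intrusive option.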
[0] Tested with pdftotext 2.1.4.