Incorrect page number if PDF has a textless page

phiresky / ripgrep-all

rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.


Incorrect page number if PDF has a textless page #106

Closed nod5 closed 1 year ago

nod5 commented 3 years ago

rga outputs incorrect page numbers for matches in PDF files that have at least one textless page after the first page but before the text matching the search pattern, for example a PDF in which some pages contain only images.

More specifically, when the pdftotext output has two or more adjacent form feed characters (HEX 0c0c), rga seems to incorrectly interpret them as a single page break.

I don't know Rust, but I guess the code around this location is relevant: https://github.com/phiresky/ripgrep-all/blob/master/src/adapters/postproc.rs#L160

This test PDF file, c.pdf, has two blank pages followed by a page with the text "hello world". The command rga hello c.pdf outputs Page 2: HelloWorld, but the correct page number is 3.
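To make the expected behavior concrete, here is a minimal Rust sketch of page labeling that treats every form feed as its own page break. The function name label_pages and its output format are hypothetical, chosen only for illustration; this is not the code in postproc.rs.

```rust
/// Prefix every non-empty line with the 1-based page it appears on,
/// treating each form feed (0x0c) as exactly one page break.
/// Hypothetical helper for illustration; not rga's actual code.
fn label_pages(text: &str) -> Vec<String> {
    // Emit the pending line (if it has any visible text) and reset it.
    fn flush(page: usize, current: &mut String, out: &mut Vec<String>) {
        let line = current.trim_end();
        if !line.is_empty() {
            out.push(format!("Page {}: {}", page, line));
        }
        current.clear();
    }

    let mut page = 1usize;
    let mut current = String::new();
    let mut out = Vec::new();
    for ch in text.chars() {
        match ch {
            // Every form feed advances the page, even when several are adjacent.
            '\x0c' => {
                flush(page, &mut current, &mut out);
                page += 1;
            }
            '\n' => flush(page, &mut current, &mut out),
            _ => current.push(ch),
        }
    }
    flush(page, &mut current, &mut out);
    out
}
```

With input modeled on the test file (two form feeds followed by the text), the line is attributed to page 3 rather than page 2.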

Environment:
Windows 10 Pro 20H2
ripgrep_all-v0.9.6-x86_64-pc-windows-msvc.zip
poppler v0.89.0 from https://community.chocolatey.org/packages/poppler

Update: the issue also reproduces on Raspberry Pi OS.

nod5 commented 3 years ago

I poked around poppler's pdftotext source for a temporary workaround. On Linux, two small edits are enough.

First, on line TextOutputDev.cc#L5077, change eop[8] to eop[16]. Next, insert this line at TextOutputDev.cc#L5102:

eopLen += uMap->mapUnicode(0x0a, eop + eopLen, sizeof(eop) - eopLen);

Building poppler with those edits makes pdftotext output a form feed character followed by a Linux newline character (HEX 0a) for each page break in the input PDF. With that, rga detects page numbers correctly.
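For readers who want to see what the patched output looks like at the byte level, the sketch below applies the same transformation to text that has already been extracted: it appends 0a after every 0c. The function name newline_after_form_feeds is hypothetical and is not part of poppler or rga, and the sketch assumes form feed bytes occur only as page breaks.

```rust
/// Insert a line feed (0x0a) after every form feed (0x0c) in raw
/// pdftotext output, mimicking the end-of-page sequence the patched
/// build emits. Hypothetical helper for illustration only.
fn newline_after_form_feeds(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    for &b in input {
        out.push(b);
        if b == 0x0c {
            // Put each page break on its own line.
            out.push(0x0a);
        }
    }
    out
}
```

Applied to the bytes 0c 0c followed by "hello world", this yields 0c 0a 0c 0a followed by the text, so each page break ends up on a separate line.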

I think similar workarounds for poppler on Windows/Mac would involve adding a switch section for eop similar to that on TextOutputDev.cc#L5089 for different newline conventions.

I don't know how to build poppler on Windows, however, so I will probably have to wait for a fix in rga.

phiresky commented 3 years ago

The Rust code that does the replacing is pretty hacky since I couldn't figure out how to do it fast and clean. It shouldn't be too hard to fix in the Rust code though (around the place you linked above).

nod5 commented 3 years ago

Ok. I first tested a pdftotext workaround that added a form feed and a space for each page break, but that didn't help. So I assume the issue isn't adjacent form feed chars but rather multiple form feed chars on the same line in the pdftotext output. I guess the rga code currently stops reading a line once its first form feed char is detected?
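The guess above can be modeled in a few lines of Rust. This is only a sketch of the hypothesized behavior, in which the page counter advances at most once per line; the function name page_of_match_per_line is made up for illustration, and the code is an assumption, not rga's actual implementation.

```rust
/// Model of the hypothesized behavior: the page counter advances at
/// most once per line, no matter how many form feeds (0x0c) the line
/// contains. Returns the page reported for the first line containing
/// `needle`. An assumption for illustration, not rga's real code.
fn page_of_match_per_line(text: &str, needle: &str) -> Option<usize> {
    let mut page = 1usize;
    for line in text.split('\n') {
        if line.contains('\x0c') {
            // Only one increment, even for a line holding "\x0c\x0c".
            page += 1;
        }
        if line.contains(needle) {
            return Some(page);
        }
    }
    None
}
```

Under this model, two form feeds on the same line as the match report page 2, whether or not a space separates them, while a newline after each form feed reports page 3. That is consistent with the observations in this thread, though it does not prove the guess about rga's internals.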

nod5 commented 3 years ago

One more note, in case others also want a temporary workaround for Windows.

We can modify and compile pdftotext (poppler) on Windows 10 through MSYS2. Two small edits were enough here too.

First, on line TextOutputDev.cc#L5077, change eop[8] to eop[24]. Next, insert these two lines at TextOutputDev.cc#L5102:

eopLen += uMap->mapUnicode(0x0d, eop + eopLen, sizeof(eop) - eopLen);
eopLen += uMap->mapUnicode(0x0a, eop + eopLen, sizeof(eop) - eopLen);

Building poppler with those edits makes pdftotext output a form feed character followed by the Windows newline characters (HEX 0d 0a) for each page break in the input PDF. With that, rga detects page numbers correctly.

(I noticed that the extra line with 0d is not actually needed for rga to detect page breaks: Linux newline chars are enough for rga even on Windows 10. But I suppose leaving it out might cause issues in some other pdftotext use cases.)