Closed nod5 closed 1 year ago
I poked around poppler's pdftotext source for a temporary workaround. On Linux, two small edits are enough.
First, on line TextOutputDev.cc#L5077, change eop[8] to eop[16].
Next, insert this line at TextOutputDev.cc#L5102:
eopLen += uMap->mapUnicode(0x0a, eop + eopLen, sizeof(eop) - eopLen);
Building poppler with those edits makes pdftotext output a form feed character followed by a Linux newline character (hex 0a) for each page break in the input PDF. With that, rga detects page numbers correctly.
I think similar workarounds for poppler on Windows/Mac would involve adding a switch section for eop, similar to the one at TextOutputDev.cc#L5089, to handle the different newline conventions.
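To make the idea concrete, here is a minimal sketch of what such a switch would select, written in Rust as a stand-in rather than as poppler's actual C++. The `Eol` enum and `end_of_page_marker` function are hypothetical names invented for this illustration; the three conventions mirror pdftotext's `-eol unix|dos|mac` option.

```rust
/// Newline conventions, mirroring pdftotext's `-eol unix|dos|mac` option.
/// (Hypothetical type for this sketch; not part of poppler or rga.)
#[derive(Clone, Copy)]
enum Eol {
    Unix,
    Dos,
    Mac,
}

/// Hypothetical end-of-page marker: a form feed (0x0c) followed by the
/// newline bytes of the chosen convention.
fn end_of_page_marker(eol: Eol) -> &'static [u8] {
    match eol {
        Eol::Unix => b"\x0c\n",  // 0c 0a
        Eol::Dos => b"\x0c\r\n", // 0c 0d 0a
        Eol::Mac => b"\x0c\r",   // 0c 0d
    }
}
```

The longest marker here is three bytes in a single-byte encoding, so the enlarged eop buffer sizes used in the workarounds leave room for wider output encodings.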
I don't know how to build poppler on Windows, however, so I will probably have to wait for a fix in rga.
The Rust code that does the replacing is pretty hacky since I couldn't figure out how to do it fast and clean. It shouldn't be too hard to fix in the Rust code though (around the place you linked above).
Ok. I first tested a pdftotext workaround that added a form feed and a space for each page break, but that didn't help. So I assume the issue isn't adjacent form feed chars but rather multiple form feed chars on the same line in the pdftotext output. I guess the rga code currently stops reading a line once its first form feed char is detected?
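The hypothesis can be sketched in Rust with two toy page counters. This is not rga's actual code: `page_once_per_line` and `page_counting_all` are hypothetical functions, and the input strings used with them only simulate pdftotext output for a PDF with two blank pages followed by a page containing "hello world".

```rust
/// Form feed (0x0c), which pdftotext emits at each page break.
const FORM_FEED: char = '\x0c';

/// Model of the suspected bug: the page counter advances at most once
/// per line, however many form feeds that line contains.
fn page_once_per_line(text: &str, needle: &str) -> Option<usize> {
    let mut page = 1;
    for line in text.lines() {
        if line.contains(FORM_FEED) {
            page += 1;
        }
        if line.contains(needle) {
            return Some(page);
        }
    }
    None
}

/// Possible fix: advance the counter once for every form feed on the line.
/// Assumes the matched text follows the form feeds on its line, as in the
/// outputs discussed in this thread.
fn page_counting_all(text: &str, needle: &str) -> Option<usize> {
    let mut page = 1;
    for line in text.lines() {
        page += line.matches(FORM_FEED).count();
        if line.contains(needle) {
            return Some(page);
        }
    }
    None
}
```

With the simulated unpatched output `"\x0c\x0chello world\n\x0c"`, `page_once_per_line` returns `Some(2)` and `page_counting_all` returns `Some(3)`. With a newline after each form feed, `"\x0c\n\x0c\nhello world\n\x0c\n"`, even the once-per-line counter returns `Some(3)`, which is consistent with the patched pdftotext helping while the form-feed-plus-space variant did not.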
One more note, in case others also want a temporary workaround for Windows.
We can modify and compile pdftotext (poppler) on Windows 10 through MSYS2. Two small edits were enough here too.
First, on line TextOutputDev.cc#L5077, change eop[8] to eop[24].
Next, insert these two lines at TextOutputDev.cc#L5102:
eopLen += uMap->mapUnicode(0x0d, eop + eopLen, sizeof(eop) - eopLen);
eopLen += uMap->mapUnicode(0x0a, eop + eopLen, sizeof(eop) - eopLen);
Building poppler with those edits makes pdftotext output a form feed character followed by the Windows newline characters (hex 0d 0a) for each page break in the input PDF. With that, rga detects page numbers correctly.
(I noticed that the extra line with 0d is actually not needed for rga to detect page breaks: Linux newline chars are enough for rga even on Windows 10. But I suppose leaving it out might cause issues in some other pdftotext use cases.)
rga outputs incorrect page numbers for matches in PDF files that have at least one textless page after the first page but before the text matching the search pattern, for example a PDF where some pages contain only images.
More specifically, when the pdftotext output has two or more adjacent form feed characters (hex 0c0c), rga seems to incorrectly interpret them as a single page break.
I don't know Rust, but I guess the code somewhere around this location is relevant: https://github.com/phiresky/ripgrep-all/blob/master/src/adapters/postproc.rs#L160
This test PDF file, c.pdf, has two blank pages followed by a page with the text "hello world". The command
rga hello c.pdf
outputs
Page 2: HelloWorld
but the correct page number is 3.
Environment: Windows 10 Pro 20H2, ripgrep_all-v0.9.6-x86_64-pc-windows-msvc.zip, poppler v0.89.0 from https://community.chocolatey.org/packages/poppler
Update: the issue is also reproduced on Raspberry Pi OS.