Why can't I search the contents of PDF files?

phiresky / ripgrep-all

rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.


Why can't I search the contents of PDF files? #142

Closed zhl111 closed 8 months ago

zhl111 commented 2 years ago

I don't know much about programming. Please tell me how to do this.

zhl111 commented 2 years ago

Here is a detailed explanation of the problem: https://github.com/phiresky/ripgrep-all/issues/47#issuecomment-1221336639

lafrenierejm commented 10 months ago

It looks like rga is finding matching lines in your PDF, and the failure occurs when printing those matches because they contain non-UTF-8 byte strings. AFAIK rga transparently re-encodes UTF-16 (but no other encoding) to UTF-8, so it's likely that:

1. the match contains something that is neither valid UTF-16 nor valid UTF-8,
2. rga attempts to print that match verbatim, and then
3. Rust's stdlib chokes when attempting to print the non-UTF-8 byte strings.
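
To illustrate the kind of failure described above, here is a minimal Rust sketch. It is not rga's actual code: `printable_match` is a hypothetical helper and the sample bytes are made up for the example. It only shows that strict UTF-8 decoding rejects such bytes, while lossy decoding replaces them with U+FFFD so the line can still be printed.

```rust
use std::io::{self, Write};

/// Hypothetical helper (not part of rga): decode a matched line for display,
/// replacing every invalid UTF-8 sequence with U+FFFD instead of failing.
fn printable_match(raw: &[u8]) -> String {
    String::from_utf8_lossy(raw).into_owned()
}

fn main() -> io::Result<()> {
    // Made-up sample: "caf" followed by 0xE9 ('é' in Latin-1).
    // A lone 0xE9 byte is not valid UTF-8.
    let raw: &[u8] = b"caf\xE9 menu";

    // Strict decoding rejects these bytes.
    assert!(std::str::from_utf8(raw).is_err());

    // Lossy decoding always yields a valid String, so it can be
    // written out as ordinary text.
    let line = printable_match(raw);
    let stdout = io::stdout();
    let mut out = stdout.lock();
    writeln!(out, "{line}")?;
    Ok(())
}
```

Whether rga should substitute replacement characters, escape the bytes, or skip the match is a design choice for the maintainers; the sketch only demonstrates one way to avoid passing invalid bytes to the output.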

phiresky commented 8 months ago

I'll assume this is a dupe of #47