lizmat / App-Rak

21st century grep / find / ack / ag / rg on steroids
Artistic License 2.0

Search in pdf files #48

Closed librasteve closed 3 months ago

librasteve commented 9 months ago

I have just released PDF::Extract, because I could not find a simple abstraction over the pdftotext and pdf2html CLI tools.

Anyway - based on my superficial understanding of rak, there is no option to open and read pdf files as text...?

Would it be of interest for me to write this as a PR and submit it back here? If so, could you please give me a bit of a steer on how it would integrate with the current rak implementation?

There are likely a few other, similar cases, e.g. Office docs (which I am not currently proposing), so a general notion of where and how to develop a "file suffix preprocessor" might be a good idea. For Office docs, maybe something like:

libreoffice --headless --convert-to "txt:Text (encoded):UTF8" mydocument.doc
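To make the "file suffix preprocessor" idea a little more concrete, here is a minimal sketch of a suffix-to-converter lookup. It is not how rak works internally; the function name `preprocessor_for` is hypothetical, and the only tools it names are the ones already mentioned in this issue.

```shell
# Hypothetical helper: print the name of the converter tool that would
# turn the given file into plain text, chosen purely by file suffix.
# Returns 1 (and prints nothing) when no preprocessor is known.
preprocessor_for() {
    case "$1" in
        *.pdf|*.PDF)
            printf '%s\n' "pdftotext" ;;
        *.doc|*.docx)
            printf '%s\n' "libreoffice" ;;
        *)
            return 1 ;;
    esac
}

# Example: report which files in the arguments would need preprocessing.
for f in "$@"; do
    if tool=$(preprocessor_for "$f"); then
        echo "$f: would be converted with $tool"
    fi
done
```

A suffix lookup like this is cheap, but it trusts the file name, so it would probably need to be paired with a check of the file's actual content.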

lizmat commented 3 months ago

Am looking at App::Rak again :-)

I'm looking at making this automatic if PDF::Extract is installed. I only need a way to reliably detect a PDF file. I assume the .pdf extension doesn't cut it in all cases?
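For what it's worth, one extension-independent check is to look for the `%PDF-` signature at the very start of the file. The sketch below assumes a strict reading (signature in the first five bytes), and the helper name `has_pdf_header` is made up for illustration. Some readers are said to tolerate leading junk before the signature, so this may reject files that still open elsewhere.

```shell
# Hypothetical helper: succeed if the file starts with the "%PDF-" signature.
has_pdf_header() {
    # Unreadable or missing files are simply "not a PDF" here.
    [ -r "$1" ] || return 1
    # Read the first 5 bytes; drop NUL bytes so the shell can hold the value.
    header=$(head -c 5 "$1" | tr -d '\0')
    [ "$header" = "%PDF-" ]
}

# Example: classify the files given on the command line.
for f in "$@"; do
    if has_pdf_header "$f"; then
        echo "$f: looks like a PDF"
    else
        echo "$f: no PDF signature"
    fi
done
```

This only inspects the header, so a truncated or otherwise damaged PDF would still pass.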

librasteve commented 3 months ago

PDF::Extract uses the poppler library

According to https://superuser.com/questions/580887/check-if-pdf-files-are-corrupted-using-command-line-on-linux

You can try doing it with pdfinfo (here on Fedora in the poppler-utils package). pdfinfo gets information about the PDF file from its dictionary, so if it finds it the file should be ok

for f in *.pdf; do
    if ! pdfinfo "$f" &> /dev/null; then
        echo "$f" is broken
    fi
done

librasteve commented 3 months ago

~ > pdfinfo /Users/xxx/Downloads/Policy.pdf
Creator:         GMC Software AG~Inspire Designer~9.0.41.0
Producer:        PDF
CreationDate:    Mon Jan  3 18:40:04 2022 GMT
Custom Metadata: no
Metadata Stream: no
Tagged:          no
UserProperties:  no
Suspects:        no
Form:            none
JavaScript:      no
Pages:           8
Encrypted:       no
Page size:       595.276 x 841.89 pts (A4)
Page rot:        0
File size:       1134532 bytes
Optimized:       no
PDF version:     1.4

or

I/O Error: Couldn't open file '/Users/xxx/Downloads/xxx': No such file or directory.

or

~ > pdfinfo /Users/xxx/Downloads/reading-c2e-v2.csv
Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table

tbrowder commented 3 months ago

On debian, the "file" system command seems pretty reliable for detecting pdf files.
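As a rough sketch of that approach, the check could compare the MIME type reported by `file` against `application/pdf`. The helper name `is_pdf_by_file` is hypothetical, and this assumes a `file` implementation that supports `-b` and `--mime-type`.

```shell
# Hypothetical helper: succeed if file(1) reports the PDF MIME type.
is_pdf_by_file() {
    # -b suppresses the file name, --mime-type prints only e.g. "application/pdf"
    mime=$(file -b --mime-type "$1" 2>/dev/null)
    [ "$mime" = "application/pdf" ]
}

# Example: list the arguments that file(1) considers to be PDFs,
# regardless of their extension.
for f in "$@"; do
    if is_pdf_by_file "$f"; then
        echo "$f"
    fi
done
```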

librasteve commented 3 months ago

I have bumped PDF::Extract to v0.0.3

You can now go

use PDF::Extract;

my $extract = Extract.new: file => '../resources/sample copy.pdf';

$extract.first = 0;
$extract.last = 2;
$extract.range: 0..1; 

say $extract.text;
say $extract.html;
say $extract.xml;

say $extract.so;   #test for PDF headers
say $extract.info;
say $extract.info<CreationDate>;

or, for this case

Extract.new(file => '../resources/sample copy.pdf').so;   # True

lizmat commented 3 months ago

This has now been implemented with --pdf-per-line, --pdf-per-file and --pdf-info, so closing now.