jrmuizel / pdf-extract

A rust library for extracting content from pdfs

Empty output file running extract example on a test pdf file #50

Open bogct0mculhl opened 1 year ago

bogct0mculhl commented 1 year ago

Hi, I'm trying to understand how to use your library, but I'm not able to run your example code correctly:

git clone https://github.com/jrmuizel/pdf-extract.git

cd pdf-extract

wget https://orimi.com/pdf-test.pdf

cargo run --example extract pdf-test.pdf

The output file is empty...

cat pdf-test.txt

Using pdftotext the output file is filled with text:

pdftotext -layout pdf-test.pdf

cat pdf-test.txt

PDF Test File

Congratulations, your computer is equipped with a PDF (Portable Document Format)
reader! You should be able to view any of the PDF documents and forms available on
our site. PDF forms are indicated by these icons:   or  .

Yukon Department of Education
Box 2703
Whitehorse,Yukon
Canada
Y1A 2C6

Please visit our website at: http://www.education.gov.yk.ca/

Thanks

joepio commented 1 year ago

Hi @bogct0mculhl! Do you want to use the code as a library or as a CLI executable? If you want to use it as a library, the easiest way to do so is probably this:

let bytes = std::fs::read("path/to/example.pdf").unwrap();
// extract_text_from_mem returns a Result, so unwrap it before using the text
let out = pdf_extract::extract_text_from_mem(&bytes).unwrap();
assert!(out.contains("Yukon Department of Education"));

jrmuizel commented 1 year ago

That PDF is encrypted, which is not currently supported. https://github.com/J-F-Liu/lopdf/issues/168

The extract example will now output a warning about it. https://github.com/jrmuizel/pdf-extract/commit/277fe7c5175eac65fda8dcabb960d1bd6e497505
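
For anyone who wants to check for this case before extracting, here is a rough sketch using only the standard library. It is not how pdf-extract or lopdf detect encryption. It assumes that an encrypted PDF carries an `/Encrypt` key in its trailer dictionary and simply scans the raw bytes for that name. The names `looks_encrypted` and `check_file` are made up for this example, and the scan can give false positives if the same byte sequence happens to appear elsewhere in the file.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Name of the trailer key that points at the encryption dictionary.
const NEEDLE: &[u8] = b"/Encrypt";

/// Heuristic (assumption, not the library's method): returns true if the
/// raw PDF bytes contain the `/Encrypt` name anywhere.
fn looks_encrypted(bytes: &[u8]) -> bool {
    bytes.windows(NEEDLE.len()).any(|window| window == NEEDLE)
}

/// Hypothetical helper: reads a file from disk and applies the heuristic.
fn check_file(path: &Path) -> io::Result<bool> {
    let bytes = fs::read(path)?;
    Ok(looks_encrypted(&bytes))
}
```

Calling `check_file(Path::new("pdf-test.pdf"))` before extraction would let a caller print its own warning instead of ending up with an empty output file.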