pzaich / doc_ripper

Parse text contents from common file formats
MIT License

Updates pdf ripper to print result to console so it gets stored in @text #10

Closed: weilandia closed 5 years ago

weilandia commented 5 years ago

Currently, when I run DocRipper::rip("path_to_file.pdf"), the pdftotext command writes its output to a file ("path_to_file.txt") and an empty string is returned. This change makes pdftotext print its output to the console instead, so the result is returned and @text has a value.
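As a rough illustration of the idea (not the actual doc_ripper code), pdftotext sends its text to stdout when the output file argument is "-", so the caller can capture it directly. The helper names `pdftotext_command` and `rip_pdf` below are hypothetical and exist only for this sketch.

```ruby
require "open3"

# Hypothetical helper: builds the argument list for pdftotext.
# Passing "-" as the text-file argument makes pdftotext write to stdout
# instead of creating "path_to_file.txt" next to the PDF.
def pdftotext_command(pdf_path)
  ["pdftotext", pdf_path, "-"]
end

# Hypothetical helper: runs pdftotext and returns the captured text.
# Assumes pdftotext is installed and on the PATH; returns an empty
# string if the command exits with a non-zero status.
def rip_pdf(pdf_path)
  stdout, status = Open3.capture2(*pdftotext_command(pdf_path))
  status.success? ? stdout : ""
end
```

Passing the command as an argument list rather than a single interpolated string avoids shell quoting problems with file names that contain spaces.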

pzaich commented 5 years ago

Thanks for the contribution! Do you mind adding a quick spec similar to https://github.com/pzaich/doc_ripper/blob/master/spec/doc_ripper/formats/sketch_ripper_spec.rb#L16? Thanks!

weilandia commented 5 years ago

👍

pzaich commented 5 years ago

@weilandia I'm getting test failures locally. Versions:

ruby 2.4.4p296 (2018-03-28 revision 63013) [x86_64-darwin17]
pdftotext version 3.03

For example, it looks like a line break is missing.

       expected: "A Simple PDF File\nThis is a small demonstration .pdf file just for use in the Virtual Mechanics tut.... And more text. And more text.\nBoring. More, a little more text. The end, and just as well.\n\n\f"
            got: "A Simple PDF File\nThis is a small demonstration .pdf file just for use in the Virtual Mechanics tut...t. And more text. And more text. Boring. More, a little more text. The end, and just as well.\n\n\f"

This library is not intended to maintain overall document structure -- it's just about extracting the raw text, so if you want to normalize the output in your tests to remove extra whitespace, that works for me.
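One possible way to normalize the output in a spec, sketched under the assumption that only the words matter and not the layout: collapse every run of whitespace into a single space before comparing. The helper name `normalize_whitespace` is hypothetical, and the two strings are shortened stand-ins for the expected and actual output above.

```ruby
# Hypothetical spec helper: collapses runs of whitespace (spaces, newlines,
# tabs, form feeds) into single spaces and trims both ends, so differences
# in line breaks between pdftotext versions no longer affect the comparison.
def normalize_whitespace(text)
  text.gsub(/\s+/, " ").strip
end

# Shortened stand-ins for the two outputs; they differ only in one line break.
expected = "And more text.\nBoring. The end, and just as well.\n\n\f"
got      = "And more text. Boring. The end, and just as well.\n\n\f"

same_text = normalize_whitespace(expected) == normalize_whitespace(got)
```

In an RSpec example, the same helper could wrap both sides of an `eq` expectation.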

weilandia commented 5 years ago

@pzaich should be good to go now -- I was running an older version of pdftotext. Decided to just strip the whitespace in the test.