tcr / scissors

PDF manipulation in Node.js! Split, join, crop, read, extract, boil, mash, stick them in a stew.
Apache License 2.0
285 stars 45 forks source link

Features/simple rasterizer #17

Closed joe1chen closed 7 years ago

joe1chen commented 7 years ago

Add more generic imageStream function which allows you to 1) Pass in page number and format for output 2) Add option to use simple rasterizer that skips temp file creation and bounding box code. 3) Add option to use the crop box for generating images from PDF. This results in the most accurate resulting image.

I opted to leave the original rasterizer method as is for backwards compatability.

The original rasterizer has several limitations. It uses this funky bounding box code to crop the resulting image. Not only is this super slow, but it is not a good representation of the original page from the PDF. See samples below.

Added new mocha tests:

mocha -t 7000

  Scissors
    #range()
      1) should extract a range of pdf pages (using a stream)
    #range()
      2) should extract a range of pdf pages (using a promise)
    #jpgStream()
rasterize.js: opening temp file
rasterize.js: closing temp file /var/folders/c7/qj7gqpcj06jbb1hkprks5ny80000gn/T/scissors117415-10822-cq8m6v.xt61kh85mi
rasterize.js: closed.
rasterize.js: [ 69, 30, 527, 769 ]
rasterize.js: ended
rasterize.js: Finished writing image.
      ✓ should extract a single jpg page (using default rasterize) (5965ms)
    #pngStream()
rasterize.js: opening temp file
rasterize.js: closing temp file /var/folders/c7/qj7gqpcj06jbb1hkprks5ny80000gn/T/scissors117415-10827-yyne5m.4o2loc4n29
rasterize.js: closed.
rasterize.js: [ 69, 30, 527, 769 ]
rasterize.js: ended
rasterize.js: Finished writing image.
      ✓ should extract a single png page (using default rasterize) (6073ms)
    #jpgStream()
simple_rasterize.js: Finished writing image.
      ✓ should extract a single jpg page (using simple rasterize) (622ms)
    #pngStream()
simple_rasterize.js: Finished writing image.
      ✓ should extract a single png page (using simple rasterize) (891ms)
    #jpgStream()
simple_rasterize.js: Finished writing image.
      ✓ should extract a single jpg page using crop box (using simple rasterize) (619ms)
    #pngStream()
simple_rasterize.js: Finished writing image.
      ✓ should extract a single png page using crop box (using simple rasterize) (876ms)

Output:

Original Rasterizer: _page1_default _page1_default

New simple rasterizer with crop box _page1_simple_crop_box _page1_simple_crop_box

New simple rasterizer without crop box _page1_simple _page1_simple

joe1chen commented 7 years ago

I just noticed that some other commits from @rikkertkoppes got pulled in. Please let me know if you want me to rebase to remove those commits.

cboulanger commented 7 years ago

No, these changes are fine - I had even asked @rikkertkoppes to open a PR. So that's taken care of! I am glad everything merges without any problems so far.

joe1chen commented 7 years ago

By the way, I noticed that the 2 existing tests were failing even though I increased the timeouts to about 20 seconds. Are they working for you?

cboulanger commented 7 years ago

They were failing for me too, but that was due to a buggy Mac OS build of PDFTK. But I just realised something went wrong merging. Let me clear that up first and then I'll get back to you...

rikkertkoppes commented 7 years ago

Nice to see renewed progress here. I no longer work at theapsgroup where I made those commits and also quite rarely work with pdf workflows anymore.

However, if there are any questions about my commits, let me know, I'll be happy to help out

cboulanger commented 7 years ago

Ok, the merge removed all of my own commits and I had to revert it. I am fairly new to the more advanced features of GitHub. I probably have to cherry-pick the individual commits? Maybe a rebase helps? I am happy to include all of your work here...

cboulanger commented 7 years ago

Hm, the two tests pass here. But let's finish merging the code first, then cleanup, and then we can test the hell out of this little jewel.

cboulanger commented 7 years ago

@rikkertkoppes Thank you. It would be super helpful if you could create a pull request that only contained the following two commits

But if that is too much trouble, do not bother, then I will cherry-pick the commits, I need to learn how to do that anyways.

rikkertkoppes commented 7 years ago

I would have to clone and cherry pick as well as I don't have access to the original repo anymore. I can do that later this week if you want.

On Mon, 15 May 2017, 21:41 Christian Boulanger, notifications@github.com wrote:

@rikkertkoppes https://github.com/rikkertkoppes Thank you. It would be super helpful if you could create a pull request that only contains the following two commits

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tcr/scissors/pull/17#issuecomment-301582056, or mute the thread https://github.com/notifications/unsubscribe-auth/AASe5VM0jnFQWTzsRkvzOoqrHgHINPobks5r6KqBgaJpZM4NbfTe .

cboulanger commented 7 years ago

Thanks! That'd be great. I can edit the changes in by hand, but then you wouldn't get credit for it. I'd rather merge properly to acknowledge who has been contributing.

cboulanger commented 7 years ago

@rikkertkoppes Hi, I think I have merged all enhancements except encryption and PDFTKCMD constant and will be releasing a new version soon. Would you have time to create a PR with these two enhancements, so that you'll get the credit -- or blame :-) -- for it? If you're at it, a mocha test would be great, either in the pages.js script or in a separate script. Thank you!