Get Page Size from Ghostscript

tavinus commented 5 years ago

Time to kill all dependencies

We finally have a solution for getting the page size with ghostscript.
Thank you Stefan Dragnev!
https://stackoverflow.com/a/52644056/1273636

PROS

No need for external dependencies anymore (pdfinfo, identify, etc)
Should never fail on any PDF
We have the sizes of ALL pages

CONS

Seems a bit slower than the other methods

I THINK part of the slowdown is that it traverses every page and prints each size.
That means a version that only reads the size of the first page could be a lot faster
(need to try and test).
For most operations we only use the size of the first page anyway.
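
A first-page-only variant might look like the sketch below. It is untested for speed. `first_page_size` is a hypothetical helper name, and the sketch assumes `pdfgetpage` accepts the page number 1 directly, the same way it receives the loop index in the all-pages command. On newer Ghostscript releases where -dSAFER is the default, the `file` operator may be refused unless file access is allowed (for example with -dNOSAFER); that also needs checking.

```shell
# Sketch: print only the first page's MediaBox instead of looping over all pages.
# first_page_size is a hypothetical helper name, not part of pdfScale.
first_page_size() {
    # $1 = path to the PDF; -sFileName lets the shell do the quoting
    gs -q -dNODISPLAY -dBATCH -sFileName="$1" -c \
        "FileName (r) file runpdfbegin 1 pdfgetpage /MediaBox get {=print ( ) print} forall (\n) print"
}
```

Usage would be something like `first_page_size "my file.pdf"`. Whether this is actually faster depends on how much of the file gs reads before it can answer.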

I do want to offer the option for the user to choose the page size though (optional call).

I also want to have the option of listing ALL page sizes (on --info or similar parameter).

Example calls

$ gs -dNODISPLAY -dQUIET -sFileName=../mixsync\ manual\ v1-2-3.A0.SCALED.pdf -c "FileName (r) file runpdfbegin 1 1 pdfpagecount {pdfgetpage /MediaBox get {=print ( ) print} forall (\n) print} for quit"
0 0 3370 2384
0 0 3370 2384
0 0 3370 2384
0 0 3370 2384
0 0 3370 2384
0 0 3370 2384
0 0 3370 2384
0 0 3370 2384

Without -sFileName

$ gs -q -dNODISPLAY -c "(../mixsync\ manual\ v1-2-3.pdf) (r) file runpdfbegin 1 1 pdfpagecount {pdfgetpage /MediaBox get {=print ( ) print} forall (\n) print} for quit"
0 0 841.89 595.29
0 0 841.89 595.29
0 0 841.89 595.29
0 0 841.89 595.29
0 0 841.89 595.29
0 0 841.89 595.29
0 0 841.89 595.29
0 0 841.89 595.29

Seems like -sFileName is a good idea, since it handles spaces in file names.
Also, we may want to include -dBATCH and remove quit from the PS script.
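
A sketch of that combination, with -sFileName, -dBATCH and no quit, plus a small formatter that could feed an --info style listing. `all_page_sizes` and `format_sizes` are hypothetical names, and the output format is only an illustration.

```shell
# Sketch: all page sizes, relying on -dBATCH to exit instead of a trailing quit.
# Both function names are hypothetical, not pdfScale's.
all_page_sizes() {
    # $1 = path to the PDF
    gs -q -dNODISPLAY -dBATCH -sFileName="$1" -c \
        "FileName (r) file runpdfbegin 1 1 pdfpagecount {pdfgetpage /MediaBox get {=print ( ) print} forall (\n) print} for"
}

# Turn "llx lly urx ury" lines into "page N: W x H pt"
format_sizes() {
    awk '{ printf "page %d: %s x %s pt\n", NR, $3 - $1, $4 - $2 }'
}
```

Usage would be `all_page_sizes "my file.pdf" | format_sizes`.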

Need to test/adapt a bit more and also implement it into the adaptive method.
I will probably use this as the second option, if grep fails (if grep is indeed a lot faster).
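
Roughly, the grep-first, gs-second idea could be sketched like this. All three function names are hypothetical and this is not pdfScale's actual adaptive code. The grep pattern assumes the first /MediaBox entry sits uncompressed on a single line, which is exactly the case where grep works.

```shell
# Sketch of "try grep, fall back to Ghostscript". Hypothetical helper names.
size_from_grep() {
    # -a: treat the PDF as text, -m1: stop at the first matching line,
    # -o: print only the matched "/MediaBox [...]" part
    grep -a -m1 -o '/MediaBox *\[[^]]*]' "$1" | head -n 1 |
        sed -e 's|^/MediaBox *\[ *||' -e 's| *]$||'
}

size_from_gs() {
    # First page only; the trailing space printed by the PS loop is trimmed
    gs -q -dNODISPLAY -dBATCH -sFileName="$1" -c \
        "FileName (r) file runpdfbegin 1 pdfgetpage /MediaBox get {=print ( ) print} forall (\n) print" |
        sed 's| *$||'
}

page_size() {
    size=$(size_from_grep "$1")
    # Fall back to Ghostscript when grep finds nothing
    [ -n "$size" ] || size=$(size_from_gs "$1")
    printf '%s\n' "$size"
}
```

Forcing a specific mode would then just mean calling one of the two helpers directly instead of `page_size`.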

I will also probably leave the choice to force the external modes (e.g. -m pdfinfo).
It COULD be useful in specific cases.

tavinus commented 5 years ago

Quoting his reply

Here's a breakdown of the command:

FileName (r) file  % open file given by -sFileName
runpdfbegin        % open file as pdf
1 1 pdfpagecount { % for each page index
  pdfgetpage       % get pdf page properties (pushes a dict)
  /MediaBox get    % get MediaBox value from dict (pushes an array of numbers)
  {                % for every array element
    =print         % print element value
    ( ) print      % print single space
  } forall
  (\n) print       % print new line
} for
quit               % quit interpreter. Not necessary if you pass -dBATCH to gs

Replace /MediaBox with /CropBox to get the crop box.
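
Since /CropBox is optional in a PDF (when it is missing, the crop box defaults to the media box), a plain `get` on it could raise an error. The sketch below picks the box by name and falls back to /MediaBox using the PostScript `known` operator. `first_page_box` is a hypothetical helper, and it assumes the dict pushed by pdfgetpage simply lacks the key when the page does not define that box.

```shell
# Sketch: read a named box from page 1, falling back to /MediaBox.
# first_page_box is a hypothetical helper name.
first_page_box() {
    # $1 = path to the PDF, $2 = box key without the slash (default: MediaBox)
    box=${2:-MediaBox}
    gs -q -dNODISPLAY -dBATCH -sFileName="$1" -c \
        "FileName (r) file runpdfbegin 1 pdfgetpage dup /$box known { /$box get } { /MediaBox get } ifelse {=print ( ) print} forall (\n) print"
}
```

Usage would be `first_page_box "my file.pdf" CropBox`.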

tavinus commented 2 months ago

Been testing this.
The main problem is that GS can be quite slow.

This test ran 1000 iterations of each method:

Grep is A LOT FASTER (but does not always work).

2 seconds vs 81 seconds is kind of a joke. This will get even worse with big PDF files. The main reason is that GS will load the entire file before it can parse the info, while grep just reads the file until it finds the /MediaBox entry near the beginning (and then it stops).
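
For anyone who wants to repeat the comparison, a timing loop along these lines would do. This is not the harness used for the numbers above, `bench` is a hypothetical helper, and it only has one-second resolution.

```shell
# Sketch: run a command N times and print the elapsed wall-clock seconds.
# bench is a hypothetical helper name.
bench() {
    # $1 = number of iterations, remaining arguments = command to run
    n=$1
    shift
    i=0
    start=$(date +%s)
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null 2>&1 || true   # ignore failures, only time matters
        i=$((i + 1))
    done
    end=$(date +%s)
    echo "$((end - start))"
}
```

For example, `bench 1000 grep -a -m1 '/MediaBox' file.pdf` against the same count of gs calls on the same file.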

It will be good to have a fail-safe portable solution for getting the page size though.

tavinus commented 2 months ago

Functionality added on v2.6.0

tavinus / pdfScale

Get Page Size from Ghostscript #10
