pofider / phantom-html-to-pdf

Highly scalable html to pdf conversion using phantom workers
MIT License
159 stars 33 forks source link

support waitForJS execution in phantom web page (callback based) #15

Closed bjrmatos closed 9 years ago

bjrmatos commented 9 years ago

this implementation doesn't invoke the PDF rendering process until the phantom web page invoke the callback PHANTOM_HTML_TO_PDF_DONE, this function communicates with the phantom process (via a fake request) and let the rendering process to start.

this implementation is more concerned about the jsreport use case (or scenarios where the user has control of the html -> local html) .. i prefer this implementation because it let the user invoke a method, this method could be used to pass some data to the phantom process (ej: PHANTOM_HTML_TO_PDF_DONE({ customData: 'xxxxx' });) and do something.. i don't know a use case for it but we could find something in the future :), also it is more elegant (function invocation instead of global condition in the webpage) and with more performance (instead of evaluate the page every x times checking for some condition to meet)

i will add the new parameter to the docs if we decide that this should be merged.

also, i think we should support the approach that #14 uses (eval based).. it is more useful when converting external pages (http://google.com), waiting for some condition to meet is the best way to do it in external pages.

right now, my implementation (callback based) uses a boolean configuration (waitForJS: true), in order to support the other implementation (eval based) i think we can accept a string (waitForJS: 'return window.pageLoaded === true;') and based on it resolve with the corresponding implementation (callback/eval based)

PaddyMann commented 9 years ago

Looks good :)

Suggested tweaks:

  1. Split out the 2 pieces of functionality: waitForRender: true (wait for the page to render), and afterPageLoad: function () {} (execute a custom function against the page after the page has loaded, waiting for render if waitForRender is true)
  2. Rename PHANTOM_HTML_TO_PDF_DONE to PHANTOM_HTML_TO_PDF_READY (we're ready to convert the page, but we're done with converting it yet)
  3. Don't make PHANTOM_HTML_TO_PDF_READY a function. It will cause an error when the page isn't being converted. We should still be able to inject and execute a custom function at the end of the page without it being called within the page's pre-existing JS. Instead make it a variable and wait for it to return true.
  4. Park afterPageLoad or add to a separate pull request so we can add the waitForRender functionality sooner :)

As you've identified, this solution relies on access to the target page code. You might want to also support:

For myself, and probably the majority of users, simply supporting the ability to set PHANTOM_HTML_TO_PDF_READY within the page would meet my needs perfectly. So I'd actually rather see any further customization parked and this feature integrated ASAP :)

How quickly could we get this integrated and added to the npm package?

PaddyMann commented 9 years ago

You might also want to add a test that the page times out rather than renders if window.PHANTOM_HTML_TO_PDF_READY is never set to true

bjrmatos commented 9 years ago
  1. Split out the 2 pieces of functionality: waitForRender: true (wait for the page to render), and afterPageLoad: function () {} (execute a custom function against the page after the page has loaded, waiting for render if waitForRender is true)

you mean adding this two features? waiting for the page to render/load is something different that waiting for page's javascript execution, but if it is something related with the name of the configuration (waitForRender vs waitForJS) i prefer waitForJS since this feature is related to javascript execution and want to be explicit on that

  1. Rename PHANTOM_HTML_TO_PDF_DONE to PHANTOM_HTML_TO_PDF_READY (we're ready to convert the page, but we're done with converting it yet)

i'm ok with that :)

  1. Don't make PHANTOM_HTML_TO_PDF_READY a function.

i prefer a function here, because it is more explicit and allows other possibilities (as explained before)

It will cause an error when the page isn't being converted.

what do you mean with that? if someone adds waitForJS: true in config is because it has total control of the page (it's not an external page), and must invoke the callback (PHANTOM_HTML_TO_PDF_DONE), if the user don't do that, the process will fail because the timeout.. other than that i don't see a scenario when the process fails (we can improve the timeout error message when waitForJS: true)

Customizable variables (so I can choose to use window.renderComplete rather than window.PHANTOM_HTML_TO_PDF_READY.

i was thinking the same.. lets wait for the author's feedback before do anything

Searching for elements to check for page load. e.g. checking if an element with a given ID exists, or an element with a given css class containing certain text ... which leads to supporting a function and allowing the user to customize what they want (like my pull request..). But would love to see this done without an eval :)

that functionality has other semantics than waiting for page's javascript execution and deserves it's own PR.. it is more related to waitForRender, which is different than waitForJS, waitForJS is designed for the user can explicitly indicate that the page and its custom javascript has ended.

after thinking this, instead of support the eval based implementation in waitForJS, we should support something like waitForRender (related to page load, waiting for the precense of element, checking if an element with a given ID exists, etc) which has other semantics than waitForJS and deserves its own configuration, and should be discussed in another issue/PR

PaddyMann commented 9 years ago

Happy to use an alternative name if waitForRender is confusing. The reason I chose it, is because we're waiting for the PDF to be ready to produce. For us, the rendering of the report includes the JS making changes (e.g. applying translations). If the JS hasn't yet finished, we would say it hasn't finished rendering. But simply waitFor / waitForJS fine (could use waitFor and then accept boolean, string or function...)


Adding a function call to a function on your page that doesn't exist will cause an error: Uncaught TypeError: window.PHANTOM_HTML_TO_PDF_DONE is not a function. This error won't occur if we access the page using phantom-html-to-pdf, but will occur if we simply open the page in (e.g.) Chrome. Which would be an issue for us.

The second issue I can imagine, is how do you make sure that our phantom-html-to-pdf script has inserted the function before it is called?

bjrmatos commented 9 years ago

hey @pofider i think i found a bug in phantom-workers (the callback for phantom.execute is called two times and in different order, probably a race condition..) when trying to implement a timeout test for waitForJS.

i have published the code in https://github.com/bjrmatos/phantom-html-to-pdf/tree/timeout-test-waitForJS, just run npm test multiple times to reproduce the errors

some screenshoots:

captura de pantalla 2015-10-11 a las 12 28 04 p m captura de pantalla 2015-10-11 a las 12 36 40 p m

as you see, the callback is called two times and the errors come in different order

node: 0.10.40 os: OSX

bjrmatos commented 9 years ago

The second issue I can imagine, is how do you make sure that our phantom-html-to-pdf script has inserted the function before it is called?

because i have used:

page.onInitialized = function() {
    // inject function to the page in order to the client can instruct the ending of its JS
    if (body.waitForJS) {
        page.evaluate(function() {
            window.PHANTOM_HTML_TO_PDF_DONE = function() {
              var scriptNode = document.createElement("script");
              scriptNode.src = 'http://intruct-javascript-ending';
              document.body.appendChild(scriptNode);
            };
        });
    }
};

the script will be always defined in page before any user's code, see http://phantomjs.org/api/webpage/handler/on-initialized.html (specifically this part: This callback is invoked after the web page is created but before a URL is loaded.)

Adding a function call to a function on your page that doesn't exist will cause an error: Uncaught TypeError: window.PHANTOM_HTML_TO_PDF_DONE is not a function. This error won't occur if we access the page using phantom-html-to-pdf, but will occur if we simply open the page in (e.g.) Chrome.

yeah, i know that, but if your intention is to use your page for other reasons than generating PDF you can do:

// or typeof window.PHANTOM_HTML_TO_PDF_DONE !== "function"
if (typeof window.PHANTOM_HTML_TO_PDF_DONE !== "undefined") {
  window.PHANTOM_HTML_TO_PDF_DONE();
}
PaddyMann commented 9 years ago

fair enough. I'm now happy with this for my use-case :)

A polling-based solution:

This looks good as it meets my specific use-case, doesn't use eval, and may have a performance benefit (though I'm unclear on whether this would be a non-trivial improvement)

My preference is to get either implemented in the codebase ASAP. Thanks for your efforts @bjrmatos :)

bjrmatos commented 9 years ago

Exactly! A eval/polling based solution is the best for external pages. Thnks to you for being patient

bjrmatos commented 9 years ago

@pofider i have found the solution for the bug.. will submit a PR in phantom-workers tomorrow (now it is too late in my country)

pofider commented 9 years ago

rename PHANTOM_HTML_TO_PDF_DONE to _READY

Yes

function vs variable for PHANTOM_HTML_TO_PDF_READY

Did you also consider using an object property setter? It won't fail when the property is not defined and still allows you to set a custom parameter through it without polling.

page.onInitialized = function() {   
    if (body.waitForJS) {
        page.evaluate(function() {
            Object.defineProperty(window, 'PHANTOM_HTML_TO_PDF_READY', {
                set: function(val) {
                    if (!val)
                        return;

                    var scriptNode = document.createElement("script");
                    scriptNode.src = 'http://intruct-javascript-ending';
                    document.body.appendChild(scriptNode);
                }
            });          
        });
    }
};

custom name for PHANTOM_HTML_TO_PDF_READY

Yes, why not.


The approach suggested by @bjrmatos is really great and I will happily merge it and release it when it is done. Afterwards we can take a look at the polling and eval based implementation for the external html use case.

Big thanks to both of you.

bjrmatos commented 9 years ago

wonderful news! i will finish the implementation in the next hours (after work..), the object property setter is a great solution! :tada:

bjrmatos commented 9 years ago

done! i want to add a timeout test but phantom-workers has a little bug.. i will submit the patch in phantom-workers and come back for the timeout test

bjrmatos commented 9 years ago

ok PR submitted, when a new version of phantom-workers (when pofider/phantom-workers#8 is resolved) is published on npm i will update the phantom-workers version in dependencies and add the timeout test.. all will be ready to merge

PaddyMann commented 9 years ago

Good job!

If the update is available on npm by Friday, I'll start using it against our site then. Else I'm on holiday next week and will use it when I return :)

bjrmatos commented 9 years ago

we're ready! :)

pofider commented 9 years ago

Published. Thanks.