maaaaz / webscreenshot

A simple script to screenshot a list of websites
GNU Lesser General Public License v3.0
654 stars 162 forks

Wait for some time (until the webpage has fully loaded) before taking the screenshot #22

Closed: chenyixin-2 closed this issue 5 years ago

chenyixin-2 commented 5 years ago

Hi, I am not very familiar with the PhantomJS and Chrome APIs. How should I change the source code so that the screenshot is taken after the webpage has fully loaded?

maaaaz commented 5 years ago

Hello,

To my understanding, there's no easy way to know whether a page is fully loaded or not. That's why I chose the lazy rendering method, which gives good results.

So try to play with the -t timeout option.

Otherwise:

Cheers.

0xmilan commented 4 years ago

The -t timeout option has no effect on this, the default value is 30 seconds already.

I've been experimenting with increasing ajaxTimeout and maxTimeout in webscreenshot.js. Here is an example with the default values of 400 and 800: [screenshot: berkeley_empty]

Here is the screenshot after adding 1000 to both values (1400 and 1800): [screenshot: berkeley_full]

Can we add an option (-W, --wait) to pass these values to the python script?
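The thread does not show webscreenshot.js itself, so the following is only a sketch of how two such timeouts could interact. It assumes that ajaxTimeout is a quiet period that restarts whenever network activity is seen, and that maxTimeout is a hard cap after which the capture fires regardless. The function names and the sample timings are hypothetical.

```python
def capture_time(activity_ms, ajax_timeout, max_timeout):
    """Return the time (ms after page load) at which the screenshot fires.

    Assumed model, not taken from webscreenshot.js:
    - a quiet timer of ajax_timeout ms starts at page load (t = 0);
    - each network event seen before the timer expires restarts it;
    - the capture never waits longer than max_timeout ms.
    """
    deadline = ajax_timeout
    for t in sorted(activity_ms):
        if t >= deadline or t >= max_timeout:
            break  # the capture has already fired before this event
        deadline = t + ajax_timeout
    return min(deadline, max_timeout)


def content_captured(ready_ms, activity_ms, ajax_timeout, max_timeout):
    """True if content that becomes ready at ready_ms is in the screenshot."""
    return ready_ms <= capture_time(activity_ms, ajax_timeout, max_timeout)


if __name__ == "__main__":
    # Hypothetical page: three late AJAX responses, the last one at 900 ms.
    events = [300, 600, 900]
    for ajax, cap in [(400, 800), (1400, 1800)]:
        print(ajax, cap, capture_time(events, ajax, cap),
              content_captured(900, events, ajax, cap))
```

Under this model, a page whose last piece of content arrives after 800 ms is cut off by the 400/800 settings but captured with 1400/1800, which matches the empty versus full screenshots described above.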

maaaaz commented 4 years ago

@milangfx, thanks. When you increased these values, did you encounter more screenshot failures due to the -t timeout value conflicting with the PhantomJS ones (that is, more screenshots failing to finish because they wait longer)? Do you think 1400 and 1800 could safely be used as default values?

0xmilan commented 4 years ago

Good questions. I didn't have any failures. I was not using the -t timeout option; I only changed the values in webscreenshot.js. I think the 1.4 s and 1.8 s wouldn't conflict with the default 30 s timeout. I'm not even sure how they relate to each other. I assume the main timeout option (default 30 s) is only relevant to the Chrome and Firefox renderers, since PhantomJS has its own settings in webscreenshot.js.

If I remember correctly, at one point a page didn't fully load even with 1400 and 1800, so a bit higher values might be needed for consistent results, something like 2400 - 2800 (?)

I've only tested this with individual URLs so far. I will check the increased timeouts with a huge list of URLs and compare the results to the default values.

My only concern is that this could potentially increase the run time a lot if multiple URLs don't load immediately (or before the default 400 - 800). So I'm not sure yet about using 1400 and 1800 as default values.

maaaaz commented 4 years ago

> Good questions. I didn't have any failures. I was not using the -t timeout option; I only changed the values in webscreenshot.js. I think the 1.4 s and 1.8 s wouldn't conflict with the default 30 s timeout. I'm not even sure how they relate to each other. I assume the main timeout option (default 30 s) is only relevant to the Chrome and Firefox renderers, since PhantomJS has its own settings in webscreenshot.js.

No, the -t option applies to any renderer: if the renderer reaches that timeout, a SIGKILL is sent to the process.
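A minimal sketch of that kill-on-timeout behaviour, using only the standard library. `run_renderer` is a hypothetical helper, not code from webscreenshot.py, and the renderer command is a stand-in.

```python
import subprocess
import sys


def run_renderer(cmd, timeout_s):
    """Run a renderer command and kill it if it exceeds timeout_s seconds.

    Returns the process exit code, or None if the process had to be killed.
    Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX systems
        proc.wait()  # reap the killed process
        return None


if __name__ == "__main__":
    # Stand-in for a renderer that hangs: a child Python process that sleeps.
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_renderer(slow, timeout_s=0.5))
```

With a 30 s limit of this kind, per-page waits of one or two seconds inside the renderer stay far below the point where the process would be killed.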

> If I remember correctly, at one point a page didn't fully load even with 1400 and 1800, so a bit higher values might be needed for consistent results, something like 2400 - 2800 (?)
>
> I've only tested this with individual URLs so far. I will check the increased timeouts with a huge list of URLs and compare the results to the default values.

Yes, that would be appreciated. Run $ time webscreenshot [options] and don't hesitate to post the execution results.

> My only concern is that this could potentially increase the run time a lot if multiple URLs don't load immediately (or before the default 400 - 800).

I think I already ran this kind of test a long time ago. I don't really remember the results, but that overall increase in duration does ring a bell.

> So I'm not sure yet about using 1400 and 1800 as default values.

If the tests show that the global duration is increased, I'll keep the current values but implement an option to handle these parameters and document somewhere that they should be specified in case of partial screenshots.

0xmilan commented 4 years ago

> No, the -t option applies to any renderer: if the renderer reaches that timeout, a SIGKILL is sent to the process.

What I meant is that if PhantomJS already stops at the 800 ms maxTimeout specified in webscreenshot.js, then the main -t timeout won't be relevant.

I ran three tests on 100 URLs: one with the default timeout values, one with 1000 ms added and one with 1500 ms added.

ajaxTimeout: 400, maxTimeout: 800
    python webscreenshot.py -v -i 100URLs  102,07s user 15,79s system 167% cpu 1:10,30 total
    40 pages loaded, 60 didn't load

ajaxTimeout: 1400, maxTimeout: 1800
    python webscreenshot.py -v -i 100URLs  105,77s user 15,94s system 124% cpu 1:37,95 total
    97 pages loaded, 3 didn't load

ajaxTimeout: 1900, maxTimeout: 2300
    python webscreenshot.py -v -i 100URLs  105,79s user 16,72s system 117% cpu 1:44 /2m-15,5s
    100 pages loaded

So there's a trade-off between run time and pages actually loading. Having a higher max timeout doesn't affect the pages that would load quickly anyway, but obviously having to wait more for individual pages does add up and results in an overall duration increase.

maaaaz commented 4 years ago

I'm not sure I'm reading the figures correctly: is the total time 1m10s (70 s) for the first case, 1m37s (97 s) for the second and 1m44s (104 s) for the third? That's only about a +50% increase in duration for more than +100% successful screenshots.
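The arithmetic behind that comparison can be checked with a short sketch. It uses the rounded wall-clock times quoted here (70 s, 97 s, 104 s) and the loaded-page counts from the test results; the dictionary layout and the `pct_change` helper are just illustrative.

```python
# (total seconds, pages loaded out of 100) per "ajaxTimeout/maxTimeout" setting,
# using the rounded figures quoted in the thread.
runs = {
    "400/800": (70, 40),
    "1400/1800": (97, 97),
    "1900/2300": (104, 100),
}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


base_time, base_loaded = runs["400/800"]
for name, (seconds, loaded) in runs.items():
    print(f"{name}: duration {pct_change(base_time, seconds):+.1f}%, "
          f"loaded pages {pct_change(base_loaded, loaded):+.1f}%")
```

Going from 400/800 to 1900/2300 works out to roughly +49% duration for +150% loaded pages, consistent with the estimate above.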

It is worth it, the primary goal of such tool is to perform the maximum number of successful screenshots.

Execution time is already addressed through multiprocessing; it cannot, and need not, be optimized further at the cost of fewer successful results.

So I might use the 1900/2300 values and offer a user option to specify them.

Cheers.

0xmilan commented 4 years ago

> the total time is 1m10s (70s)for the first case, 1m37s (97s) for the second and 1m44s (104s) for the third one ?

Correct.

> It's only +50% duration increase for more than +100% successful screenshots.

Yeah, depends on how you define successful. In my example above, the blank page technically loaded successfully, but there was important content missing since I also wanted the mailing lists to show up so I had to wait a bit longer.

This is just a test with Google Groups, really. Other pages might behave differently. For example it might be that you have everything important already loaded with the default 400 - 800 timeouts and increasing that would only load more ads on the page. I don't know.

What's important content will always depend on the user. Maybe the user wants the ads to load and see how they are displayed.

If you want to set a higher default, I would go for around ajaxTimeout: 1400, maxTimeout: 1800. Then either let users know in the README how to change it manually in webscreenshot.js if they don't see the results they want, or wire the timeout values to a command line option.

Too high a default max timeout can hang the process unnecessarily, e.g. if an ad server is not responding.

maaaaz commented 4 years ago

Got it, that's clear.

maaaaz commented 4 years ago

--ajax-max-timeouts option added and default values changed in v2.8
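The thread only names the option, so the following is a sketch of how such a flag could be parsed rather than the actual implementation. The comma-separated `AJAX,MAX` value format, the validation rules, and the 1400/1800 defaults (the values suggested earlier in this thread) are all assumptions.

```python
import argparse


def parse_timeouts(value):
    """Parse an 'AJAX,MAX' string into a pair of ints in milliseconds.

    The 'AJAX,MAX' format is an assumption made for this sketch.
    """
    parts = value.split(",")
    if len(parts) != 2:
        raise argparse.ArgumentTypeError("expected two values: AJAX,MAX")
    try:
        ajax, maximum = (int(p) for p in parts)
    except ValueError:
        raise argparse.ArgumentTypeError("timeouts must be integers (ms)")
    if ajax <= 0 or maximum < ajax:
        raise argparse.ArgumentTypeError("need 0 < AJAX <= MAX")
    return ajax, maximum


def build_parser():
    """Build a parser exposing only the timeout option (illustrative)."""
    parser = argparse.ArgumentParser(description="timeout option sketch")
    parser.add_argument(
        "--ajax-max-timeouts",
        type=parse_timeouts,
        default=(1400, 1800),  # assumed defaults, taken from the discussion
        metavar="AJAX,MAX",
        help="per-screenshot ajax and max timeouts in milliseconds",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--ajax-max-timeouts", "1900,2300"])
    print(args.ajax_max_timeouts)
```

Validating the pair at parse time means a malformed value is rejected before any renderer process is started.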

0xmilan commented 4 years ago

Thanks for the quick implementation! Works like a charm.

maaaaz commented 4 years ago

Thank you for your feedback @milangfx