pevers / images-scraper

Simple and fast scraper for Google
ISC License

Scraper hangs if page takes too long to render #36

Closed ururk closed 4 years ago

ururk commented 4 years ago

I'm running the latest versions of Node.js (12) and images-scraper on macOS Mojave.

When running the code, sometimes the page takes too long to render - so when it is instructed to scroll there are no scrollbars and it cannot reach the end of the page. I've "fixed" this by adding a one-second delay:

await page.setUserAgent(self.userAgent);
await page.waitFor(1000); // Add one second delay to ensure scrollbars are present

I'm not all that familiar with puppeteer, so I'm not sure if this works the way I think it does, or whether there is a better way to do this. However, I can now reliably run searches without it stopping (100 test runs so far).
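A fixed delay like the one above either waits longer than needed or not long enough. One alternative is to wait for a condition instead. The following is only a minimal sketch: the helper name `waitUntilScrollable` and the "document is taller than the viewport" check are assumptions, not part of images-scraper. It relies on puppeteer's `page.waitForFunction` and on its timeout error carrying the name `TimeoutError`.

```javascript
// Hypothetical helper (not part of images-scraper): wait until the document
// is taller than the viewport, i.e. there is something to scroll.
// Resolves to true when the page is scrollable, false if the wait timed out.
async function waitUntilScrollable(page, timeoutMs = 5000) {
  try {
    await page.waitForFunction(
      // Runs inside the browser page, not in Node.
      () => document.documentElement.scrollHeight > window.innerHeight,
      { timeout: timeoutMs }
    );
    return true;
  } catch (err) {
    // Assumption: puppeteer rejects with an error named 'TimeoutError'
    // when the condition never becomes true within the timeout.
    if (err && err.name === 'TimeoutError') return false;
    throw err;
  }
}
```

It would be called where the delay was added, for example `await waitUntilScrollable(page);` right after `page.setUserAgent(...)`. A `false` result lets the caller decide whether to retry or give up rather than scrolling a page that has not rendered.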

ururk commented 4 years ago

The other possibility is that ksb._kvc sometimes doesn't appear on the page. Definitely not this. I also had to bump the delay up to 2 seconds, which seems to be working much better.

pevers commented 4 years ago

Thanks for finding this issue @ururk

I think it would be better to wait for page load. That can be achieved by doing this as part of the goto.

await page.goto(query.replace('%', encodeURIComponent(self.keyword)), {
  waitUntil: 'networkidle0'
});
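
Since the original report is about the scraper hanging, it may also help to give the navigation an explicit timeout and handle it. This is a rough sketch, not the module's actual code: `buildSearchUrl`, `gotoAndSettle`, and the example URL template are hypothetical names. The `waitUntil` and `timeout` options are standard `page.goto` options.

```javascript
// Hypothetical helper: substitute the URL-encoded keyword for the first '%'
// placeholder in the query template, as in the snippet above.
function buildSearchUrl(template, keyword) {
  return template.replace('%', encodeURIComponent(keyword));
}

// Hypothetical helper: navigate and wait for the network to go idle,
// but give up after `timeoutMs` instead of waiting indefinitely.
async function gotoAndSettle(page, url, timeoutMs = 30000) {
  try {
    await page.goto(url, { waitUntil: 'networkidle0', timeout: timeoutMs });
    return { loaded: true };
  } catch (err) {
    // Assumption: puppeteer's navigation timeout error is named 'TimeoutError'.
    if (err && err.name === 'TimeoutError') {
      return { loaded: false, reason: err.message };
    }
    throw err;
  }
}

// Example with a made-up template:
// buildSearchUrl('https://example.com/search?q=%', 'red apple')
//   returns 'https://example.com/search?q=red%20apple'
```

Returning a result object instead of throwing on timeout is just one design choice; it lets the caller retry the search without wrapping every call in its own try/catch.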

ururk commented 4 years ago

Great! I'm still getting the same issue though - the page loads but does not find any results. What I've found is that every time this happens, the meta JSON does not load on the page (no .rg_meta element), nor are there any .rg_l elements. I see some errors in the web browser console:

Refused to execute inline script because it violates the following Content Security Policy directive: "script-src 'report-sample' 'nonce-5fBQzWdGZVpdbsJiddVttw' 'unsafe-inline'". Note that 'unsafe-inline' is ignored if either a hash or nonce value is present in the source list.

addScriptContent @ VM52 __puppeteer_evaluation_script__:7

No network errors, though, so I'm at a slight loss as to why this is happening. I even tried updating to the latest puppeteer (which includes a newer Chromium) and am still having problems.
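
A quick way to detect this state from the script itself is to count the elements mentioned above after the page loads. This is a diagnostic sketch only: the helper name `countResultElements` is made up, and it assumes `.rg_meta` and `.rg_l` are the selectors the results depend on, as described in this comment. `page.$$eval` is a standard puppeteer call and passes an empty array to the callback when nothing matches.

```javascript
// Hypothetical diagnostic: report how many result elements are on the page.
// `empty` is true when neither selector matched anything, which is the
// failure state described above.
async function countResultElements(page) {
  const count = (selector) => page.$$eval(selector, (els) => els.length);
  const [meta, links] = await Promise.all([count('.rg_meta'), count('.rg_l')]);
  return { meta, links, empty: meta === 0 && links === 0 };
}
```

Logging the returned object on every run makes it easier to tell a page that rendered without results from one that is simply slow.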

pevers commented 4 years ago

Thanks for investigating, @ururk. I can't seem to reproduce it.

The error you see might be resolved by doing this: https://github.com/puppeteer/puppeteer/blob/master/docs/api.md#pagesetbypasscspenabled

Can you please try? Thanks, I'm reopening this issue.
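
For reference, the linked API is `page.setBypassCSP(enabled)`. A minimal sketch of how it might be wired in follows; the helper name `openPageBypassingCsp` is hypothetical, and this is not necessarily how the module ended up using it. Per the puppeteer docs, the bypass is applied when CSP is initialized, so it usually has to be set before navigating.

```javascript
// Hypothetical helper: open a page with CSP bypass enabled, then navigate.
async function openPageBypassingCsp(browser, url) {
  const page = await browser.newPage();
  // Must be called before page.goto for the bypass to take effect.
  await page.setBypassCSP(true);
  await page.goto(url, { waitUntil: 'networkidle0' });
  return page;
}

// Usage (assumes puppeteer is installed):
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch();
//   const page = await openPageBypassingCsp(browser, 'https://example.com/');
```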

pevers commented 4 years ago

I managed to reproduce it, @ururk. I think this fixes it. I'll publish it in a new version shortly.

ururk commented 4 years ago

This appears to fix it - I'll run some test searches and let you know how it goes.

ururk commented 4 years ago

I've done about 50 searches - this fixes the problem! I still have random Chromium crashes, but I don't think it's related to this node module's code.