pevers / images-scraper

Simple and fast scraper for Google
ISC License

The scraper is getting stuck on the "Accept Cookies" page of Google #95

Closed laserman120 closed 1 year ago

laserman120 commented 1 year ago

When starting the search, the scraper launches Chrome but gets stuck on Google's "Accept cookies" page. It then resizes and closes, as it cannot find any images.

https://gyazo.com/13510e16f0993ee936bbe316b9cb08b4

Here is an example of starting the Chrome instance and getting stuck.

rolyPolyVole commented 1 year ago

This happened to me too. The accept-cookies page triggers this check:

    const isScrollable = await this._isScrollable(page);
    if (!isScrollable) {
      console.log('No results on this page');
      return;
    }
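For reference, a scrollability check like that typically compares the page's scroll height against the viewport height. The sketch below is only a guess at what _isScrollable does, not the library's actual source, but it shows why a consent page that fits in a single viewport would trip it:

    // Hypothetical sketch of a scrollability check (an assumption, not the
    // actual implementation): the consent interstitial fits in one
    // viewport, so a check like this returns false and the scrape aborts.
    async function isScrollable(page) {
      return page.evaluate(() => document.body.scrollHeight > window.innerHeight);
    }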

laserman120 commented 1 year ago

So after a bit of reading about Puppeteer, I put this together:

    // Google Accept All
    // Search for the button with the text "Accept all"
    const [button] = await page.$x("//button[contains(., 'Accept all')]");
    if (button) {
      // Press the button and wait until the page finishes loading
      await button.click();
      await page.waitForNavigation({
        waitUntil: 'networkidle0',
      });
    }

I put that into the node_modules\images-scraper\src\google\scraper.js file, right below these lines:

    const page = await browser.newPage();
    await page.setBypassCSP(true);
    await page.goto(query, {
      waitUntil: 'networkidle0',
    });

This certainly isn't the fastest approach, but with my limited knowledge it is at least a temporary fix for anyone who has the same issue. It will increase the time it takes to start the search, though I am not sure another approach would be significantly faster, since the page now needs to load twice.
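One possible way to cut part of that cost (an untested sketch; the timeout value and the Promise.all pattern are my assumptions, not code from this repo) would be to wait for the consent button with a short timeout instead of always paying for a second networkidle0 load:

    // Sketch: wait briefly for the consent button; if no consent page
    // shows up, the short timeout keeps the normal path fast.
    // waitForXPath throws on timeout, hence the try/catch.
    try {
      const button = await page.waitForXPath("//button[contains(., 'Accept all')]", {
        timeout: 2000, // assumed value, tune as needed
      });
      // Click and wait for the resulting navigation together to avoid a race
      await Promise.all([
        page.waitForNavigation({ waitUntil: 'networkidle0' }),
        button.click(),
      ]);
    } catch (err) {
      // No consent page appeared within the timeout; continue as normal
    }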

If the dev sees this: I also opened pull request #96 with the quick fix, so it at least works until someone makes a better version of the fix.

laserman120 commented 1 year ago

Another approach would be to use the code I provided above to accept the cookies. You could then use Puppeteer to store the cookies locally and retrieve them on the next session to skip the "Accept cookies" page. This would probably give faster results, as it won't load the page twice like the current fix does.

This solution would require fs, as it needs to store the cookies locally. Here is my attempt, starting at line 45:

    // Note: this needs fs at the top of the file, e.g. const fs = require('fs');
    const browser = await puppeteer.launch({
      ...this.puppeteerOptions,
    });
    const page = await browser.newPage();
    await page.setBypassCSP(true);

    // Load cookies from a previous session, if any
    if (fs.existsSync('./node_modules/images-scraper/src/google/cookies.json')) {
      const cookies = fs.readFileSync('./node_modules/images-scraper/src/google/cookies.json', 'utf8');
      const deserializedCookies = JSON.parse(cookies);
      await page.setCookie(...deserializedCookies);
    }
    // Cookies retrieved
    await page.goto(query, {
      waitUntil: 'networkidle0',
    });

    // Google Accept All
    // Search for the button with the text "Accept all"
    const [button] = await page.$x("//button[contains(., 'Accept all')]");
    if (button) {
      // Press the button and wait until the page finishes loading
      await button.click();
      await page.waitForNavigation({
        waitUntil: 'networkidle0',
      });
      // Store the cookies for the next session
      const cookies = await page.cookies();
      const cookieJson = JSON.stringify(cookies);
      fs.writeFileSync('./node_modules/images-scraper/src/google/cookies.json', cookieJson);
    }
    // Back to the rest of the code
    await page.setViewport({ width: 1920, height: 1080 });
    await page.setUserAgent(this.userAgent);

From rough tests on my side, it improves the search time by about 1.5 seconds when searching for 25 images. (Tests were conducted with headless: false to check what is actually going on.)

As this approach would need fs, I won't make a pull request for now, but it seems to be the faster solution, even if we have to store and retrieve a file locally.
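For what it's worth, an fs-free alternative might be to let Puppeteer persist the profile itself via the userDataDir launch option, which keeps cookies across sessions without manual serialization. This is a sketch under that assumption (the directory path is just an example) and is untested against this repo:

    // Sketch: point Chromium at a persistent user data directory so
    // cookies (including the consent cookie) survive across sessions.
    // The path is an arbitrary example; any writable directory works.
    const browser = await puppeteer.launch({
      ...this.puppeteerOptions,
      userDataDir: './google-scraper-profile',
    });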

@pevers, would you mind looking over this issue? As this is likely caused by a change on Google's side, it could have left the scraper broken for everyone.

pevers commented 1 year ago

(Quoting laserman120's previous comment in full.)

Thanks for looking into this and the fix! I'll have a look tonight and test it.

pevers commented 1 year ago

This should now be fixed in: https://github.com/pevers/images-scraper/pull/96

Thanks for the fix!