pevers / images-scraper

Simple and fast scraper for Google
ISC License

Example code returns an empty array. #42

Closed: Kaleidosium closed this issue 4 years ago

Kaleidosium commented 4 years ago

I went through debugging hell until I realized that even your example returns only an empty array, at least on my computer. With a fresh install of Node 12, a new project initialized with npm init, and just your dependency and your example code, I was not able to get ANY result whatsoever.

Terminal Things.


➜  test node:(v12.16.1) npm start

test@1.0.0 start /Users/alt/Documents/Code/_private/test
node index.js

results []

pevers commented 4 years ago

Google recently changed the structure it returns. I need to spend some time adjusting the scraper accordingly. Sorry @IamRifki!

Kaleidosium commented 4 years ago

Ah, I see.

yazer79 commented 4 years ago

Google removed the metadata. The current image URLs are now in the JS script, and the link is added to the DOM only after you click the thumbnail to open the preview.

Kaleidosium commented 4 years ago

@yazer79 If that's the case, is it still possible to scrape the images?

pevers commented 4 years ago

> @yazer79 If that's the case, is it still possible to scrape the images?

It is still possible to scrape the images, but unfortunately there will be no metadata attached to them. I also think that the scraper will become a lot slower. I'll give it a shot tonight.

Kaleidosium commented 4 years ago

@pevers Any luck?

pevers commented 4 years ago

> @pevers Any luck?

Not yet. I think I would have to adjust it to click all images after the first page.

Kaleidosium commented 4 years ago

@pevers It's been 13 days, have you found a solution?

scbj commented 4 years ago

I have the same problem with my C# crawler. I still haven't figured out how to retrieve Google images efficiently... It seems to be done on purpose: the biggest crawler does not want us to crawl its content ^^

yazer79 commented 4 years ago

@scbj yes sure, but the links are still rendered when the item is clicked on. I think that's where we should start

scbj commented 4 years ago

@yazer79 Yes indeed it works, I used this technique to recover album covers (square image). It's just slower but it's a solution.

SearchEngine.CrawlImages() in C#

```csharp
private async static Task CrawlImages ()
{
    // The alt prefix below is Google's French-locale text for image results
    var images = browser.Document.GetElementsByTagName("img").ToList()
        .Where(el => el.GetAttribute("alt").StartsWith("Résultat de recherche d'images"))
        .Take(50)
        .ToList();
    var covers = new List<Cover>();
    for (int i = 0; i < images.Count; i++)
    {
        HtmlElement image = images[i];
        // We want square images
        if (image.ClientRectangle.Width != image.ClientRectangle.Height) continue;
        // Open right pane viewer
        image.InvokeMember("click");
        await Task.Delay(350);
        // Retrieve the newly added attribute value and extract the image URL
        string href = image.Parent.Parent.GetAttribute("href");
        string encoded = href.Replace("/imgres?imgurl=", "").Split('&')[0];
        var cover = new Cover(
            rank: i + 1,
            url: encoded.DecodeUrl()
        );
        covers.Add(cover);
        if (covers.Count == 20) break;
    }
    SearchCompleted?.Invoke(covers);
}
```

Kaleidosium commented 4 years ago

Any chance of someone doing this for JS?

scbj commented 4 years ago

@IamRifki Before that, I think the metadata retrieved so far needs to be reviewed. For example, I did not find the MIME type or the original dimensions (ow and oh still exist in data-* attributes but no longer correspond to the original size) 🤷‍♂️

https://github.com/pevers/images-scraper/blob/877c9acdb75e15de9139f565ac46c89fc1359e38/lib/google-images-scraper.js#L62-L71

Kaleidosium commented 4 years ago

Honestly, I think we would have better luck with Yandex images.

scbj commented 4 years ago

> Honestly, I think we would have better luck with Yandex images.

Yes we could, Yandex exposes all the metadata we want:

JSON object in `data-bem` attribute

```json
{
  "serp-item": {
    "reqid": "1584648168944796-1335806668460957398360992-man2-6155-IMG",
    "freshness": "normal",
    "preview": [
      {
        "url": "https://www.pullrequest.com/blog/github-code-review-service/images/github-logo_hub2899c31b6ca7aed8d6a218f0e752fe4_46649_1200x1200_fill_box_center_2.png",
        "fileSizeInBytes": 49971,
        "w": 1200,
        "h": 1200
      },
      {
        "url": "https://www.pullrequest.com/blog/github-code-review-service/images/github-logo_hub2899c31b6ca7aed8d6a218f0e752fe4_46649_1200x1200_fill_box_center_2.png",
        "fileSizeInBytes": 49971,
        "w": 1200,
        "h": 1200
      }
    ],
    "dups": [
      {
        "url": "https://pbs.twimg.com/media/EGim5NPWoAA0DSe.jpg",
        "fileSizeInBytes": 40331,
        "w": 1200,
        "h": 1200
      },
      {
        "url": "https://encodedbicoding.com/wp-content/uploads/2019/08/github-logo_hub2899c31b6ca7aed8d6a218f0e752fe4_46649_1200x1200_fill_box_center_2-1024x1024.png",
        "fileSizeInBytes": 62216,
        "w": 1024,
        "h": 1024
      },
      {
        "url": "https://encodedbicoding.com/wp-content/uploads/2019/08/github-logo_hub2899c31b6ca7aed8d6a218f0e752fe4_46649_1200x1200_fill_box_center_2-768x768.png",
        "fileSizeInBytes": 42610,
        "w": 768,
        "h": 768
      },
      {
        "url": "https://news.fitnyc.edu/wp-content/uploads/2019/07/github-logo_hub2899c31b6ca7aed8d6a218f0e752fe4_46649_1200x1200_fill_box_center_2-400x400.png",
        "fileSizeInBytes": 21964,
        "w": 400,
        "h": 400
      },
      {
        "url": "https://thumbnailer.mixcloud.com/unsafe/60x60/extaudio/7/8/8/4/e6df-cf46-4215-8652-01a224c3e85c",
        "fileSizeInBytes": 1340,
        "w": 60,
        "h": 60
      }
    ],
    "thumb": {
      "url": "//im0-tub-com.yandex.net/i?id=b73623df2718023258a09930a6411585&n=13",
      "size": {
        "width": 320,
        "height": 320
      }
    },
    "snippet": {
      "title": "Code Review as a Service on GitHub",
      "hasTitle": true,
      "text": "...from professional code reviewers as a part of their GitHub workflow. \n",
      "url": "https://www.pullrequest.com/blog/github-code-review-service/",
      "domain": "Pullrequest.com",
      "redirUrl": "http://yandex.com/clck/jsredir?from=yandex.com%3Bimages%2Fsearch%3Bimages%3B%3B&text=&etext=8851.JG6SBOJ42OD7GICSCkcPalAeqijixpkYseMHfW9D70Q.cecaf24f77bc5286274de356bb009abce43d6f0b&uuid=&state=tid_Wvm4RM28ca_MiO4Ne9osTPtpHS9wicjEF5X7fRziVPIHCd9FyQ,,&data=UlNrNmk5WktYejY4cHFySjRXSWhXT2dvRXFMdGp6eUllTUc0NUtqdmV3QXhGemh0ODUtOWkxQTJaVWJrNTFuUGZDWjBhNmdZM056d3pfd0d4X1RtRmcydEM5WmVSa21CSENNbDFRemxhbVAxdTN0NllOcUVMd0xrbmh5eWsyNzdmZVk0QzFmdV94cGNrdlFoOUY5WFh5ZDZTdTA3NW5hag,,&sign=e29933e7f29c5119a214e249ed03f741&keyno=0&b64e=2&l10n=en"
    },
    "detail_url": "/images/search?pos=0&from=tabbar&img_url=https%3A%2F%2Fsoftmap.ru%2Fupload%2Fuf%2Fad5%2Fad574c14aa17a899fd3abbf3cbbec62f.png&text=github&rpt=simage",
    "img_href": "https://www.pullrequest.com/blog/github-code-review-service/images/github-logo_hub2899c31b6ca7aed8d6a218f0e752fe4_46649_1200x1200_fill_box_center_2.png",
    "useProxy": false,
    "pos": 0,
    "id": "8e8fb67d232b769dd33da61fa1a5ae22",
    "rimId": "1159c8023b033db98b17681eab3530ed",
    "docid": "Z958BDDFDE12862A4",
    "greenUrlCounterPath": "8.228.471.241.13.141",
    "counterPath": "thumb/normal"
  }
}
```

But Yandex will not suit everyone. In my opinion, it would be interesting to support both Google and Yandex, selectable as desired. The package name would fit that well!

@pevers Can we do that? What are your thoughts?

Kaleidosium commented 4 years ago

@scbj I think Peter wants to focus on Google.

Kaleidosium commented 4 years ago

@scbj I'm trying to make a yandex version, but I'm stuck at this:

setInterval(() => {
                    // See if we have any results
                    $("data-bem").each((index, element) => {
                        // Check if we've reached the limit
                        if (results.length >= limit) {
                            return resolve(results);
                        }

                        const meta = JSON.parse($(element).find("serp-item"));
                        const item = {
                            width: meta.dups.w,
                            height: meta.dups.h,
                            url: meta.dups.url,
                            thumb_url: meta.thumb.url,
                            thumb_width: meta.thumb.size.width,
                            thumb_height: meta.thumb.size.height,
                        };

                        if (!results.filter(result => result.url === item.url).length) {
                            results.push(item);
                        }
                    });

                    // Check if we've reached the bottom, if yes, exit
                    if ($(window).scrollTop() + $(window).height() == $(document).height()) {
                        return resolve(results);
                    }

                    // Scroll
                    $("html, body").animate({ scrollTop: $(document).height() }, 1000);
                }, 1000);
            });

It's incorrect as it doesn't return any images, is there anything I should change?
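For reference, a minimal sketch of how the parsing step might look, based only on the sample `data-bem` payload quoted earlier in this thread. It assumes the elements are matched with the attribute selector `[data-bem]` (a bare `$("data-bem")` looks for a tag of that name), that the JSON comes from the attribute value itself, and that `preview` and `dups` are arrays. `parseSerpItem` is a hypothetical helper name, not part of the package.

```javascript
// Sketch: turn the raw `data-bem` attribute string into a flat result object.
// Field names follow the sample payload quoted earlier in this thread.
function parseSerpItem(dataBem) {
  const meta = JSON.parse(dataBem)['serp-item'];
  // Other blocks on the page may carry data-bem without a serp-item key.
  if (!meta) return null;
  // `preview` and `dups` are arrays, so pick one entry instead of reading
  // `.w` or `.url` off the array itself.
  const full = (meta.preview && meta.preview[0]) || (meta.dups && meta.dups[0]);
  if (!full) return null;
  return {
    url: full.url,
    width: full.w,
    height: full.h,
    thumb_url: meta.thumb.url,
    thumb_width: meta.thumb.size.width,
    thumb_height: meta.thumb.size.height,
  };
}

// Inside the page this would be applied to every element carrying the attribute:
// $('[data-bem]').each((i, el) => parseSerpItem($(el).attr('data-bem')));
```

Keeping the parsing in a plain function makes it possible to check it against a saved payload without launching a browser.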

scbj commented 4 years ago

> It's incorrect as it doesn't return any images, is there anything I should change?

@IamRifki To make it work with Google you can add this at the beginning of the file:

async function contentScript (limit) {
  // Small helper to pause between clicks
  const delay = ms => new Promise(resolve => {
    setTimeout(resolve, ms)
  })
  const results = []
  const elements = document.querySelectorAll('a[jsaction="click:J9iaEb;"]')
  for (const element of elements) {
    try {
      // Clicking the thumbnail makes Google fill in the href attribute
      element.click()
      await delay(120)
      // Drop the "/imgres?imgurl=" prefix (15 characters), keep the first parameter
      const href = element.getAttribute('href').slice(15).split('&')[0]
      const height = +element.parentElement.getAttribute('data-oh')
      const width = +element.parentElement.getAttribute('data-ow')
      results.push({
        // decodeURIComponent replaces the deprecated unescape()
        url: decodeURIComponent(href),
        height,
        width
      })
    } catch (error) {
      results.push({ error: error.toString() })
    }

    // Stop once the requested number of results is reached
    if (results.length >= limit) {
      break
    }
  }
  return results
}

And replace the page.evaluate(...) call with:

const results = await page.evaluate(contentScript, self.limit);

This code can return a maximum of 100 elements. If you want me to help you on a project with Yandex, I think you should create another issue or a new repository so as not to pollute this issue 😉 At least until @pevers has said what he intends to do regarding the search engine used.

smokes commented 4 years ago

> @scbj yes sure, but the links are still rendered when the item is clicked on. I think that's where we should start

The links are rendered and start with "/imgres?imgurl" when you right click the item. So we don't have to actually load the full res image.
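As an illustration of working with such a link, here is a small sketch that pulls the image URL out of an `/imgres?imgurl=...` href using the standard `URL` API. The function name `extractImageUrl` and the example href in the comment are made up for this sketch; the base URL is only needed because the href is relative.

```javascript
// Sketch: read the `imgurl` query parameter from a relative Google result link,
// e.g. "/imgres?imgurl=https%3A%2F%2Fexample.com%2Fcat.png&..." (made-up sample).
function extractImageUrl(href) {
  // A base is required to parse a relative URL; only the query string matters.
  const parsed = new URL(href, 'https://www.google.com');
  // searchParams.get() returns the percent-decoded value, or null if absent.
  return parsed.searchParams.get('imgurl');
}
```

Compared with slicing a fixed number of characters off the href, this does not depend on `imgurl` being the first parameter.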

pevers commented 4 years ago

> This code can return a maximum of 100 elements. If you want me to help you on a project with Yandex, I think you should create another issue or a new repository so as not to pollute this issue 😉 At least until @pevers has said what he intends to do regarding the search engine used.

@scbj I think it is a good idea to support multiple search engines. Yandex is built for it. It is a constant game of cat and mouse, and as you might have seen I was pretty busy. So a search engine that has an API wouldn't be hard to maintain :).

> The links are rendered and start with "/imgres?imgurl" when you right click the item. So we don't have to actually load the full res image.

@smokes I gave it a try 2 weeks ago and I noticed that the complete URLs are loaded once you hover/click the item. So what could potentially work:

  1. Load page
  2. Open next thumbnail
  3. Right click item
  4. Fetch url
  5. Continue to 2 until there are no more items
  6. Scroll if possible or exit
  7. Repeat 3

smokes commented 4 years ago

Hey, so after a while I made a script that actually works. One thing to keep in mind if you're going to implement step number 3 (Right click item) is to not use puppeteer for sending right clicks. Just send a "mousedown" MouseEvent in the browser which is almost instant.

Here's a gist containing my approach: https://gist.github.com/smokes/f951a219e85058a051bf11ef8e72780d
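To illustrate the in-browser event described above, here is a minimal sketch of dispatching a synthetic "mousedown". It is not copied from the gist. `sendMouseDown` is a hypothetical helper, and `button: 2` (the secondary button) is an assumption about what counts as a right click here. The function is meant to run inside the page, for example from a function passed to `page.evaluate`.

```javascript
// Sketch: dispatch a synthetic right-button "mousedown" on a DOM element.
// Runs in the browser context, where MouseEvent is a global.
function sendMouseDown(element) {
  const event = new MouseEvent('mousedown', {
    bubbles: true,    // let delegated handlers higher in the DOM see the event
    cancelable: true,
    button: 2,        // assumption: 2 is the secondary (right) mouse button
  });
  // dispatchEvent runs the handlers synchronously, so there is no input
  // round trip through the automation driver.
  return element.dispatchEvent(event);
}
```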

Kaleidosium commented 4 years ago

I wrote my own solutions for Yahoo and Ecosia, I tried Yandex, but it seems to be a bit complex. https://github.com/IamRifki/alt-image-scraper

scbj commented 4 years ago

> I think it is a good idea to support multiple search engines. Yandex is built for it. It is a constant game of cat and mouse, and as you might have seen I was pretty busy. So a search engine that has an API wouldn't be hard to maintain :).

@pevers Great for multiple search engines 😉 On the other hand I am not sure that I understood. Yandex doesn't have a search API, so what did you mean?

> Hey, so after a while I made a script that actually works. One thing to keep in mind if you're going to implement step number 3 (Right click item) is to not use puppeteer for sending right clicks. Just send a "mousedown" MouseEvent in the browser which is almost instant.

@smokes Thank you, we can take inspiration from it! However, the limit option is a bit confusing here.

So far, so good, I will write a generic implementation (DRY), supporting Google (to respect the initial philosophy) while also implementing Ecosia, Yahoo and Yandex.

smokes commented 4 years ago

@scbj The limit option is the amount of scrolls to the bottom of the page. limit = 1 results in 100 images. Sorry for the confusing code 😄

scbj commented 4 years ago

@smokes No worries 😄 it's just that I think the limit parameter should refer to the expected number of results

smokes commented 4 years ago

Yeah, well, the way it's coded is that it scrolls to the bottom multiple times until it reaches the scroll limit or there are no more results, and then grabs the URLs. I don't know of any way to make it scrape simultaneously while scrolling.
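One possible shape for a loop where `limit` means the number of results rather than the number of scrolls is sketched below. It alternates collecting and scrolling rather than doing both at once. `scrapeUntil`, `collectVisible` and `scrollDown` are hypothetical names: the two callbacks stand in for whatever page-specific code gathers the currently available URLs and scrolls the page.

```javascript
// Sketch: alternate between collecting and scrolling until `limit` unique
// URLs are gathered or the page stops producing new results.
// `collectVisible` resolves to an array of URLs currently in the page;
// `scrollDown` resolves to false once there is nothing more to load.
async function scrapeUntil(limit, collectVisible, scrollDown) {
  const seen = new Set(); // a Set drops duplicates found on repeated passes
  while (true) {
    for (const url of await collectVisible()) {
      seen.add(url);
      if (seen.size >= limit) return [...seen];
    }
    if (!(await scrollDown())) return [...seen];
  }
}
```

With Puppeteer, both callbacks would typically wrap `page.evaluate` calls, which is why they are awaited here.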

scbj commented 4 years ago

@pevers @smokes @IamRifki I worked on it, does this API seem suitable to you?

Kaleidosium commented 4 years ago

Yeah, it looks fine by me.

pevers commented 4 years ago

> @pevers @smokes @IamRifki I worked on it, does this API seem suitable to you?

Nice! I like the idea of having an interface for different search engines. I do think that we need to move engine specific options as much as possible to the Scraper instance constructor.

So we call `new Scraper({ engine: 'yahoo', ...yahooSpecificOptions })`. That will construct a new YahooScraper, and the YahooScraper class will throw an error for invalid options when constructed.

The search interface should only have query and limit (not specific engine settings).

Otherwise it would become really difficult to remember what options you should use for what engine.
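A rough sketch of the interface described above, assuming a registry that maps engine names to classes. Only `Scraper`, `YahooScraper`, `engine`, `query` and `limit` come from the discussion. The `safeSearch` option and the `engines` map are invented here purely to show the validation, and the `search` body is a stub.

```javascript
// Hypothetical option list, used only to illustrate validation.
const YAHOO_OPTIONS = ['safeSearch'];

class YahooScraper {
  constructor(options = {}) {
    // Reject anything this engine does not understand, at construction time.
    for (const key of Object.keys(options)) {
      if (!YAHOO_OPTIONS.includes(key)) {
        throw new Error(`Invalid option for yahoo: ${key}`);
      }
    }
    this.options = options;
  }

  // The search interface takes only the query and the limit.
  async search(query, limit) {
    return []; // engine-specific scraping would go here
  }
}

// Registry of supported engines (a Map avoids inherited keys like "toString").
const engines = new Map([['yahoo', YahooScraper]]);

class Scraper {
  constructor({ engine, ...engineOptions }) {
    const Engine = engines.get(engine);
    if (!Engine) throw new Error(`Unknown engine: ${engine}`);
    // Returning an object from a constructor makes `new Scraper(...)` yield it.
    return new Engine(engineOptions);
  }
}
```

With this shape, `new Scraper({ engine: 'yahoo', safeSearch: true }).search('cats', 20)` keeps every engine-specific setting out of the `search` call.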

I'm going to have a look at @smokes's implementation right now.

pevers commented 4 years ago

I have committed a fix. Let me know if it is still broken because it might differ per platform and internet speed.

letoribo commented 4 years ago

about an hour ago the results array became empty again

pevers commented 4 years ago

Thanks @letoribo for reporting. I think it is fixed in: https://github.com/pevers/images-scraper/pull/50

letoribo commented 4 years ago

yes, i can confirm, thank you. my app does work now: https://spaces3d.herokuapp.com/

letoribo commented 4 years ago

the results array is sometimes still empty. fixed this by switching to page.waitForSelector:

https://github.com/letoribo/images-scraper/commit/68737b3167ad4f46e9340b093a6218699e1201a2
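For readers unfamiliar with the call, a minimal sketch of waiting for results with Puppeteer's `page.waitForSelector` instead of a fixed delay. This is not the code from the linked commit: `waitForResults` is a hypothetical wrapper, and `resultSelector` is a placeholder for whatever selector matches a result element.

```javascript
// Sketch: wait until at least one result element exists before scraping.
// `page` is a Puppeteer Page; `resultSelector` is a placeholder selector.
async function waitForResults(page, resultSelector, timeoutMs = 10000) {
  try {
    // Resolves as soon as a matching element is present in the DOM.
    await page.waitForSelector(resultSelector, { timeout: timeoutMs });
    return true;
  } catch (error) {
    // Puppeteer rejects if nothing matches within the timeout.
    return false;
  }
}
```

Waiting on a selector adapts to slow connections, which may be why a fixed delay sometimes produced an empty array.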

pevers commented 4 years ago

> the results array is sometimes still empty. fixed this by switching to page.waitForSelector:
>
> letoribo@68737b3

Thanks! Can you open a pull request? Then I will merge it.