Closed Kaleidosium closed 4 years ago
Recently the structure returned by Google changed. I need to spend some time to adjust the scraper accordingly. Sorry @IamRifki!
Ah, I see.
Google removed the metadata. The current image URLs are in a JS script. The link is only added to the DOM after clicking a thumbnail to open the preview.
@yazer79 If that's the case, is it still possible to scrape the images?
It is still possible to scrape the images, but unfortunately there will be no metadata attached to them. I also think that the scraper will become a lot slower. I'll give it a shot tonight.
@pevers Any luck?
Not yet. I think I would have to adjust it to click all images after the first page.
@pevers It's been 13 Days, have you found a solution?
I have the same problem with my C# crawler; I still haven't figured out how to retrieve Google Images efficiently... It seems to be done on purpose: the biggest crawler doesn't want us to crawl its content ^^
@scbj yes sure, but the links are still rendered when the item is clicked on. I think that's where we should start
@yazer79 Yes indeed, it works. I used this technique to recover album covers (square images). It's just slower, but it's a solution.
```csharp
private static async Task CrawlImages()
{
    var images = browser.Document.GetElementsByTagName("img").ToList()
        .Where(el => el.GetAttribute("alt").StartsWith("Résultat de recherche d'images"))
        .Take(50)
        .ToList();
    var covers = new List
```
Any chance of someone doing this for JS?
@IamRifki Before that, I think the metadata retrieved so far needs to be reviewed. For example, I did not find the MIME type, nor the original dimensions (`ow` and `oh` still exist in the `data-*` attributes but no longer correspond to the original size) 🤷♂️
Honestly, I think we would have better luck with Yandex images.
Yes we could; Yandex exposes all the metadata we want:
```json
{
  "serp-item": {
    "reqid": "1584648168944796-1335806668460957398360992-man2-6155-IMG",
    "freshness": "normal",
    "preview": [
      {
        "url": "https://www.pullrequest.com/blog/github-code-review-service/images/github-logo_hub2899c31b6ca7aed8d6a218f0e752fe4_46649_1200x1200_fill_box_center_2.png",
        "fileSizeInBytes": 49971,
        "w": 1200,
        "h": 1200
      },
      {
        "url": "https://www.pullrequest.com/blog/github-code-review-service/images/github-logo_hub2899c31b6ca7aed8d6a218f0e752fe4_46649_1200x1200_fill_box_center_2.png",
        "fileSizeInBytes": 49971,
        "w": 1200,
        "h": 1200
      }
    ],
    "dups": [
      {
        "url": "https://pbs.twimg.com/media/EGim5NPWoAA0DSe.jpg",
        "fileSizeInBytes": 40331,
        "w": 1200,
        "h": 1200
      },
      {
        "url": "https://encodedbicoding.com/wp-content/uploads/2019/08/github-logo_hub2899c31b6ca7aed8d6a218f0e752fe4_46649_1200x1200_fill_box_center_2-1024x1024.png",
        "fileSizeInBytes": 62216,
        "w": 1024,
        "h": 1024
      },
      {
        "url": "https://encodedbicoding.com/wp-content/uploads/2019/08/github-logo_hub2899c31b6ca7aed8d6a218f0e752fe4_46649_1200x1200_fill_box_center_2-768x768.png",
        "fileSizeInBytes": 42610,
        "w": 768,
        "h": 768
      },
      {
        "url": "https://news.fitnyc.edu/wp-content/uploads/2019/07/github-logo_hub2899c31b6ca7aed8d6a218f0e752fe4_46649_1200x1200_fill_box_center_2-400x400.png",
        "fileSizeInBytes": 21964,
        "w": 400,
        "h": 400
      },
      {
        "url": "https://thumbnailer.mixcloud.com/unsafe/60x60/extaudio/7/8/8/4/e6df-cf46-4215-8652-01a224c3e85c",
        "fileSizeInBytes": 1340,
        "w": 60,
        "h": 60
      }
    ],
    "thumb": {
      "url": "//im0-tub-com.yandex.net/i?id=b73623df2718023258a09930a6411585&n=13",
      "size": { "width": 320, "height": 320 }
    },
    "snippet": {
      "title": "Code Review as a Service on GitHub",
      "hasTitle": true,
      "text": "...from professional code reviewers as a part of their GitHub workflow.",
      "url": "https://www.pullrequest.com/blog/github-code-review-service/",
      "domain": "Pullrequest.com",
      "redirUrl": "http://yandex.com/clck/jsredir?from=yandex.com%3Bimages%2Fsearch%3Bimages%3B%3B&text=&etext=8851.JG6SBOJ42OD7GICSCkcPalAeqijixpkYseMHfW9D70Q.cecaf24f77bc5286274de356bb009abce43d6f0b&uuid=&state=tid_Wvm4RM28ca_MiO4Ne9osTPtpHS9wicjEF5X7fRziVPIHCd9FyQ,,&data=UlNrNmk5WktYejY4cHFySjRXSWhXT2dvRXFMdGp6eUllTUc0NUtqdmV3QXhGemh0ODUtOWkxQTJaVWJrNTFuUGZDWjBhNmdZM056d3pfd0d4X1RtRmcydEM5WmVSa21CSENNbDFRemxhbVAxdTN0NllOcUVMd0xrbmh5eWsyNzdmZVk0QzFmdV94cGNrdlFoOUY5WFh5ZDZTdTA3NW5hag,,&sign=e29933e7f29c5119a214e249ed03f741&keyno=0&b64e=2&l10n=en"
    },
    "detail_url": "/images/search?pos=0&from=tabbar&img_url=https%3A%2F%2Fsoftmap.ru%2Fupload%2Fuf%2Fad5%2Fad574c14aa17a899fd3abbf3cbbec62f.png&text=github&rpt=simage",
    "img_href": "https://www.pullrequest.com/blog/github-code-review-service/images/github-logo_hub2899c31b6ca7aed8d6a218f0e752fe4_46649_1200x1200_fill_box_center_2.png",
    "useProxy": false,
    "pos": 0,
    "id": "8e8fb67d232b769dd33da61fa1a5ae22",
    "rimId": "1159c8023b033db98b17681eab3530ed",
    "docid": "Z958BDDFDE12862A4",
    "greenUrlCounterPath": "8.228.471.241.13.141",
    "counterPath": "thumb/normal"
  }
}
```
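As an illustration of what this metadata buys us, here is a small sketch that picks the largest available copy of an image from the `dups` array (field names are taken from the payload above; the helper name is my own):

```js
// Given the "dups" array from a Yandex serp-item, return the entry with the
// largest pixel area, breaking ties by file size.
function largestDup (dups) {
  return dups.reduce((best, cur) => {
    const bestArea = best.w * best.h
    const curArea = cur.w * cur.h
    if (curArea > bestArea) return cur
    if (curArea === bestArea && cur.fileSizeInBytes > best.fileSizeInBytes) return cur
    return best
  })
}
```

For the payload above this would select the 1200x1200 Twitter copy over the smaller resized duplicates.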
But Yandex will not suit everyone. In my opinion, it would be interesting to support both Google and Yandex, selectable by the user. The package would be aptly named for that!
@pevers Can we do that? What are your thoughts?
@scbj I think Peter wants to focus on Google.
@scbj I'm trying to make a yandex version, but I'm stuck at this:
```js
setInterval(() => {
  // See if we have any results
  $("data-bem").each((index, element) => {
    // Check if we've reached the limit
    if (results.length >= limit) {
      return resolve(results);
    }
    const meta = JSON.parse($(element).find("serp-item"));
    const item = {
      width: meta.dups.w,
      height: meta.dups.h,
      url: meta.dups.url,
      thumb_url: meta.thumb.url,
      thumb_width: meta.thumb.size.width,
      thumb_height: meta.thumb.size.height,
    };
    if (!results.filter(result => result.url === item.url).length) {
      results.push(item);
    }
  });
  // Check if we've reached the bottom, if yes, exit
  if ($(window).scrollTop() + $(window).height() == $(document).height()) {
    return resolve(results);
  }
  // Scroll
  $("html, body").animate({ scrollTop: $(document).height() }, 1000);
}, 1000);
});
```
It's incorrect as it doesn't return any images, is there anything I should change?
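For what it's worth, two things look off in the snippet above: `$("data-bem")` selects a (nonexistent) `<data-bem>` tag rather than elements carrying a `data-bem` attribute, and `preview`/`dups` are arrays, so `meta.dups.w` is undefined. A minimal sketch of the attribute parsing, assuming the JSON shape @scbj posted (plain JS, no jQuery; in the page you would obtain `attrValue` via `element.getAttribute('data-bem')` on each `[data-bem]` element):

```js
// Parse one Yandex result from the JSON stored in its data-bem attribute.
function parseSerpItem (attrValue) {
  const meta = JSON.parse(attrValue)['serp-item']
  const best = meta.preview[0] // preview is an array, not an object
  return {
    url: best.url,
    width: best.w,
    height: best.h,
    thumb_url: meta.thumb.url,
    thumb_width: meta.thumb.size.width,
    thumb_height: meta.thumb.size.height
  }
}
```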
@IamRifki To make it work with Google you can add this at the beginning of the file:
```js
function contentScript (limit) {
  const delay = ms => new Promise(resolve => {
    setTimeout(resolve, ms)
  })
  return new Promise(async resolve => {
    const results = []
    const elements = document.querySelectorAll('a[jsaction="click:J9iaEb;"]')
    for (const element of elements) {
      try {
        element.click()
        await delay(120)
        const href = element.getAttribute('href').slice(15).split('&')[0]
        const height = +element.parentElement.getAttribute('data-oh')
        const width = +element.parentElement.getAttribute('data-ow')
        results.push({
          url: unescape(decodeURI(href)),
          height,
          width
        })
      } catch (error) {
        results.push({ error: error.toString() })
      }
      if (results.length > limit) {
        break
      }
    }
    resolve(results)
  })
}
```
And replace the `page.evaluate(...)` call with:

```js
const results = await page.evaluate(contentScript, self.limit);
```
This code can return a maximum of 100 elements. If you want me to help you on a project with Yandex, I think you should create another issue or a new repository so we don't pollute this one 😉 At least until @pevers has said what he intends to do regarding the search engine used.
> @scbj yes sure, but the links are still rendered when the item is clicked on. I think that's where we should start
The links are rendered and start with "/imgres?imgurl" when you right click the item. So we don't have to actually load the full res image.
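Since the anchors start with `/imgres?imgurl=...`, pulling out the full-resolution URL is just query-string parsing. A sketch using Node's WHATWG `URL` (the base origin is only needed so relative hrefs parse; the sample href is made up):

```js
// Given the href of a result anchor (e.g. "/imgres?imgurl=...&imgrefurl=..."),
// return the decoded full-resolution image URL without loading the image.
function extractImageUrl (href) {
  const url = new URL(href, 'https://www.google.com')
  return url.searchParams.get('imgurl') // searchParams decodes percent-encoding
}
```

This avoids the `slice(15).split('&')` string surgery in the snippet above and survives parameter reordering.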
> This code can return a maximum of 100 elements. If you want me to help you on a project with Yandex I think you should create another issue or a new repository to not pollute this issue 😉
@scbj I think it is a good idea to support multiple search engines. Yandex is built for it. It is a constant game of cat and mouse, and as you might have seen I was pretty busy. So a search engine that has an API wouldn't be hard to maintain :).
> The links are rendered and start with "/imgres?imgurl" when you right click the item. So we don't have to actually load the full res image.
@smokes I gave it a try 2 weeks ago and I noticed that the complete URLs are loaded once you hover/click the item. So what could potentially work:
Hey, so after a while I made a script that actually works. One thing to keep in mind if you're going to implement step number 3 (Right click item) is to not use puppeteer for sending right clicks. Just send a "mousedown" MouseEvent in the browser which is almost instant.
Here's a gist containing my approach: https://gist.github.com/smokes/f951a219e85058a051bf11ef8e72780d
I wrote my own solutions for Yahoo and Ecosia, I tried Yandex, but it seems to be a bit complex. https://github.com/IamRifki/alt-image-scraper
> I think it is a good idea to support multiple search engines. Yandex is built for it. So a search engine that has an API wouldn't be hard to maintain :).
@pevers Great for multiple search engines 😉 On the other hand I am not sure that I understood. Yandex doesn't have a search API, so what did you mean?
> Hey, so after a while I made a script that actually works. One thing to keep in mind if you're going to implement step number 3 (Right click item) is to not use puppeteer for sending right clicks. Just send a "mousedown" MouseEvent in the browser which is almost instant.
@smokes Thank you, we can take inspiration from it! However, the `limit` option is a bit confusing here.
So far, so good. I will write a generic implementation (DRY), supporting Google (to respect the initial philosophy) while also implementing Ecosia, Yahoo, and Yandex.
@scbj The `limit` option is the number of scrolls to the bottom of the page; `limit = 1` results in 100 images. Sorry for the confusing code 😄
@smokes No worries 😄 it's just that I think the `limit` parameter should refer to the expected number of results.
Yeah, well the way it's coded is that it scrolls to the bottom multiple times until it reaches the scroll limit or there are no more results, and then grabs the URLs. I don't know of any way to make it scrape simultaneously while scrolling.
@pevers @smokes @IamRifki I worked on it, does this API seem suitable to you?
Yeah, it looks fine by me.
> @pevers @smokes @IamRifki I worked on it, does this API seem suitable to you?
Nice! I like the idea of having an interface for different search engines. I do think that we need to move engine-specific options as much as possible to the `Scraper` instance constructor. So we call `new Scraper({ engine: 'yahoo', ...yahooSpecificOptions })`. That will construct a new `YahooScraper`, and the `YahooScraper` class will throw an error for invalid options when constructed. The `search` interface should only have `query` and `limit` (not engine-specific settings). Otherwise it would become really difficult to remember which options you should use for which engine.
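The shape being proposed could look something like this (a sketch only; class names, the option list, and the `engines` map are illustrative, not the final API):

```js
// Hypothetical engine class: engine-specific options are validated at
// construction time, and search() only takes query and limit.
class YahooScraper {
  constructor (options = {}) {
    const allowed = ['headless'] // illustrative option whitelist
    for (const key of Object.keys(options)) {
      if (!allowed.includes(key)) {
        throw new Error(`Invalid option for yahoo engine: ${key}`)
      }
    }
    this.options = options
  }

  async search (query, limit) {
    // ...engine-specific scraping would go here...
    return []
  }
}

const engines = { yahoo: YahooScraper }

// Factory that dispatches on the `engine` option and forwards the rest.
function Scraper ({ engine, ...engineOptions }) {
  const Engine = engines[engine]
  if (!Engine) throw new Error(`Unknown engine: ${engine}`)
  return new Engine(engineOptions)
}
```

Usage would then be `const scraper = new Scraper({ engine: 'yahoo', headless: true })` followed by `await scraper.search('github', 10)`, with invalid engines or options failing fast at construction.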
I'm going to have a look at @smokes's implementation right now.
I have committed a fix. Let me know if it is still broken because it might differ per platform and internet speed.
About an hour ago the results array became empty again.
Thanks @letoribo for reporting. I think it is fixed in: https://github.com/pevers/images-scraper/pull/50
Yes, I can confirm, thank you. My app works now: https://spaces3d.herokuapp.com/
The results array is sometimes still empty. I fixed this by switching to `page.waitForSelector`: https://github.com/letoribo/images-scraper/commit/68737b3167ad4f46e9340b093a6218699e1201a2
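The race that `page.waitForSelector` removes can be seen in isolation: a fixed delay either wastes time or fires before the results exist, while polling until a condition holds (which is what Puppeteer does internally for selectors) is robust. A generic sketch of that pattern, with a hypothetical `waitFor` helper:

```js
// Poll predicate() every `interval` ms until it returns truthy,
// or reject after `timeout` ms -- the same idea as page.waitForSelector.
function waitFor (predicate, { timeout = 5000, interval = 50 } = {}) {
  return new Promise((resolve, reject) => {
    const start = Date.now()
    const timer = setInterval(() => {
      if (predicate()) {
        clearInterval(timer)
        resolve()
      } else if (Date.now() - start > timeout) {
        clearInterval(timer)
        reject(new Error('Timed out waiting for condition'))
      }
    }, interval)
  })
}
```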
> the results array is sometimes still empty. fixed this by switching to page.waitForSelector
Thanks! Can you open a pull request? Then I will merge it.
I went through a debugging hell of things, until I realized that even your example will only return an empty array, at least on my computer. With a fresh install of Node 12, a new project initialized with `npm init`, and just your dependency and your example code, I was not able to get ANY result whatsoever: `results []`.