medialab / artoo

artoo.js - the client-side scraping companion.
http://medialab.github.io/artoo/
MIT License
1.1k stars, 93 forks

Image URLs change when working in Node (works in browser console) #270

Closed: jasan-s closed this issue 7 years ago

jasan-s commented 7 years ago

For some reason, the scraped image URLs change when working with cheerio in Node, i.e. the original image URL is:

"https://images-na.ssl-images-amazon.com/images/M/MV5BNWU4NmY3MTMtMTBmMi00NjFjLTkwMmItYWZhZWUwNDg5M2ExXkEyXkFqcGdeQXVyNDUyOTg3Njg@._V1_SX300.jpg"

However, after scraping, the URL turns into this one:

"http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/156x231/tv-3797070466._CB522736147_.png@._V1_SX300.jpg"

If I scrape it in the Chrome browser console using the artoo.js bookmarklet, the URL stays the same as the original. Why does it change when I use it in Node? Any suggestions?

I also posted on Stack Overflow.

Update: I think I found the issue but not the solution. It seems the scraper runs before the correct images have loaded on the page; the changed URL is just the placeholder image. How can I wait until the entire page loads?

Yomguithereal commented 7 years ago

Well, it depends on how you fetch the page. Are you using plain HTTP requests, or are you emulating a browser with something like PhantomJS, Puppeteer, or Electron?

jasan-s commented 7 years ago

@Yomguithereal I am just getting started with web scraping, and this is the first tool I have used. I'm currently using request to make a GET request. I suppose I need to use a tool that emulates a browser. Can you recommend one?

Yomguithereal commented 7 years ago

For a long time, the tool to use was PhantomJS. But since we now have headless Chrome, I would recommend trying Puppeteer.
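
A minimal sketch of that suggestion, assuming Puppeteer is installed (`npm install puppeteer`) and that the page has swapped in the real images by the time the network goes idle. The function name `scrapeImageUrls` and the plain `img` selector are illustrative and not from this thread. `waitUntil: 'networkidle0'` is only a heuristic, so a page that lazy-loads images on scroll may need extra waiting.

```javascript
// Hypothetical helper: load a page in headless Chrome and collect image URLs
// after client-side scripts have had a chance to replace placeholder images.
// The second parameter defaults to the real puppeteer module; it is only
// required when no replacement is passed in.
async function scrapeImageUrls(pageUrl, puppeteer = require('puppeteer')) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();

    // Resolve once there have been no network connections for 500 ms,
    // rather than as soon as the initial HTML has arrived.
    await page.goto(pageUrl, { waitUntil: 'networkidle0' });

    // This callback runs inside the page, so it sees the live DOM,
    // much like code pasted into the browser console.
    return await page.evaluate(() =>
      Array.from(document.querySelectorAll('img'), (img) => img.src)
    );
  } finally {
    // Always shut the browser down, even if navigation fails.
    await browser.close();
  }
}

// Example call (the URL is a placeholder, not from the thread):
// scrapeImageUrls('https://example.com/').then((urls) => console.log(urls));
```

Unlike a plain GET request parsed with cheerio, this runs the page's own scripts, which is why the URLs it returns should match what the browser console shows.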

jasan-s commented 7 years ago

@Yomguithereal I tried Puppeteer and it's awesome. Thanks for the recommendation. Now, how can I use artoo.js with Puppeteer? I tried installing it with npm install artoo-js, but it's not working. I also posted a separate issue, #271.