openaustralia / morph

Take the hassle out of web scraping
https://morph.io
GNU Affero General Public License v3.0
461 stars 74 forks

PhantomJS has been deprecated #1181

Closed jasonchanhku closed 6 years ago

jasonchanhku commented 6 years ago

Hi guys,

It seems PhantomJS has been deprecated, so I can no longer use Selenium with it in my scraper.py script. Would Chromedriver support or another alternative be considered? Would appreciate any feedback. Thanks.

/app/.heroku/python/lib/python3.6/site-packages/selenium/webdriver/phantomjs/webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
  warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '

mlandauer commented 6 years ago

Yes, I agree we should move over to something like headless Chrome on morph.io now that PhantomJS is officially being archived.

I think whatever the new thing is should be:

Are there any other things to consider?

dominikwilkowski commented 6 years ago

For archival purposes, this is the PhantomJS announcement: ariya/phantomjs#15344

I'm not sure there is any software out there that supports all those languages, and I'm also not sure that would be a good idea in the first place. I don't know enough about morph just yet to be super helpful, but in the Node community, Puppeteer, which drives headless Chrome, and SlimerJS, which runs Gecko, are the two that I would look into. For a scraper, those two are more than enough to get you started with SPAs etc.

Here is just an example of what it looks like to spin up Puppeteer:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  const height = await page.evaluate( () => document.documentElement.scrollHeight );
  // etc
  await browser.close();
})();

mlandauer commented 6 years ago

@jasonchanhku @dominikwilkowski thanks for the poke to start moving away from PhantomJS.

morph.io now supports headless Google Chrome, which you can use either directly or via webdriver. The documentation is super-sparse right now. See https://morph.io/documentation/scraping_javascript_sites. If you would be interested in helping out with the documentation, that would be amazing.
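
As a rough sketch of the webdriver route from Python (the language of the original scraper.py), something like the following might work. It assumes the selenium package (a version that accepts the `options=` keyword) and a chromedriver binary are available on the path. The helper names `headless_chrome_arguments` and `fetch_title`, and the specific flags, are illustrative assumptions, not taken from the morph.io documentation.

```python
def headless_chrome_arguments():
    """Return Chrome flags often used when no display is available.

    Assumption: --no-sandbox and --disable-gpu are commonly needed inside
    containers; whether morph.io requires them is not confirmed here.
    """
    return ["--headless", "--no-sandbox", "--disable-gpu"]


def fetch_title(url):
    """Open url in headless Chrome through Selenium and return the page title."""
    # Imported lazily so this module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in headless_chrome_arguments():
        options.add_argument(argument)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.title
    finally:
        # Always shut the browser down, even if the page load fails.
        driver.quit()
```

Calling `fetch_title("https://example.com")` from scraper.py would then replace the old `webdriver.PhantomJS()` usage, and the deprecation warning should no longer appear.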

@dominikwilkowski perhaps you would consider writing some documentation (and maybe an example scraper) for nodejs that uses puppeteer?

jasonchanhku commented 6 years ago

Thanks! You guys are so efficient! Cheers and happy Easter.