medialab / sandcrawler

sandcrawler.js - the server-side scraping companion.
http://medialab.github.io/sandcrawler/
GNU Lesser General Public License v3.0

PhantomJS library integration - phridge/node-phantomjs #187

Open moshewe opened 8 years ago

moshewe commented 8 years ago

I tried to follow how the library works with the phantomJS process, and I got lost. Are you using any library to do that?

Yomguithereal commented 8 years ago

I am using bothan to do so and bothan uses the phantomjs dep itself.

moshewe commented 8 years ago

OK, now I understand what's going on. I really want to use this library, but I'm a little hesitant: not using a common phantom-nodejs bridge is a hard decision to defend when explaining this to our CTO. How does bothan differ from the libs mentioned above?

moshewe commented 8 years ago

+1 for the Star Wars reference, btw :)

Yomguithereal commented 8 years ago

I don't use a bridge such as those you mention because they do not let me use phantomjs the way I really need to. Bothan provides low-level access to the phantomjs child process, so that you can really script for phantomjs and not for node. Phantomjs has many issues, such as memory leaks, that I wouldn't be able to contain (as much as possible) by using other, higher-level bridges.

Yomguithereal commented 8 years ago

But keep in mind that all the code here is quite experimental and will be rewritten soon enough. One of the major problems with phantomjs is that it does not scale well: you constantly need to kill and respawn the phantomjs processes to avoid serious leaks, which are inherent to phantomjs (less so with the 2.1 version, but still).

moshewe commented 8 years ago

I totally get you on that. I found myself switching from XPaths to CSS selectors because the XPath queries would leak most of the time... I assume managing a pool of phantom spawns might do the trick, using each spawn for two or three pages or so.
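A minimal sketch of that pool idea, assuming nothing about sandcrawler's or bothan's actual API: `RecyclingPool`, `spawn`, `kill`, `size` and `maxUses` are all hypothetical names introduced here for illustration. Each worker is handed out for one page at a time and is killed once it has served `maxUses` pages, so a leaky process never lives long. The workers below are plain objects standing in for phantomjs child processes.

```javascript
// Hypothetical sketch: a pool that retires each worker after `maxUses` pages.
// `spawn` and `kill` are supplied by the caller; in a real setup they might
// start and stop a phantomjs child process (an assumption, not shown here).
class RecyclingPool {
  constructor({ spawn, kill, size = 2, maxUses = 3 }) {
    this.spawn = spawn;     // creates a fresh worker
    this.kill = kill;       // disposes of a worn-out worker
    this.size = size;       // maximum number of live workers
    this.maxUses = maxUses; // pages served before a worker is retired
    this.idle = [];         // slots waiting to be reused
    this.busy = 0;          // slots currently checked out
  }

  // Returns a slot ({ worker, uses }) or null when the pool is exhausted.
  acquire() {
    if (this.idle.length > 0) {
      this.busy += 1;
      return this.idle.pop();
    }
    if (this.busy < this.size) {
      this.busy += 1;
      return { worker: this.spawn(), uses: 0 };
    }
    return null;
  }

  // Hands a slot back; retires the worker once it has reached maxUses.
  release(slot) {
    this.busy -= 1;
    slot.uses += 1;
    if (slot.uses >= this.maxUses) {
      this.kill(slot.worker);
    } else {
      this.idle.push(slot);
    }
  }
}

// Usage with fake workers: one worker at a time, retired every two pages.
let nextId = 0;
const killed = [];
const pool = new RecyclingPool({
  spawn: () => ({ id: nextId++ }),
  kill: (worker) => killed.push(worker.id),
  size: 1,
  maxUses: 2,
});

for (let page = 0; page < 5; page++) {
  const slot = pool.acquire();
  // ... scrape one page with slot.worker here ...
  pool.release(slot);
}
```

Counting uses per worker keeps the recycling policy in one place; the scraping code only ever acquires and releases a slot.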

Please note the original phantomjs dep package is deprecated and has moved to something-prebuilt.

Yomguithereal commented 8 years ago

> Please note the original phantomjs dep package is deprecated and has moved to something-prebuilt.

Yup. I just need time to rework all of this soon.

Yomguithereal commented 8 years ago

I will also add an electron engine.

moshewe commented 8 years ago

Never heard of Electron before, thanks! Looks interesting!

Yomguithereal commented 8 years ago

There is also the jsdom option that can work for some simple cases.