ruipgil / scraperjs

A complete and versatile web scraper.
MIT License

Dynamic example fails on OpenShift instance #41

Closed rvernica closed 9 years ago

rvernica commented 9 years ago

I am using scraperjs@0.3.4 on an OpenShift instance. I am trying the examples on the README page. The static example works, but the dynamic one fails with a strange error. Any hints?

I suspect it is due to the restricted environment on OpenShift instances, and I would like to know the cause so I can try to fix it.

> cat > static.js
var scraperjs = require('scraperjs');
scraperjs.StaticScraper.create('https://news.ycombinator.com/')
    .scrape(function($) {
        return $(".title a").map(function() {
            return $(this).text();
        }).get();
    }, function(news) {
        console.log(news);
    })
> node static.js 
[ 'Show HN: My SSH server knows who you are',
  'Show HN: JAWS – A JavaScript and AWS Stack',
  'Federal Judge Strikes Down Idaho ‘Ag-Gag Law’',
...
> cat > dynamic.js
var scraperjs = require('scraperjs');
scraperjs.DynamicScraper.create('https://news.ycombinator.com/')
    .scrape(function() {
        return $(".title a").map(function() {
            return $(this).text();
        }).get();
    }, function(news) {
        console.log(news);
    })
> node dynamic.js 

events.js:72
        throw er; // Unhandled 'error' event
              ^
Error: listen EACCES
    at errnoException (net.js:905:11)
    at Server._listen2 (net.js:1024:19)
    at listen (net.js:1065:10)
    at net.js:1147:9
    at asyncCallback (dns.js:68:16)
    at Object.onanswer [as oncomplete] (dns.js:121:9)
rvernica commented 9 years ago

It seems that scraperjs tries to open a port on the host. When I run the dynamic example on a different host I capture this with netstat:

tcp        0      0 127.0.0.1:45873         0.0.0.0:*               LISTEN      355/node

The port is a random one, which a user account should normally be able to open, but I guess this is not allowed on OpenShift instances. Could the library achieve the required functionality without opening a port?
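
To check whether a host allows listening at all, independent of scraperjs, something like the following might help. It is only a sketch using Node's built-in `net` module; `probeListen` is a made-up helper name, not part of scraperjs or Phantom. Attaching an `'error'` listener makes the failure arrive as a value (for example `EACCES`) instead of the unhandled `'error'` event that crashes the process in the trace above.

```javascript
var net = require('net');

// Hypothetical helper: try to listen on host:port and report the outcome
// through a callback, so a bind failure does not become an unhandled
// 'error' event.
function probeListen(port, host, callback) {
    var server = net.createServer();
    server.once('error', function (err) {
        // e.g. err.code === 'EACCES' when the environment forbids the bind
        callback(err);
    });
    server.listen(port, host, function () {
        var boundPort = server.address().port;
        // Release the port again before reporting success.
        server.close(function () {
            callback(null, boundPort);
        });
    });
}

// Port 0 asks the OS for any free port, similar to the random port seen in netstat.
probeListen(0, '127.0.0.1', function (err, port) {
    if (err) {
        console.log('cannot listen:', err.code);
    } else {
        console.log('listening worked on port', port);
    }
});
```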

rvernica commented 9 years ago

Actually, it seems that ports can be opened, but only within this range [1]:

It is possible to bind to the internal IP with port range: 15000 - 35530.
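
A small sketch of how that range could be enforced in code, assuming the bounds quoted above are inclusive. The helper names `isAllowedPort` and `randomAllowedPort` are hypothetical and only for illustration.

```javascript
// Range quoted above for binding on the OpenShift internal IP (assumed inclusive).
var MIN_PORT = 15000;
var MAX_PORT = 35530;

// Hypothetical helper: true if `port` is an integer inside the allowed range.
function isAllowedPort(port) {
    return typeof port === 'number' && port % 1 === 0 &&
        port >= MIN_PORT && port <= MAX_PORT;
}

// Hypothetical helper: pick a random port inside the allowed range.
function randomAllowedPort() {
    return MIN_PORT + Math.floor(Math.random() * (MAX_PORT - MIN_PORT + 1));
}

console.log(isAllowedPort(29999)); // true
console.log(isAllowedPort(45873)); // false (the random port seen in netstat)
```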

rvernica commented 9 years ago

I think I fixed this. DynamicScraper uses Phantom, which accepts an options argument where a port can be specified. I added a matching options argument to DynamicScraper. For example:

scraperjs.DynamicScraper.create(url, {port: 29999})

See #42 for a pull request.

rvernica commented 9 years ago

For OpenShift, since it only allows opening ports on the internal IPs [1], a complete example is:

scraperjs.DynamicScraper.create(url, {
    port: 29999,
    hostname: process.env.OPENSHIFT_NODEJS_IP || '127.0.0.1'
})
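
To keep the same script usable both on OpenShift and on a local machine, the options object could be built by a small function. This is just a sketch: `dynamicScraperOptions` is a hypothetical name, and it assumes the options argument from #42 is available in the installed scraperjs.

```javascript
// Hypothetical helper: build the options object for DynamicScraper.create.
// `env` is normally process.env; passing it in makes the fallback easy to check.
function dynamicScraperOptions(env, port) {
    return {
        port: port,
        // Use the OpenShift internal IP when set, otherwise fall back to loopback.
        hostname: env.OPENSHIFT_NODEJS_IP || '127.0.0.1'
    };
}

var options = dynamicScraperOptions(process.env, 29999);
console.log(options);
// Assuming the patched scraperjs:
// scraperjs.DynamicScraper.create(url, options)
```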
ruipgil commented 9 years ago

I'm glad you were able to solve the issue. It's related to node-phantom: Node and PhantomJS need a network connection to communicate, hence the need to define a port.