ruipgil / scraperjs

A complete and versatile web scraper.
MIT License

request@2.62.0 breaks DynamicScraper #43

Open brunofiorentino opened 8 years ago

brunofiorentino commented 8 years ago

Updating the request module from 2.61 to 2.62 (allowed by semver) breaks DynamicScraper with an 'Unhandled stream error in pipe.' error message. Besides my application code, it also breaks the simpler DynamicScraper Hacker News sample.
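To illustrate why semver allows this upgrade, here is a minimal sketch of a caret-range check. It assumes the dependency is declared with a caret range such as ^2.61.0 (the thread does not show the actual range in scraperjs's package.json), and the helper names parseVersion and satisfiesCaret are made up for this example. Real npm ranges are resolved by the semver package, which handles many more cases.

```javascript
// Simplified illustration only: handles plain "major.minor.patch" versions
// and caret ranges whose major version is 1 or higher.
function parseVersion(version) {
    var parts = version.split('.').map(Number);
    return { major: parts[0], minor: parts[1], patch: parts[2] };
}

// Returns true when `version` would satisfy the range `^base`.
function satisfiesCaret(base, version) {
    var b = parseVersion(base);
    var v = parseVersion(version);
    if (v.major !== b.major) {
        return false; // a caret range never crosses a major version
    }
    if (v.minor !== b.minor) {
        return v.minor > b.minor; // newer minor releases are accepted
    }
    return v.patch >= b.patch; // same minor: patch must not go backwards
}

console.log(satisfiesCaret('2.61.0', '2.62.0')); // true
console.log(satisfiesCaret('2.61.0', '3.0.0'));  // false
```

Under that assumption a fresh npm install is free to pick 2.62.0, which is why the breakage appears without any change to the application's own package.json.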

To get back an older, working module tree I first tried 'npm shrinkwrap', but it threw an error about phantomjs inner dependencies. So I removed ./node_modules/scraperjs/node_modules/, pinned request@2.61.0 in ./node_modules/scraperjs/package.json and ran npm install inside it. DynamicScraper started working again.
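A quick way to confirm which request version actually ended up installed after pinning is to read the nested package.json. This is only a sketch: it assumes the nested node_modules layout mentioned above (./node_modules/scraperjs/node_modules/), and installedVersion is a hypothetical helper, not part of scraperjs or npm.

```javascript
var fs = require('fs');
var path = require('path');

// Reads the version of `depName` installed directly under `moduleDir`.
// Returns null when the dependency is not installed at that level
// or its package.json cannot be read.
function installedVersion(moduleDir, depName) {
    var manifest = path.join(moduleDir, 'node_modules', depName, 'package.json');
    try {
        return JSON.parse(fs.readFileSync(manifest, 'utf8')).version;
    } catch (err) {
        return null;
    }
}

// Run from the application directory; prints null if nothing is found there.
console.log(installedVersion(path.join('node_modules', 'scraperjs'), 'request'));
```

If this prints 2.62.0 after the pin, npm install did not pick up the edited package.json.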

ruipgil commented 8 years ago

Do you have node 4.x? It's currently unsupported.

All unit tests pass with request 2.62.0 (without node 4.x).

brunofiorentino commented 8 years ago

I'm not using node 4.0. I'm running node 0.12.7, installed via nvm, and tried on both OSX 10.10 and Ubuntu 15.04. This is my OSX terminal running the Hacker News dynamic scraper sample with request 2.62:

brunao-ux1:scrapers brunao$ node -v
v0.12.7
brunao-ux1:scrapers brunao$ cat hn-sample.js
var scraperjs = require('scraperjs');
scraperjs.DynamicScraper.create('https://news.ycombinator.com/')
    .scrape(function() {
        return $(".title a").map(function() {
            return $(this).text();
        }).get();
    }, function(news) {
        console.log(news);
    })
brunao-ux1:scrapers brunao$ node hn-sample.js

stream.js:94
      throw er; // Unhandled stream error in pipe.
            ^
Error
    at Object.<anonymous> (/Users/brunao/cdng/f7e/pecaspara/scrapers/node_modules/scraperjs/src/ScraperError.js:32:26)
    at Module._compile (module.js:460:26)
    at Object.Module._extensions..js (module.js:478:10)
    at Module.load (module.js:355:32)
    at Function.Module._load (module.js:310:12)
    at Module.require (module.js:365:17)
    at require (module.js:384:17)
    at Object.<anonymous> (/Users/brunao/cdng/f7e/pecaspara/scrapers/node_modules/scraperjs/src/DynamicScraper.js:5:17)
    at Module._compile (module.js:460:26)
    at Object.Module._extensions..js (module.js:478:10)

brunofiorentino commented 8 years ago

Also, some DynamicScraper-related tests failed on both OSX and Ubuntu:

Running "clean:coverage" (clean) task
>> 1 path cleaned.

Running "jshint:all" (jshint) task
>> 16 files lint free.

Running "serve" task
Listening on port 3000

Running "exec:coverage" (exec) task



  ✓ ScraperError 
  AbstractScraper

    ✓ get 

    ✓ request 

    ✓ getStatusCode 

    ✓ getResponse 

    ✓ getBody 

    ✓ loadBody 

    ✓ scrape 

    ✓ close 

    ✓ clone 

  DynamicScraper

    ✓ .loadBody, .scrape, .close (1127ms)

    ✓ .clone 

    ✓ #startFactory, #closeFactory 
    #create

      ✓ with argument (1275ms)

      ✓ without argument (1094ms)
    .inject

      ✓ page not loaded 

      ✓ success (1092ms)

      ✓ fails (1087ms)

      ✓ fails jQuery (1073ms)

  Router

    ✓ get 

    ✓ request 

    ✓ otherwise 

    ✓ route 

    ✓ createStatic 

    ✓ createDynamic (1199ms)

    ✓ use 

    ✓ usage of params 
    #pathMatcher

      ✓ with string 

      ✓ with regular expression 
    on

      ✓ with path 

      ✓ with function 
    instantiation

      ✓ with firstMatch 

      ✓ without firstMatch 
    bad formatting

      ✓ get 

      ✓ request 

      ✓ createStatic 

      ✓ createDynamic 

      ✓ use 

  Scraper Promise
    with StaticScraper

      ✓ timeout (109ms)

      ✓ then 

      ✓ then 

      ✓ onError 

      ✓ error without onError 

      ✓ delay (109ms)

      ✓ request 

      ✓ done 

      ✓ _setChainParameter 

      ✓ _setPromises 

      ✓ clone 

      ✓ passing values between promises 
      onStatusCode

        ✓ with code 

        ✓ without code 
      scrape

        ✓ without extra arguments 

        ✓ without extra arguments 

        ✓ with only the scraping function 

        ✓ with error 
      _fire

        ✓ without error 

        ✓ with error 
      usage of utils

        ✓ stop() 

        ✓ scraper 

        ✓ params 
    with DynamicScraper
      with Factory

        ✓ timeout (1176ms)

        ✓ then (1086ms)

        ✓ then (1078ms)

        ✓ onError (1093ms)

        ✓ delay (1201ms)

        ✓ request (1080ms)

        ✓ done (1195ms)

        ✓ _setChainParameter 

        ✓ _setPromises 

        ✓ clone 

        ✓ passing values between promises (1745ms)
        onStatusCode

          ✓ with code (1081ms)

          ✓ without code (1086ms)
        scrape

          1) without extra arguments

          2) without extra arguments

          3) with only the scraping function

          ✓ with error (1089ms)
        _fire

          ✓ without error 

          ✓ with error 
        usage of utils

          ✓ stop() (1600ms)

          ✓ scraper (1087ms)

          ✓ params (1083ms)
      without Factory

        ✓ timeout (1189ms)

        ✓ then (1080ms)

        ✓ then (1091ms)

        ✓ onError (1085ms)

        ✓ delay (1188ms)

        ✓ request (1082ms)

        ✓ done (1083ms)

        ✓ _setChainParameter 

        ✓ _setPromises 

        ✓ clone 

        ✓ passing values between promises (1116ms)
        onStatusCode

          ✓ with code (1185ms)

          ✓ without code (1204ms)
        scrape

          ✓ without extra arguments (1299ms)

          4) without extra arguments

          ✓ with only the scraping function (1162ms)

          ✓ with error (1169ms)
        _fire

          ✓ without error 

          ✓ with error 
        usage of utils

          ✓ stop() (1259ms)

          ✓ scraper (1193ms)

          ✓ params (1501ms)

  StaticScraper

    ✓ .clone 
    #create

      ✓ with argument 

      ✓ without argument 
    .loadBody, .scrape, .close

      ✓ without errors 

      ✓ with errors 

  106 passing (49s)
>>   4 failing
>> 
>>   1) Scraper Promise with DynamicScraper with Factory scrape without extra arguments:
>>      Uncaught AssertionError: 10 == 9
>>       at /Users/brunao/cdng/f7e/scraperjs/test/ScraperPromise.js:130:12
>>       at /Users/brunao/cdng/f7e/scraperjs/src/ScraperPromise.js:9:2689
>>       at /Users/brunao/cdng/f7e/scraperjs/src/DynamicScraper.js:9:2390
>>       at Proto.apply (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/dnode/node_modules/dnode-protocol/index.js:123:13)
>>       at Proto.handle (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/dnode/node_modules/dnode-protocol/index.js:99:19)
>>       at D.dnode.handle (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/dnode/lib/dnode.js:140:21)
>>       at D.dnode.write (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/dnode/lib/dnode.js:128:22)
>>       at SockJSConnection.ondata (stream.js:51:26)
>>       at SockJSConnection.emit (events.js:107:17)
>>       at Session.didMessage (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/lib/transport.js:220:25)
>>       at WebSocketReceiver.didMessage (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/lib/trans-websocket.js:102:40)
>>       at /Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/lib/trans-websocket.js:75:22
>>       at null.<anonymous> (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/api/event_target.js:41:7)
>>       at Array.forEach (native)
>>       at EventTarget.dispatchEvent (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/api/event_target.js:40:33)
>>       at API.receive (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/api.js:30:10)
>>       at instance.parse (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/draft75_parser.js:56:26)
>>       at Draft76Parser.parse (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/draft76_parser.js:77:42)
>>       at Socket.<anonymous> (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket.js:72:33)
>>       at Socket.emit (events.js:107:17)
>>       at readableAddChunk (_stream_readable.js:163:16)
>>       at Socket.Readable.push (_stream_readable.js:126:10)
>>       at TCP.onread (net.js:538:20)
>> 
>>   2) Scraper Promise with DynamicScraper with Factory scrape without extra arguments:
>>      Uncaught AssertionError: 10 == 9
>>       at /Users/brunao/cdng/f7e/scraperjs/test/ScraperPromise.js:154:12
>>       at /Users/brunao/cdng/f7e/scraperjs/src/ScraperPromise.js:9:2689
>>       at /Users/brunao/cdng/f7e/scraperjs/src/DynamicScraper.js:9:2390
>>       at Proto.apply (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/dnode/node_modules/dnode-protocol/index.js:123:13)
>>       at Proto.handle (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/dnode/node_modules/dnode-protocol/index.js:99:19)
>>       at D.dnode.handle (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/dnode/lib/dnode.js:140:21)
>>       at D.dnode.write (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/dnode/lib/dnode.js:128:22)
>>       at SockJSConnection.ondata (stream.js:51:26)
>>       at SockJSConnection.emit (events.js:107:17)
>>       at Session.didMessage (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/lib/transport.js:220:25)
>>       at WebSocketReceiver.didMessage (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/lib/trans-websocket.js:102:40)
>>       at /Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/lib/trans-websocket.js:75:22
>>       at null.<anonymous> (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/api/event_target.js:41:7)
>>       at Array.forEach (native)
>>       at EventTarget.dispatchEvent (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/api/event_target.js:40:33)
>>       at API.receive (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/api.js:30:10)
>>       at instance.parse (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/draft75_parser.js:56:26)
>>       at Draft76Parser.parse (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/draft76_parser.js:77:42)
>>       at Socket.<anonymous> (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket.js:72:33)
>>       at Socket.emit (events.js:107:17)
>>       at readableAddChunk (_stream_readable.js:163:16)
>>       at Socket.Readable.push (_stream_readable.js:126:10)
>>       at TCP.onread (net.js:538:20)
>> 
>>   3) Scraper Promise with DynamicScraper with Factory scrape with only the scraping function:
>>      Uncaught AssertionError: 10 == 9
>>       at /Users/brunao/cdng/f7e/scraperjs/test/ScraperPromise.js:180:12
>>       at then (/Users/brunao/cdng/f7e/scraperjs/src/ScraperPromise.js:9:4281)
>>       at dispatcher (/Users/brunao/cdng/f7e/scraperjs/src/ScraperPromise.js:9:7483)
>>       at /Users/brunao/cdng/f7e/scraperjs/node_modules/async/lib/async.js:187:20
>>       at iterate (/Users/brunao/cdng/f7e/scraperjs/node_modules/async/lib/async.js:265:13)
>>       at /Users/brunao/cdng/f7e/scraperjs/node_modules/async/lib/async.js:277:29
>>       at /Users/brunao/cdng/f7e/scraperjs/node_modules/async/lib/async.js:44:16
>>       at done (/Users/brunao/cdng/f7e/scraperjs/src/ScraperPromise.js:9:7171)
>>       at /Users/brunao/cdng/f7e/scraperjs/src/ScraperPromise.js:9:2684
>>       at /Users/brunao/cdng/f7e/scraperjs/src/DynamicScraper.js:9:2390
>>       at Proto.apply (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/dnode/node_modules/dnode-protocol/index.js:123:13)
>>       at Proto.handle (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/dnode/node_modules/dnode-protocol/index.js:99:19)
>>       at D.dnode.handle (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/dnode/lib/dnode.js:140:21)
>>       at D.dnode.write (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/dnode/lib/dnode.js:128:22)
>>       at SockJSConnection.ondata (stream.js:51:26)
>>       at SockJSConnection.emit (events.js:107:17)
>>       at Session.didMessage (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/lib/transport.js:220:25)
>>       at WebSocketReceiver.didMessage (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/lib/trans-websocket.js:102:40)
>>       at /Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/lib/trans-websocket.js:75:22
>>       at null.<anonymous> (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/api/event_target.js:41:7)
>>       at Array.forEach (native)
>>       at EventTarget.dispatchEvent (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/api/event_target.js:40:33)
>>       at API.receive (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/api.js:30:10)
>>       at instance.parse (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/draft75_parser.js:56:26)
>>       at Draft76Parser.parse (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/draft76_parser.js:77:42)
>>       at Socket.<anonymous> (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket.js:72:33)
>>       at Socket.emit (events.js:107:17)
>>       at readableAddChunk (_stream_readable.js:163:16)
>>       at Socket.Readable.push (_stream_readable.js:126:10)
>>       at TCP.onread (net.js:538:20)
>> 
>>   4) Scraper Promise with DynamicScraper without Factory scrape without extra arguments:
>>      Uncaught AssertionError: 10 == 9
>>       at /Users/brunao/cdng/f7e/scraperjs/test/ScraperPromise.js:154:12
>>       at /Users/brunao/cdng/f7e/scraperjs/src/ScraperPromise.js:9:2689
>>       at /Users/brunao/cdng/f7e/scraperjs/src/DynamicScraper.js:9:2390
>>       at Proto.apply (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/dnode/node_modules/dnode-protocol/index.js:123:13)
>>       at Proto.handle (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/dnode/node_modules/dnode-protocol/index.js:99:19)
>>       at D.dnode.handle (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/dnode/lib/dnode.js:140:21)
>>       at D.dnode.write (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/dnode/lib/dnode.js:128:22)
>>       at SockJSConnection.ondata (stream.js:51:26)
>>       at SockJSConnection.emit (events.js:107:17)
>>       at Session.didMessage (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/lib/transport.js:220:25)
>>       at WebSocketReceiver.didMessage (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/lib/trans-websocket.js:102:40)
>>       at /Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/lib/trans-websocket.js:75:22
>>       at null.<anonymous> (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/api/event_target.js:41:7)
>>       at Array.forEach (native)
>>       at EventTarget.dispatchEvent (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/api/event_target.js:40:33)
>>       at API.receive (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/api.js:30:10)
>>       at instance.parse (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/draft75_parser.js:56:26)
>>       at Draft76Parser.parse (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket/draft76_parser.js:77:42)
>>       at Socket.<anonymous> (/Users/brunao/cdng/f7e/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/node_modules/faye-websocket/lib/faye/websocket.js:72:33)
>>       at Socket.emit (events.js:107:17)
>>       at readableAddChunk (_stream_readable.js:163:16)
>>       at Socket.Readable.push (_stream_readable.js:126:10)
>>       at TCP.onread (net.js:538:20)
>> 
>> 
>> =============================================================================
>> Writing coverage object [/Users/brunao/cdng/f7e/scraperjs/coverage/coverage.json]

>> Writing coverage reports at [/Users/brunao/cdng/f7e/scraperjs/coverage]
>> =============================================================================

=============================== Coverage summary ===============================
Statements   : 99.7% ( 333/334 )
Branches     : 100% ( 111/111 )
Functions    : 98.15% ( 106/108 )
Lines        : 100% ( 333/333 )
================================================================================
>> Exited with code: 4.
Warning: Task "exec:coverage" failed. Use --force to continue.

Aborted due to warnings.

ruipgil commented 8 years ago

I'm also running node v0.12.7 on OSX 10.10. Try to download and install a fresh copy of scraperjs and check if the installation is error free.

brunofiorentino commented 8 years ago

That's what I've done. In the application directory I removed ./node_modules and ran npm install several times (including in the root dir) while trying to make things work. To obtain the test output above, I cloned scraperjs with git into another directory and ran grunt test there.

Considering our systems are identical, do you think some global state (node-wide or system-wide) might be affecting the library? PhantomJS, for instance, must be installed with npm install -g.
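One way to compare that kind of global state between two machines is to print the node version, NODE_PATH, and the location of the phantomjs binary found on PATH. The script below is a rough sketch: findOnPath is a hypothetical helper, the binary name phantomjs is an assumption, and on Windows the executable would carry an extension this does not account for.

```javascript
var fs = require('fs');
var path = require('path');

// Returns the first "<dir>/<name>" that exists among the directories listed
// in `pathValue` (a PATH-style string), or null when none contains `name`.
function findOnPath(name, pathValue) {
    var dirs = (pathValue || '').split(path.delimiter).filter(Boolean);
    for (var i = 0; i < dirs.length; i++) {
        var candidate = path.join(dirs[i], name);
        if (fs.existsSync(candidate)) {
            return candidate;
        }
    }
    return null;
}

console.log('node version:', process.version);
console.log('NODE_PATH:', process.env.NODE_PATH || '(not set)');
console.log('phantomjs on PATH:', findOnPath('phantomjs', process.env.PATH));
```

Running it on both machines and comparing the output would show whether a different globally installed binary is being picked up.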