matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License

paginate + JSON issues #336

Closed leonardodino closed 5 years ago

leonardodino commented 5 years ago

Stream causes JSON errors on the last page being crawled.

The paginate method throws when I try to go past the last page.

The website I'm scraping always displays a "next page" link; the last page it leads to is empty and has no such link.

The "JSON" passed to the lib/promisify function is:

[

  {
    "id": "1121315593293041666"
  }
,]

It's being invalidated by the trailing comma.
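
A minimal sketch of why that payload is rejected: `JSON.parse` follows the strict JSON grammar, which does not allow a comma directly before the closing `]`. The `payload` string below is a hand-copied approximation of the output above (the exact whitespace x-ray emits may differ), and `tryParse` is a hypothetical helper used only for illustration.

```javascript
// Approximation of the serialized stream output shown above.
// Assumption: whitespace may not match what x-ray really writes.
const payload = '[\n  {\n    "id": "1121315593293041666"\n  }\n,]';

// Hypothetical helper: wraps JSON.parse so the failure can be inspected
// instead of surfacing as an unhandled rejection.
function tryParse(text) {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    return { ok: false, error: err };
  }
}

const result = tryParse(payload);
console.log(result.ok); // false
console.log(result.error instanceof SyntaxError); // true
```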

Your environment

Steps to reproduce

I'm using await x(url, context, [selector]).paginate(nextPageSelector)

I've created an example reproduction: https://github.com/leonardodino/x-ray-repro-336 (it's a different use-case, but outlines the same behaviour)

There's a bit of ceremony wrapping x-ray to address #339

Expected behaviour

Actual behaviour

The trailing comma breaks the JSON parsing.

Unhandled rejection SyntaxError: Unexpected token ] in JSON at position 44
SyntaxError: Unexpected token ] in JSON at position 44
    at JSON.parse (<anonymous>)
    at /Users/leonardodino/Sites/redacted/node_modules/x-ray/lib/promisify.js:28:24
    at Readable.<anonymous> (/Users/leonardodino/Sites/redacted/node_modules/stream-to-string/index.js:18:13)
    at emitNone (events.js:106:13)
    at Readable.emit (events.js:208:7)
    at endReadableNT (_stream_readable.js:1064:12)
    at _combinedTickCallback (internal/process/next_tick.js:139:11)
    at process._tickDomainCallback (internal/process/next_tick.js:219:9)
From previous event:
    at streamToPromise (/Users/leonardodino/Sites/redacted/node_modules/x-ray/lib/promisify.js:22:10)
    at Function.node.then (/Users/leonardodino/Sites/redacted/node_modules/x-ray/index.js:186:14)
    at process._tickDomainCallback (internal/process/next_tick.js:229:7)
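
Until the stream output itself is fixed, one possible stopgap is to repair the text before parsing it. The sketch below is not x-ray's code: `parseWithTrailingComma` is a hypothetical helper, and it assumes the only defect is a single dangling comma right before the final `]`, as in the payload above.

```javascript
// Hypothetical helper, not part of x-ray: drops one dangling comma that
// sits directly before the final "]" and then parses the repaired text.
function parseWithTrailingComma(text) {
  const repaired = text.replace(/,\s*\]\s*$/, ']');
  return JSON.parse(repaired);
}

// Approximation of the malformed output from this report.
const raw = '[\n  {\n    "id": "1121315593293041666"\n  }\n,]';
const items = parseWithTrailingComma(raw);

console.log(items.length); // 1
console.log(items[0].id); // "1121315593293041666"
```

This only treats the symptom. The separator is presumably written before the stream knows the next page is empty, so a real fix would belong wherever the array is serialized.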