matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License

Better error catching and recovery #84

Open kengz opened 9 years ago

kengz commented 9 years ago

I came across a URL that apparently cannot be crawled. Below is the sample code:


x('https://beap.gemini.yahoo.com/mbclk?bv=1.0.0&es=gWe3t6gGIS.s0yEhy1nskxLUiJSvhpYTFaxji9iVUTQ3iLFRut_mRIYgTyb7DK5m4ametAwu.DJl1Qbh20LC4zvTwtXnDF0j0mRGcQWCkWfGL5l6Gk4fq9dUChtwa4UpPF42Z.wVksLAQvRWGbeXLXmXyD6gnmDjqf0JmAtK2u.BeMBy_shjIDAOsBqy3tKclgt_aSMCwSRRMt92H0nhlhciKaO0lZkFyF_lYLE8TI2SEJM8ZFoiL2dfcCkEKXIgiNsH9Hy0_BsEPvEU1fo.1Kc9QYCWS72xw19a0eJ0hIDMbG5VQG.XDxeyefruE5gojjYnTLjzBdMEr3kH9svkz0TV7IyFIT_d_goLiQ78eaoTbfQJzRAF7vlmfHYocyQ1Sf1jFJPQRAb1uv5YUDcORj4CZ6XX56Y0xPXcWyfqi9RwbRI42_5sTS16sRGM7tRkDd7los2L7wKuH9egecgyTSDuMv0rO7gzTnWQ.ur4vTFV8XvReE5TKRkMXo89aaccRTcS8h72HxDTpraBQtjNbKaE5lZBdwMVZnSrQlfSUjjCNZfiaTfb4dM-%26lp=', 'title')(function(err, title) {
  console.log(title) 
})

which throws the error:

/Users/kengz/Google Drive/Quiver/node_modules/reqscraper/node_modules/x-ray/index.js:166
      var $ = html.html ? html : cheerio.load(html);
                  ^
TypeError: Cannot read property 'html' of null
    at load (/Users/kengz/Google Drive/Quiver/node_modules/reqscraper/node_modules/x-ray/index.js:166:19)
    at /Users/kengz/Google Drive/Quiver/node_modules/reqscraper/node_modules/x-ray/index.js:85:19
    at /Users/kengz/Google Drive/Quiver/node_modules/reqscraper/node_modules/x-ray/index.js:248:14
    at _done (/Users/kengz/Google Drive/Quiver/node_modules/reqscraper/node_modules/x-ray/node_modules/x-ray-crawler/node_modules/enqueue/index.js:78:20)
    at _once (/Users/kengz/Google Drive/Quiver/node_modules/reqscraper/node_modules/x-ray/node_modules/x-ray-crawler/node_modules/enqueue/index.js:93:15)
    at result (/Users/kengz/Google Drive/Quiver/node_modules/reqscraper/node_modules/x-ray/node_modules/x-ray-crawler/lib/index.js:107:7)
    at /Users/kengz/Google Drive/Quiver/node_modules/reqscraper/node_modules/x-ray/node_modules/x-ray-crawler/node_modules/wrap-fn/index.js:121:18
    at /Users/kengz/Google Drive/Quiver/node_modules/reqscraper/node_modules/x-ray/node_modules/x-ray-crawler/lib/http-driver.js:42:16
    at Request.callback (/Users/kengz/Google Drive/Quiver/node_modules/reqscraper/node_modules/x-ray/node_modules/x-ray-crawler/node_modules/superagent/lib/node/index.js:797:3)
    at IncomingMessage.<anonymous> (/Users/kengz/Google Drive/Quiver/node_modules/reqscraper/node_modules/x-ray/node_modules/x-ray-crawler/node_modules/superagent/lib/node/index.js:990:12)
[Finished in 1.4s with exit code 1]

I tried wrapping the call in a try-catch to allow for recovery (so that it doesn't just crash), to no avail.

Any clue what's causing the first line of the error (with cheerio), and whether it can be caught and recovered from?
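
A note on why the try-catch probably has no effect: the outermost frames of the trace are superagent's response callbacks (`IncomingMessage.<anonymous>`), which suggests the TypeError is thrown on a later event-loop tick, after the synchronous `try` block has already exited. The sketch below is not x-ray code. `fakeScrape` is a hypothetical stand-in that throws the same kind of TypeError from a timer, and `process.on('uncaughtException')` is shown only as a last-resort way to keep a Node process from exiting.

```javascript
// Flags used to show which mechanism actually sees the error.
var caughtByTryCatch = false;
var caughtByProcessHandler = null;

// Last-resort handler: fires for errors that no try/catch intercepted.
process.on('uncaughtException', function (err) {
  caughtByProcessHandler = err;
  console.error('recovered from:', err.message);
});

// Hypothetical stand-in for an async scraper call (NOT the x-ray API).
// The failure happens inside a timer callback, i.e. on a later tick.
function fakeScrape(callback) {
  setTimeout(function () {
    var html = null;
    html.html; // throws a TypeError, like index.js:166 in the trace
    callback(null, 'never reached');
  }, 0);
}

try {
  // fakeScrape returns immediately; nothing has thrown yet.
  fakeScrape(function (err, title) {
    console.log(title);
  });
} catch (e) {
  // Not reached: the try block finished before the timer fired.
  caughtByTryCatch = true;
}
```

Under these assumptions the `catch` branch never runs, and only the process-level handler sees the error. Such a handler leaves the failed request without a result, so it is a stopgap rather than a real recovery path.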

37 commented 9 years ago

On my end that URL bounces and redirects to 'http://lp.canadianvisaexpert.com/newg_lp/Canada/CanadianLP?utm_term=lp-canadianlp&af=can_2349_[yoursubidparameter]&utm_clickid='.

This could have something to do with it. If that URL is indeed the one you're trying to scrape, why not just use it directly?

kengz commented 9 years ago

I'm actually scraping Yahoo News en masse. That link shares the same HTML attributes as the normal news URLs, so when the scraper encounters it, my code crashes.
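
For a bulk crawl like this, one possible approach is to have the load step report a missing body through the callback instead of dereferencing it. The following is a minimal sketch of that idea, not the actual x-ray source: `safeLoad` and the injected `parse` function are hypothetical names (in x-ray the parser would be `cheerio.load`, per the trace above), and the assumption that the failing response body is `null` comes from the error message.

```javascript
// Hypothetical guard around the line that throws in the trace:
//   var $ = html.html ? html : cheerio.load(html);
// `parse` is injected so the sketch runs without cheerio installed.
function safeLoad(html, parse, callback) {
  if (html === null || html === undefined) {
    // Report the problem instead of dereferencing null.
    return callback(new Error('empty response body'));
  }
  var $;
  try {
    // Already-parsed documents expose .html; raw strings need parsing.
    $ = html.html ? html : parse(html);
  } catch (e) {
    return callback(e);
  }
  callback(null, $);
}

// Usage: collect failures and keep going over a batch of bodies.
var bodies = ['<title>ok</title>', null];
var results = [];
var failures = [];

bodies.forEach(function (body) {
  // Stand-in parser that just wraps the raw string.
  var parse = function (source) { return { source: source }; };
  safeLoad(body, parse, function (err, doc) {
    if (err) failures.push(err.message);
    else results.push(doc);
  });
});
```

With this shape, a bad URL becomes one entry in `failures` and the rest of the batch continues.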