luin / readability

📚 Turn any web page into a clean view

Empty "HTML" causes massive issues #15

Closed kopertop closed 10 years ago

kopertop commented 10 years ago

For certain URLs, the response body comes back empty, which I believe happens because the provider blocks the request. When this happens, no error is passed to the callback. Instead, jsdom throws an exception, because the empty HTML value is passed straight into it.

Simple steps to reproduce (STR):

    var readability = require('node-readability');
    var url = 'http://dotearth.blogs.nytimes.com/2013/11/21/did-90-companies-cause-the-climate-crisis-of-the-21st-century/';
    readability.read(url, { timeout: 5000 }, function(err, article) {
      // It will never reach this point
      console.log(err, article);
    });

Adding these two lines at line 94 of readability.js avoids the exception (although it doesn't fix the underlying failure to read the URL):

    if (typeof body !== 'string') body = body.toString();
    if (!body) return callback('No Body Found');

I can turn this into a pull request if needed, but I'm not sure what the deeper issue is that makes these URLs unreadable.
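
For illustration, here is a rough sketch of how such a guard might look as a standalone function. `checkBody` is a hypothetical helper, not part of node-readability, and it goes slightly beyond the two lines above on the assumption that the body may also be `null` or `undefined` (where calling `toString()` would itself throw) and that passing an `Error` object rather than a bare string to the callback is preferable:

```javascript
// Hypothetical helper (not part of node-readability): validates a fetched
// response body before it is handed to an HTML parser such as jsdom.
function checkBody(body, callback) {
  // null and undefined have no toString(), so reject them before converting.
  if (body === null || body === undefined) {
    return callback(new Error('No Body Found'));
  }
  // Buffers and other non-string values are converted to a string.
  var html = typeof body === 'string' ? body : body.toString();
  // Treat an empty or whitespace-only body as missing.
  if (!html.trim()) {
    return callback(new Error('No Body Found'));
  }
  return callback(null, html);
}

// Usage: an empty body yields an error through the callback instead of a throw.
checkBody('', function (err, html) {
  console.log(err ? err.message : html); // prints "No Body Found"
});
```

With a check like this, the caller's callback always fires, so the blocked-URL case shows up as a normal `err` value rather than an uncaught exception.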

luin commented 10 years ago

A pull request is welcome, because we should handle this error :-)