luin / readability

📚 Turn any web page into a clean view
2.49k stars 312 forks source link

Added bodyText prop #68

Closed migreva closed 8 years ago

migreva commented 8 years ago

If a user only cares about the text on the page, they can use articleBody.textBody to get the full text of the article

luin commented 8 years ago

The content may be not wrapped with a <p></p> (e.g. <div></div>), so the result would miss some text.

What I suggest is to use innerText to get the content. For example:

var cheerio = require('cheerio');

read('http://colorlines.com/archives/2011/08/dispatch_from_angola_faith-based_slavery_in_a_louisiana_prison.html',
function(err, article) {
  console.log(cheerio.load(article.content).root().text());
});
migreva commented 8 years ago

That's not a bad way to do it, but it loads the DOM into a parser twice, i was trying to mitigate that for efficiency reasons.

Since read() already loads the html into jsdom and parses it, I was thinking that it could benefit from the first run at parsing and collect the text content from that.

I can update my PR to make sure it includes <div> and <p> tags, but only if it's something you want in the library. If it's not, I won't bother with the PR and just go about doing it the way you mentioned above

luin commented 8 years ago

That's make sense. I would accept this pull request if it supports other tags. Here's a reference of how cheerio implemented .text(): https://github.com/cheeriojs/cheerio/blob/master/lib/static.js#L108-L124. Notice that text() won't insert a \n for a block element, that's what we should fix.

migreva commented 8 years ago

Okay, check out this new one. I assume that the articleContent will be a root <div> element, I hope that's an okay assumption to make (https://github.com/luin/readability/pull/68/files#diff-f041f89560aea8bb4ef97786baa29b7eR111).

I then loop over all child nodes, no longer filtering by tag, and then only grab text when its a truthy value

luin commented 8 years ago

Looks good to me. Thank you for the pull request!

migreva commented 8 years ago

Thanks for the great library!