Closed migreva closed 8 years ago
The content may be not wrapped with a <p></p>
(e.g. <div></div>
), so the result would miss some text.
What I suggest is to use innerText
to get the content. For example:
var cheerio = require('cheerio');
read('http://colorlines.com/archives/2011/08/dispatch_from_angola_faith-based_slavery_in_a_louisiana_prison.html',
function(err, article) {
console.log(cheerio.load(article.content).root().text());
});
That's not a bad way to do it, but it loads the DOM into a parser twice, i was trying to mitigate that for efficiency reasons.
Since read()
already loads the html into jsdom and parses it, I was thinking that it could benefit from the first run at parsing and collect the text content from that.
I can update my PR to make sure it includes <div>
and <p>
tags, but only if it's something you want in the library. If it's not, I won't bother with the PR and just go about doing it the way you mentioned above
That's make sense. I would accept this pull request if it supports other tags. Here's a reference of how cheerio implemented .text()
: https://github.com/cheeriojs/cheerio/blob/master/lib/static.js#L108-L124. Notice that text()
won't insert a \n
for a block element, that's what we should fix.
Okay, check out this new one. I assume that the articleContent will be a root <div>
element, I hope that's an okay assumption to make (https://github.com/luin/readability/pull/68/files#diff-f041f89560aea8bb4ef97786baa29b7eR111).
I then loop over all child nodes, no longer filtering by tag, and then only grab text when its a truthy value
Looks good to me. Thank you for the pull request!
Thanks for the great library!
If a user only cares about the text on the page, they can use articleBody.textBody to get the full text of the article