Open ThiemNguyen opened 9 years ago
I think it's not a good algorithm. It fails on a page as simple as this:
<html>
<head>
<title>Some title</title>
<link rel="canonical" href="http://example.org">
<meta property="og:image" content="http://example.org/thumbnail.png">
</head>
<body>
<div class="container">
<div class="content">
<div itemscope itemprop="http://schema.org/Article">
<h1 itemprop="name">Some title</h1>
<div itemprop="datePublished" datetime="2012-01-01T12:34:00">2012-01-01 12:34:00</div>
<p itemprop="articleBody">
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque vitae justo nec tortor tincidunt dictum in in libero. Maecenas tempus, leo in vulputate tempus, ipsum libero imperdiet lectus, a congue mauris ante sed nisl. Sed sit amet ultricies orci. Curabitur sed orci libero. In viverra mi non lacus accumsan venenatis. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum sit amet porttitor nulla, vel placerat tortor. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Interdum et malesuada fames ac ante ipsum primis in faucibus. Pellentesque maximus eu justo eu tincidunt. Fusce euismod, mauris vitae fringilla rutrum, dui nisl dictum est, egestas faucibus sapien ipsum vitae justo. Maecenas ac aliquet tellus. Vivamus libero neque, volutpat quis tempor vitae, auctor vitae sapien. Mauris ultricies semper lorem, eu cursus metus dignissim non. Vivamus bibendum sem sed iaculis maximus.
</p>
</div>
</div>
</div>
</body>
</html>
After extraction: article.cleaned_text == "" :(
Did you configure the Latin language and stopwords for the example above? :)
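To illustrate why the language setting could matter here, below is a minimal sketch of stopword counting. It assumes the extractor scores text by counting stopwords from a language-specific list; the two word lists and the count_stopwords helper are hypothetical and made up for illustration, not taken from Goose.

```python
import re

# Hypothetical, tiny stopword lists for illustration only.
# Real stopword lists are much longer.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "is", "that", "it", "for", "with", "as"}
LATIN_STOPWORDS = {"et", "in", "sed", "non", "ad", "per", "ac", "nec", "sit", "vel"}


def count_stopwords(text, stopwords):
    """Count how many word tokens in `text` appear in `stopwords`."""
    # Runs of letters only (no digits, underscores or punctuation).
    words = re.findall(r"[^\W\d_]+", text.lower())
    return sum(1 for word in words if word in stopwords)


sample = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
          "Sed sit amet ultricies orci.")

english_hits = count_stopwords(sample, ENGLISH_STOPWORDS)
latin_hits = count_stopwords(sample, LATIN_STOPWORDS)
print(english_hits, latin_hits)
```

If a paragraph is only kept when its stopword count passes some threshold, filler text in a language the extractor is not configured for could be dropped entirely, which would be consistent with an empty cleaned_text.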
Hi everyone, I've been working with Goose for a couple of weeks, trying to use it in my project. I've been digging into the source code and trying out some improvements. The top_node property of an extracted article (the one holding the bulk of the content) seems to be calculated in ContentExtractor::calculate_best_node. AFAIK, it searches for p, pre and td elements, rejects those with too little text or a high link density, then walks through nodes_with_text to find the best top node with the help of a method called is_boostable.
The problem is that I can't understand these lines of code (line 90-line 130): they are not documented yet. I searched through the other source files, the repo issues, and even the original Goose repo, but still haven't figured out how it works. I also found a case where the extractor failed to detect the top_node (it returned nothing): http://trendsread.com/articles/24
Any ideas? Thanks!
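For reference, the candidate-filtering step described above (collect p, pre and td elements, then reject those with too little text or a high link density) can be sketched roughly as follows. This is not Goose's code: it uses only the standard library, the names CandidateCollector and keep_candidates are hypothetical, and the thresholds (min_words, max_link_density) and the use of a plain word count for "too little text" are assumptions. It does not cover is_boostable or the undocumented scoring lines.

```python
from html.parser import HTMLParser

CANDIDATE_TAGS = {"p", "pre", "td"}


class CandidateCollector(HTMLParser):
    """Collect text and link-text length for each p/pre/td element.

    Simplification: a candidate nested inside another candidate is
    folded into the outer one.
    """

    def __init__(self):
        super().__init__()
        self.candidates = []   # finished candidates
        self._current = None   # candidate currently being read
        self._in_link = 0      # depth of open <a> tags inside it

    def handle_starttag(self, tag, attrs):
        if tag in CANDIDATE_TAGS and self._current is None:
            self._current = {"tag": tag, "text": [], "link_chars": 0}
        elif tag == "a" and self._current is not None:
            self._in_link += 1

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current["tag"]:
            # Join the pieces and normalise whitespace.
            joined = "".join(self._current["text"])
            self._current["text"] = " ".join(joined.split())
            self.candidates.append(self._current)
            self._current = None
            self._in_link = 0
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"].append(data)
            if self._in_link:
                self._current["link_chars"] += len(data.strip())


def keep_candidates(html, min_words=5, max_link_density=0.5):
    """Return the text of candidates that pass both (assumed) filters."""
    collector = CandidateCollector()
    collector.feed(html)
    kept = []
    for cand in collector.candidates:
        text = cand["text"]
        if not text or len(text.split()) < min_words:
            continue  # too little text
        if cand["link_chars"] / len(text) > max_link_density:
            continue  # mostly link text
        kept.append(text)
    return kept


example = (
    "<div><p>Too short.</p>"
    '<p><a href="#">one two three four five six</a></p>'
    "<p>This paragraph has enough plain words to pass.</p></div>"
)
print(keep_candidates(example))
```

In this sketch the first paragraph is dropped for having too few words, the second because all of its text sits inside a link, and only the third survives. If every candidate on a page is rejected at this stage, there is nothing left to pick a top node from.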