grangier / python-goose

HTML Content / Article Extractor, web scraping lib in Python
Apache License 2.0
3.98k stars, 787 forks

top_node algorithm? (test case included) #221

Open ThiemNguyen opened 9 years ago

ThiemNguyen commented 9 years ago

Hi everyone, I've been working with Goose for a couple of weeks, trying to use it in my project. I'm digging into the source code and trying out some improvements. The top_node property of an extracted article (the node that holds the main content) seems to be calculated in ContentExtractor::calculate_best_node. AFAIK, it searches for p, pre and td elements, rejects the ones with too little text or a high link density, and then walks through nodes_with_text to find the best top node, using a helper method called is_boostable. The problem is that I can't understand these lines of code (lines 90-130):

        nodes_number = len(nodes_with_text)
        negative_scoring = 0
        bottom_negativescore_nodes = float(nodes_number) * 0.25

        for node in nodes_with_text:
            boost_score = float(0)
            # boost
            if(self.is_boostable(node)):
                if cnt >= 0:
                    boost_score = float((1.0 / starting_boost) * 50)
                    starting_boost += 1
            # nodes_number
            if nodes_number > 15:
                if (nodes_number - i) <= bottom_negativescore_nodes:
                    booster = float(bottom_negativescore_nodes - (nodes_number - i))
                    boost_score = float(-pow(booster, float(2)))
                    negscore = abs(boost_score) + negative_scoring
                    if negscore > 40:
                        boost_score = float(5)

            text_node = self.parser.getText(node)
            word_stats = self.stopwords_class(language=self.get_language()).get_stopword_count(text_node)
            upscore = int(word_stats.get_stopword_count() + boost_score)

            # parent node
            parent_node = self.parser.getParent(node)
            self.update_score(parent_node, upscore)
            self.update_node_count(parent_node, 1)

            if parent_node not in parent_nodes:
                parent_nodes.append(parent_node)

            # parentparent node
            parent_parent_node = self.parser.getParent(parent_node)
            if parent_parent_node is not None:
                self.update_node_count(parent_parent_node, 1)
                self.update_score(parent_parent_node, upscore / 2)
                if parent_parent_node not in parent_nodes:
                    parent_nodes.append(parent_parent_node)
            cnt += 1
            i += 1

These lines are not documented. I searched the other source files, the repo issues, and even the original goose repo, but I still can't work out how this is supposed to work. I also found a case where the extractor failed to detect the top_node (it returned nothing): http://trendsread.com/articles/24

Any ideas? Thanks!
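
To make the arithmetic in the quoted lines easier to follow, here is a minimal standalone sketch of the per-node score. It is a reading of the snippet above, not the library's API: the function names `positional_boost` and `upscore` are made up for illustration, and the initial values (`starting_boost = 1.0`, `negative_scoring = 0`, `i` counting from 0) are assumptions, since the snippet does not show where `cnt`, `i` and `starting_boost` are initialised.

```python
def positional_boost(i, nodes_number, boostable, starting_boost, negative_scoring=0):
    """Hypothetical helper mirroring the quoted loop body.

    Returns (boost_score, next_starting_boost) for the node at index i.
    """
    boost_score = 0.0

    # Boostable nodes get a decaying bonus: 50, 25, 16.67, ...
    if boostable:
        boost_score = (1.0 / starting_boost) * 50
        starting_boost += 1

    # On pages with more than 15 candidates, the last quarter of the
    # nodes is penalised quadratically. This overwrites any bonus above.
    if nodes_number > 15:
        bottom = nodes_number * 0.25
        if (nodes_number - i) <= bottom:
            booster = bottom - (nodes_number - i)
            boost_score = -(booster ** 2)
            # Once the penalty exceeds 40 it is replaced by a flat +5.
            if abs(boost_score) + negative_scoring > 40:
                boost_score = 5.0

    return boost_score, starting_boost


def upscore(stopword_count, boost_score):
    """Score added to the parent node; int() truncates toward zero."""
    return int(stopword_count + boost_score)


# First boostable node on a 20-node page: full bonus of 50.
print(positional_boost(0, 20, True, 1.0))
# Last node on a 20-node page: booster is 4, so the penalty is -16.
print(positional_boost(19, 20, False, 1.0))
# Last node on a 40-node page: the raw penalty is -81, which exceeds 40
# and is replaced by +5.
print(positional_boost(39, 40, False, 1.0))
```

Under this reading, each candidate adds `upscore` to its parent and half of it to its grandparent (`upscore / 2`, which is integer division in Python 2 because `upscore` is an int). The top node would then be whichever ancestor accumulates the highest total, though that selection step is outside the quoted lines.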

vetal4444 commented 9 years ago

I don't think it's a good algorithm. It fails on a page as simple as this:

<html>
<head>
    <title>Some title</title>
    <link rel="canonical" href="http://example.org">
    <meta property="og:image" content="http://example.org/thumbnail.png">
</head>
<body>
    <div class="container">
        <div class="content">

        <div itemscope itemprop="http://schema.org/Article">
            <h1 itemprop="name">Some title</h1>
            <div itemprop="datePublished" datetime="2012-01-01T12:34:00">2012-01-01 12:34:00</div>
            <p itemprop="articleBody">
                Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque vitae justo nec tortor tincidunt dictum in in libero. Maecenas tempus, leo in vulputate tempus, ipsum libero imperdiet lectus, a congue mauris ante sed nisl. Sed sit amet ultricies orci. Curabitur sed orci libero. In viverra mi non lacus accumsan venenatis. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum sit amet porttitor nulla, vel placerat tortor. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Interdum et malesuada fames ac ante ipsum primis in faucibus. Pellentesque maximus eu justo eu tincidunt. Fusce euismod, mauris vitae fringilla rutrum, dui nisl dictum est, egestas faucibus sapien ipsum vitae justo. Maecenas ac aliquet tellus. Vivamus libero neque, volutpat quis tempor vitae, auctor vitae sapien. Mauris ultricies semper lorem, eu cursus metus dignissim non. Vivamus bibendum sem sed iaculis maximus.
            </p>
        </div>

        </div>
    </div>
</body>
</html>

After extraction: article.cleaned_text == "" :(

muggot commented 9 years ago

Did you configure Latin as the language, with matching stopwords, for the example above? :)
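
The question above matters because the quoted scoring code adds a stopword count to every score, so the configured language decides which words count at all. Below is a rough sketch of that dependency. The two tiny stopword sets, the helper names `count_stopwords` and `is_candidate`, and the `min_stopwords` threshold are all assumptions made up for illustration; they are not Goose's real word lists or API.

```python
import string

# Hypothetical, deliberately tiny stopword sets (not Goose's real lists).
EN_STOPWORDS = {"the", "on", "and", "it", "was", "a", "in", "of", "to"}
ES_STOPWORDS = {"el", "la", "de", "y", "en", "que", "los"}


def count_stopwords(text, stopwords):
    """Strip punctuation, lowercase, and count words found in the set."""
    table = str.maketrans("", "", string.punctuation)
    words = text.translate(table).lower().split()
    return sum(1 for word in words if word in stopwords)


def is_candidate(text, stopwords, min_stopwords=2):
    """Assumed filter: keep a node only if it has enough stopwords."""
    return count_stopwords(text, stopwords) > min_stopwords


sentence = "The cat sat on the mat, and it was happy."

# Same text, different language configuration.
print(count_stopwords(sentence, EN_STOPWORDS))  # 6
print(count_stopwords(sentence, ES_STOPWORDS))  # 0
print(is_candidate(sentence, EN_STOPWORDS))     # True
print(is_candidate(sentence, ES_STOPWORDS))     # False
```

With a word list that does not match the text, a paragraph can score zero and drop out of the candidates entirely. Whether that is what produces the empty cleaned_text in the test page above has not been confirmed in this thread.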