Implement boilerplate scrubbing

jfroelich commented 10 years ago

Distinguish between useful content and boilerplate.

Doing some prelim research:

Boilerplate detection using shallow text features. http://www.decom.ufop.br/menotti/rp122/sem/sem1-brayan-milton-art.pdf
Readability. https://code.google.com/p/arc90labs-readability/source/browse/trunk/js/readability.js
jusText. https://justext.googlecode.com/svn/trunk/justext/core.py
http://prezi.com/bammeubeb6yd/automatic-identification-of-informative-sections-of-webpages/
Agglomerative clustering - https://code.google.com/p/figue/source/browse/trunk/figue.js
Goldminer. http://polibits.gelbukh.com/2013_48/More%20Effective%20Boilerplate%20Removal%20-%20the%20GoldMiner%20Algorithm.pdf. Site wide template approach.
Removing Boilerplate and Duplicate Content from Web Corpora. http://is.muni.cz/th/45523/fi_d/phdthesis.pdf.
VIPS: a Vision-based Page Segmentation Algorithm. http://research.microsoft.com/pubs/70027/tr-2003-79.pdf

jfroelich commented 10 years ago

Notes on Boilerplate Detection using Shallow Text Features:

Simply analyzing basic features such as the number of words in a page and link density can achieve satisfactory accuracy that is comparable to more intricate statistical analyses.
One interesting prior heuristic determines that the largest contiguous text area with the least amount of HTML tags is the full text, based on the observations that the tag density within boilerplate text is higher than within full-text content and that main content is usually longer than boilerplate text.
Considering the visual render of the page (e.g. examining CSS) is too computationally intensive.
Finding a per-site template is too domain specific and too computationally intensive.
Using a list of boilerplate words that are frequent across the entire corpus (the web) is OK.
Focus primarily on headers, paragraphs, divs, and anchors.
Rather than look at words at a deeper semantic level, look at more shallow features that are language and domain independent: average word length, average sentence length, number of words.
Also examine context (neighboring sections). Boilerplate sections tend to follow boilerplate sections. Content sections tend to follow content sections. Main content is usually surrounded by boilerplate.
Other interesting features
The number of proper case or all uppercase words
The ratio of proper/upper case words to the total number of words
The ratio of full stops (sentence breaks) to the total number of words
Number of date time features
Number of pipe symbols ( | )
Link density (anchor percentage): the ratio of the number of tokens within an anchor to the number of tokens within the block (the A itself is considered an inline tag and does not? designate a block).
The text density of each particular block. Not quite sure what this is. Section 3.4. Essentially split the text into lines (segments of about 80 to 90 chars); the density is the ratio of the number of tokens to the number of lines, with a special case for the last line (or only line).
The approach: generate a list of blocks, then classify each block as either boilerplate or content. Inline elements (the A tag) are treated as a special case.
Statistical analysis showed that relative position (e.g. block number from 0), average word length (sum of n chars in word / n words), and number of words (in block) are strong indicators.
Looking at block sequence provided an even better feature (whether a content block followed a content block).
Any block with a text density less than 10.5 is regarded boilerplate...then further research suggested to accept all blocks that have a minimum text density of 7 and a maximum link density of 0.35
The very simple strategy of keeping all blocks with at least 10 words performed just as well as more advanced techniques
Our classifier based on the number of words per block and its link density improves on the baseline by 33.3%
Can define additional heuristics like LargestContentFilter (identify the one block with the most text) or the MainContentFilter (LargestContent filter excluding comments sections that appear below news).
The classifier using Number of Words per Block performed slightly better than the classifier using Text Density
Three classes of text emerge: the main text with text density > 10, the remaining low density blocks as boilerplate, and a mixed class dominated by hypertext.
For the web, there is a strong correlation between short text and boilerplate text as well as between long text and content text.
Looking at text density mirrors how authors produce content linguistically. Split text into lines (e.g. every 80 chars). An incomplete sentence generally never wraps to more than one line, in which case, text density is equivalent to n words. Whereas text consisting of complete sentences will always wrap, be averaged to typical number of words in a sentence, and encoded then as a density value of limited range.
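A minimal sketch of the two features and the threshold rule from the notes above (text density of at least 7, link density of at most 0.35). The helper names and the `{ text, anchorText }` block shape are assumptions for illustration, not calamine code, and the handling of the last wrapped line is my reading of section 3.4.

```javascript
// Hypothetical helpers; a block is assumed to be { text, anchorText }.
const LINE_WIDTH = 80; // wrap width in characters, as in the notes

function tokenize(text) {
  return text.trim().split(/\s+/).filter(Boolean);
}

// Greedily word-wrap tokens into lines of at most `width` characters.
function wrapLines(tokens, width = LINE_WIDTH) {
  const lines = [];
  let current = [];
  let length = 0;
  for (const token of tokens) {
    const added = current.length === 0 ? token.length : length + 1 + token.length;
    if (current.length > 0 && added > width) {
      lines.push(current);
      current = [token];
      length = token.length;
    } else {
      current.push(token);
      length = added;
    }
  }
  if (current.length > 0) lines.push(current);
  return lines;
}

// Tokens per wrapped line. When the text wraps, the last (partial) line is
// left out; when it fits on one line, density equals the number of words.
function textDensity(text) {
  const lines = wrapLines(tokenize(text));
  if (lines.length === 0) return 0;
  if (lines.length === 1) return lines[0].length;
  const full = lines.slice(0, -1);
  const tokenCount = full.reduce((sum, line) => sum + line.length, 0);
  return tokenCount / full.length;
}

// Ratio of tokens inside anchors to all tokens in the block.
function linkDensity(text, anchorText) {
  const total = tokenize(text).length;
  return total === 0 ? 0 : tokenize(anchorText).length / total;
}

function isContent(block, minTextDensity = 7, maxLinkDensity = 0.35) {
  return textDensity(block.text) >= minTextDensity &&
    linkDensity(block.text, block.anchorText) <= maxLinkDensity;
}

const nav = { text: 'Home | About | Contact', anchorText: 'Home About Contact' };
console.log(textDensity(nav.text)); // 5 (single line, so density = word count)
console.log(isContent(nav)); // false
```

A short navigation line never wraps, so its density is just its word count, which is why the "at least 10 words" rule behaves so similarly.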

jfroelich commented 10 years ago

Actually, we should generate the DOM here. Grouping should essentially involve building a new DOM from the blocks. We then score each dom node, and apply scores to container-type dom nodes (div/ul/ol/p/table/tr).

Then we find the best container. So we are not trying to find bounds of blocks in the list, but the best container in the hierarchy.

Each container's score should reflect its block scores, in aggregate.

We should allow for negative block scores in the scoring section so that blocks can negatively influence the container score.

We don't need to filter low scoring blocks when rebuilding the hierarchy. The filtering is a natural side effect of choosing the best container. It also avoids some of the need to reinsert intervening boilerplate.

We still probably need to extend each container to its siblings at the same depth, and recalculate the container's score after the merging of siblings. Not 100% sure what the criteria are for merging. It could be whether the score increased or decreased above or below some threshold.

Note: we should refactor the biases to allow for negative scoring. We want score magnitude. For example, bad link density should not only not increase the score, it should actually decrease it.

Previous note: the extend method should consider hierarchical position, not simply block index distance. Maybe what we want is something analogous to agglomerative clustering. What we are doing is defining our own distance function where two blocks are similar if they share the same approximate depth. The path between the blocks is based on the number of traversals up the hierarchy and then back down from the common ancestor of any two blocks. Paths are analogous to axes in XPath. Maybe we can also think of them as vectors into the hierarchical structure?
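One way to sketch that distance function: count the steps from each block up to their nearest common ancestor. The nodes here are plain objects with a `parent` reference, which is an assumption made so the example runs outside a browser; with real DOM nodes, `parentNode` would play the same role.

```javascript
// Hypothetical node shape: { name, parent } with parent === null at the root.
function ancestors(node) {
  const path = [];
  for (let n = node; n; n = n.parent) path.push(n);
  return path; // the node itself first, the root last
}

// Steps up from `a` to the nearest common ancestor plus steps back down to `b`.
function treeDistance(a, b) {
  const stepsFromA = new Map();
  ancestors(a).forEach((n, i) => stepsFromA.set(n, i));
  const pathB = ancestors(b);
  for (let j = 0; j < pathB.length; j++) {
    if (stepsFromA.has(pathB[j])) return stepsFromA.get(pathB[j]) + j;
  }
  return Infinity; // the nodes do not share a tree
}

const root = { name: 'body', parent: null };
const article = { name: 'article', parent: root };
const nav = { name: 'nav', parent: root };
const p1 = { name: 'p', parent: article };
const p2 = { name: 'p', parent: article };
const li = { name: 'li', parent: nav };

console.log(treeDistance(p1, p2)); // 2 (siblings: up one, down one)
console.log(treeDistance(p1, li)); // 4 (up two to body, down two)
```

A function like this could serve as the distance measure for an agglomerative clustering pass, with siblings merging before cousins.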

Then there is the question of how to rebuild the hierarchy. What we could do is use the original input hierarchy and scrub it in place, augmenting it with content scores and filtering out low scores. That requires a rewrite of this entire module because it is fundamentally different. It is also top down. One of the negative factors is that the original hierarchy is subject to the control of the author and not calamine, which means we get a junk hierarchy full of useless containers. For example, div wrappers for CSS and other layout techniques where the purpose of the container is purely for layout and not for designating main content. Furthermore, we don't even want to consider all of the nodes in the original hierarchy. There are certain nodes we can plainly ignore (meta, style, head, etc.). We also have to scrub the original attributes except for the ones we want.

Note another difficulty with the original hierarchy is that we need to be able to modify it in place. I solved this issue with the way sanitize was built to do in place mutation during iteration by doing a look-ahead for the node to iterate to next based on the mutation operation (keep node, remove node, unwrap node, etc). I am not sure I ever solved how to do splitting. I already know that I need to do this, because I want to consider BR elements as resulting in two blocks.

One other difficulty is the liveness of the input HTMLDocument object. We need to be able to operate under the assumption that the original document is not important anymore, because we are going to be mutating it. This works if the doc is created via document.implementation, but I am not sure how it would work otherwise. Note that liveness is tempting because it partially solves the image-size issue, which is deferred until the document is attached and the images are loaded in the absence of explicit attributes. But on the other hand, if liveness is an issue, the main method of the API, transform-Document, needs to specify that the input document should not be live, and show a simple way of creating a non-live clone of a live HTMLDocument object.

I think another factor is the difficulty and processing time required in building our own hierarchy. Finding common ancestors and measuring depth and so forth seems like it is going to take a lot of time. Our generate-Blocks method is basically unfairly depriving us of the original hierarchical position of blocks, and now we have to use heuristics and wasted processing just to identify it again.

When scoring each level of the hierarchy, consider both the text nodes within a container, and the text within its sub-nodes. Basically each container has two scores: the score based on its own text and the score based on its child scores. We can merge these together; we just need to make sure the child score sum does not overwrite/ignore the container's own score. Second, the score contributed by the children should affect the container's score differently than the score from the container's own text.

jfroelich commented 10 years ago

Under the DOM method, we would want to preprocess for testing purposes, then score from the bottom up. We find the leaves and go upward once each leaf in a branch has been scored. We then accumulate the leaf scores into the branch, taking into consideration the branch's own score. Then we hunt for the best branch based on its score. Then we consider nearby branches and possibly merge them or basically build an in-order list of the branches we consider as the 'main' content. Then we strip out all other branches, while still retaining the branches intervening the main section. Then maybe we do some cleanup and we are done.
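A rough sketch of the bottom-up pass described above, using a post-order recursion so each branch is scored only after all of its children. The node shape `{ name, ownScore, children }` and the `childWeight` factor are assumptions for illustration; the weight is one possible way to make child scores affect a container differently than its own text.

```javascript
// Hypothetical node shape: { name, ownScore, children: [] }.
function leaf(name, ownScore) {
  return { name, ownScore, children: [] };
}

// Post-order: children are scored first, then accumulated into the branch.
// Negative child scores are allowed and pull the container down.
function scoreTree(node, childWeight = 0.5) {
  let childSum = 0;
  for (const child of node.children) {
    childSum += scoreTree(child, childWeight);
  }
  node.score = node.ownScore + childWeight * childSum;
  return node.score;
}

// Pick the highest scoring container (a node that has children).
function findBestContainer(node, best = null) {
  if (node.children.length > 0 && (best === null || node.score > best.score)) {
    best = node;
  }
  for (const child of node.children) {
    best = findBestContainer(child, best);
  }
  return best;
}

const page = {
  name: 'body',
  ownScore: 0,
  children: [
    { name: 'nav', ownScore: -5, children: [leaf('a', -10), leaf('a', -10)] },
    { name: 'article', ownScore: 5, children: [leaf('p', 20), leaf('p', 30)] },
  ],
};

scoreTree(page);
console.log(findBestContainer(page).name); // "article" (score 30 vs 7.5 for body)
```

Merging nearby branches and stripping the rest would follow as separate steps once the best container is known.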

jfroelich commented 10 years ago

Goldminer authors also cite the "Boilerplate Detection using Shallow Text Features" paper and reassert that it is pretty good but has some problems. http://polibits.gelbukh.com/2013_48/More%20Effective%20Boilerplate%20Removal%20-%20the%20GoldMiner%20Algorithm.pdf. Look at other algorithms. For example, Body-Text-Extraction (BTE) assumes that the relevant part of the text is contiguous and that the density of tags in content is lower than in boilerplate, but it fails on intervening content or when the tag/text ratio is high. In the jusText approach, after initial classification, ‘almost good’ and ‘too short’ units surrounded by ‘good’ ones are reclassified as ‘good’. In onion, remove duplicate content sections per domain and allow only one instance of each piece of content, where content is considered similar according to a basic string similarity technique. In goldminer, develop a template per domain and remove repeated template components across pages. It beats onion because it leaves text more coherent.
NCleaner (http://www.lrec-conf.org/proceedings/lrec2008/pdf/885_paper.pdf). Runs in stages. First basic HTML preprocessing, strip scripts and such. Convert BR to P. Next, convert html to text (basically document.body.textContent). Next, cleanup the text (remove chars, pipes). Next, character-level ngram analysis.

justext: https://justext.googlecode.com/svn/trunk/justext/core.py
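A simplified sketch of the jusText-style context pass mentioned above: undecided units ('short' or 'neargood') become 'good' only when the nearest decided unit on each side is 'good'. This is not the actual jusText rule set, which is more nuanced; the label names and the choice to treat document edges as 'bad' are assumptions.

```javascript
// labels: array of 'good' | 'bad' | 'short' | 'neargood', in document order.
function reclassify(labels) {
  const decided = (label) => label === 'good' || label === 'bad';

  // Walk from `start` in direction `step` until a decided label is found.
  const nearest = (start, step) => {
    for (let i = start; i >= 0 && i < labels.length; i += step) {
      if (decided(labels[i])) return labels[i];
    }
    return 'bad'; // assumption: running off either edge counts as boilerplate
  };

  // Decisions read the original labels, so order of evaluation does not matter.
  return labels.map((label, i) => {
    if (decided(label)) return label;
    const surroundedByGood =
      nearest(i - 1, -1) === 'good' && nearest(i + 1, 1) === 'good';
    return surroundedByGood ? 'good' : 'bad';
  });
}

console.log(reclassify(['bad', 'good', 'short', 'neargood', 'good', 'short', 'bad']));
// [ 'bad', 'good', 'good', 'good', 'good', 'bad', 'bad' ]
```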

jfroelich commented 10 years ago

Removing Boilerplate and Duplicate Content from Web Corpora. http://is.muni.cz/th/45523/fi_d/phdthesis.pdf.

"boilerplate is usually defined rather vaguely as non-informative parts outside of the main content of a Web page, typically machine generated and repeated across the Web pages of the same website."
"Yi et al. [63] introduced a data structure called style tree. An HTML page can be represented with a DOM tree – a tree structure which models the parent-child relations between HTML tags and where the text is contained in leave nodes called text nodes. Fig. 2.2 shows an example of a DOM tree. A style tree is an extension of a DOM tree and it represents one or more merged DOM trees. In a style tree, the nodes with the same parent (sibling nodes) are grouped into style nodes. The equivalent style nodes (equivalent sequences of HTML tags) at the same positions in the original trees are merged and their count is remembered in the merged style node. The subtrees of the style tree containing style nodes with a high count indicate frequently repeated HTML markup styles which are typical for boilerplate content." (NOTE: highly similar to my current blocks concept).
2.4.2. In the segmentation step, the input HTML page is split into semantically coherent blocks. In the next steps, each block is classified as main content (clean text) or boilerplate. Identical to my approach and the shallow texts paper.
VIPS - identify visually coherent rectangular blocks in the visual form of a web page by splitting the original DOM in a top-down fashion until sufficiently coherent subtrees are formed. The level of coherence is determined by using features such as contained tags (blocks containing certain tags, such as hr, are split), background colour (only a single background colour is allowed in a block) or size (too large blocks get split). Gao and Abou-Assaleh [24]
"Bauer [9] accept a DOM node (and its subtree) as a segment if at least 10 % of its textual content is contained in its immediate children text nodes."
"Some algorithms do not fit well into the segment-classify concept. The BTE algorithm [22], for instance, finds a continuous part of a Web page which maximises its objective function. A similar approach is employed by Pasternack and Roth [50]."
Classification metrics
html metrics. tag density: the number of HTML tags divided by the number of words/tokens. link density: the proportion of tokens inside anchor tags. occurrence of certain tags and their count, parent tag type
textual metrics: nwords, nchars, nsentences, avg sent length, use of capitals, use of articles, avg word length
visual features: size and shape of block, distance from edge, visibility
BTE: The idea is that the main body contains only little formatting and is therefore sparse in terms of HTML tags. The navigation links, advertisements and alike, on the other hand, contain a lot of tags.
Victor, which won CleanEval: uses classes header/paragraph/list/continuation/other. "The continuation class indicates the continuation of the previous content block." http://ufal.mff.cuni.cz/~pecina/files/wac-2008.pdf. "Victor: the Web-Page Cleaning Tool" (return to this).
Chakrabarti et al. [15] build a DOM tree. The scores are stored in the roots of the subtrees. Then a technique called isotonic smoothing is applied to adjust the scores in a way so that the score of each node is at least the score of its parent and at the same time the adjusted scores do not differ too much from the original values. Finally, the subtrees with the score below certain threshold are kept as main content.
Also, blocks containing a copyright symbol (©) are marked as bad
The goal of the context-sensitive part of the algorithm is to re-classify the short and near-good blocks either as good or bad based on the classes of the surrounding blocks.
Random side note: if we use nodeIterator on the input tree, it is an easy way to get the leaves and start from the bottom. We don't need to search.
Also, before we iterate over texts, we should split blocks containing rules (BR or HR) into 2 blocks.
The annotations in the Canola corpus are done at the level of DOM tree text nodes. Yes this is basically what I am thinking, of marking up the input DOM with content scores.
"Removing boilerplate may leave gaps in the cleaned texts. Consequently, it may be difficult to understand some parts of a cleaned text as the context information is missing. This may be especially problematic if the cleaning algorithm generates false negatives, i.e. it cleans good text as boilerplate." "This problem can be overcome to a high extent by using a more coarse-grained segmentation, e.g. by splitting the input only at block level HTML tags rather than taking each text node as a separate unit."
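The note above about splitting blocks at rules (BR or HR) can be sketched on a flattened list of a block's children. The input shape, an array of text strings and `{ tag }` objects, is an assumption chosen so the example runs without a DOM; it is not how calamine represents blocks.

```javascript
const RULE_TAGS = new Set(['BR', 'HR']);

// items: a block's children in document order, either text strings or
// objects such as { tag: 'BR' } or { tag: 'A', text: '...' } (hypothetical shape).
function splitAtRules(items) {
  const blocks = [];
  let current = [];
  for (const item of items) {
    const isRule =
      typeof item === 'object' && item !== null && RULE_TAGS.has(item.tag);
    if (isRule) {
      // Close the running block; consecutive rules do not create empty blocks.
      if (current.length > 0) blocks.push(current);
      current = [];
    } else {
      current.push(item);
    }
  }
  if (current.length > 0) blocks.push(current);
  return blocks;
}

const parts = splitAtRules([
  'First line', { tag: 'BR' }, { tag: 'BR' }, 'Second line', { tag: 'A', text: 'more' },
]);
console.log(parts.length); // 2
```

Splitting only at rules and block-level tags keeps the segmentation coarse, which is the remedy the thesis suggests for gaps in the cleaned text.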

jfroelich commented 10 years ago

calamine.findBody still needs tweaking. Listing some random thoughts:

First, it sometimes misses leading or trailing sections.
Second, the problem with choosing the highest block is that we don't look at the right scope. We need to first try to group the blocks further. Right now we just go up one single level. That is not enough. What we need to do is go up further where necessary, up to sectional parents. Identify general 'sections'. Score those sections and then pick the best section. The immediate common ancestor of the blocks is not the right scope.
Or, we ignore sectional layout and just pass over the blocks and eliminate some number of low scoring blocks, somehow.
Or maybe we propagate certain top-down scores along each block's path. For example, using HTML5 semantic elements like nav/article/footer/header, we can penalize all blocks under them.
One thing I noticed is that sometimes the containing parent has an incredibly negative score but has one block in it that got some absurdly high score due to the metrics, and that one block still ends up as the best block. Basically we have to make the parents re-propagate negative scores back to the children, because parents take into account other things like scores of siblings better.
Suppose we ignore the axis/path flattening technique. Just keep the hierarchy for this thought. We propagate scores bottom up. Then we propagate scores back down. Then we find the section with the highest score.
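A sketch of that up-then-down idea, under stated assumptions: the node shape `{ tag, ownScore, children }`, the `TAG_BIAS` values, and the `downWeight` factor are all made up for illustration. The downward pass lets a very negative parent drag down a single child that scored absurdly high on its own.

```javascript
// Assumed top-down penalties for semantic elements; the values are placeholders.
const TAG_BIAS = new Map([['NAV', -50], ['HEADER', -50], ['FOOTER', -50]]);

// Pass 1, bottom up: each node's `up` score is its own score plus its children's.
function propagateUp(node) {
  node.up = node.ownScore;
  for (const child of node.children) node.up += propagateUp(child);
  return node.up;
}

// Pass 2, top down: push tag penalties and a share of negative parent scores
// onto the descendants.
function propagateDown(node, inherited = 0, downWeight = 0.5) {
  const carried = inherited + (TAG_BIAS.get(node.tag) || 0);
  node.final = node.up + carried;
  const forChildren = carried + downWeight * Math.min(0, node.up);
  for (const child of node.children) propagateDown(child, forChildren, downWeight);
}

// Pass 3: the node with the highest final score.
function findBest(node, best = null) {
  if (best === null || node.final > best.final) best = node;
  for (const child of node.children) best = findBest(child, best);
  return best;
}

const n = (tag, ownScore, children = []) => ({ tag, ownScore, children });
const doc = n('BODY', 0, [
  n('DIV', -50, [n('A', -20), n('A', -20), n('P', 30)]), // junk parent, one lucky block
  n('ARTICLE', 5, [n('P', 12), n('P', 10)]),
]);

propagateUp(doc);
propagateDown(doc);
console.log(findBest(doc).tag); // "ARTICLE" (10.5; the lucky P drops to -16.5)
```

With only the upward pass, the lone P with a score of 30 would have won; after the downward pass its final score reflects how bad its parent and siblings are.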

jfroelich commented 10 years ago

Page-level Template Detection via Isotonic Smoothing. http://www2007.org/papers/paper588.pdf
"[C]lassifying each DOM node in isolation ... does not take a global view of the templateness of nodes in the DOM tree." ... "We assert that templateness is a monotone property: a node in the DOM tree is a template if and only if all its children are templates. An appropriate relaxation of this property leads to the following regularized isotonic regression problem: given a tree with classifier scores at each node, find smoothed scores that are not far from the classifier scores, but satisfy the relaxed monotonicity property". "[E]nsure that the templateness score of a node is at most the least of its children’s scores, instead of equal to it ... [because] the cost of misclassifying a non-template as a template is much higher than vice versa." "[One example of a heuristic penalty is a penalty that] is high for nodes near the leaves and low for nodes near the root..."
Extracting Informative Textual Parts from Web Pages Containing User-Generated Content. http://www.icsd.aegean.gr/lecturers/stamatatos/papers/I-KNOW2012.pdf. "The method makes use of a combination of non-visual and visual characteristics of a web page in order to achieve the page segmentation and the filtering of noisy areas." Mentions strong/weak (block/inline). "Distance from max density region ... calculated by dfm(r) = 100 - (d_r * 100) / d_max, where d_r is the density of the examined region r and d_max is the density of the region with the max density. A high value of dfm for a specific region r means that we have to deal with a small region in the document, in contrast, small values of dfm represent bigger regions. Distance from root - the number of parents of a node until the root node. Ancestor title, ancestor title level, The cardinality of a node is simply the number of elements that a node contains, i.e. the number of child nodes.
Two thresholds are used, the max density region distance threshold T1 and the min region density threshold T2. The T1 threshold, defines the max allowed distance from max density region, so for each node the threshold is compared against the distance from max density region (dfm). The T2 threshold, defines the min allowed density for a node in order not to be treated as noise."
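A small sketch of the distance-from-max-density measure and the two thresholds quoted above, reading the formula as dfm(r) = 100 - (d_r * 100) / d_max. The region shape `{ name, density }`, the threshold values, and the decision to require both conditions at once are assumptions; the paper may combine them differently.

```javascript
// dfm is 0 for the densest region and approaches 100 for very sparse ones.
function distanceFromMax(density, maxDensity) {
  return 100 - (density * 100) / maxDensity;
}

// Keep regions that are dense enough (T2) and close enough to the densest
// region (T1). Both must hold here, which is an assumption.
function filterRegions(regions, t1, t2) {
  if (regions.length === 0) return [];
  const maxDensity = Math.max(...regions.map((r) => r.density));
  if (maxDensity <= 0) return [];
  return regions.filter(
    (r) => r.density >= t2 && distanceFromMax(r.density, maxDensity) <= t1
  );
}

const regions = [
  { name: 'article', density: 40 },
  { name: 'comments', density: 30 },
  { name: 'sidebar', density: 4 },
];

console.log(distanceFromMax(30, 40)); // 25
// T1 = 50 and T2 = 5 are placeholder values, not taken from the paper.
console.log(filterRegions(regions, 50, 5).map((r) => r.name)); // [ 'article', 'comments' ]
```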

jfroelich commented 10 years ago

What if I do a second type of 'is inline' pass where I focus on 'is segment'? In this case map LI to UL or OL, map P to container, map TD to TR, and so forth. This gives us more semantic elements to map toward.

We can also use the new HTML 5 elements. E.g. if article is in parent path, roll up into article, if nav is in parent path, roll up into nav.
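A sketch of the 'is segment' roll-up, working on a block's path of tag names from the root down to the block itself. The mapping table, the precedence (a semantic ancestor wins over the LI/TD mapping), the choice of the nearest semantic ancestor, and the fallback to the immediate parent are all assumptions.

```javascript
// Child tag -> tags of the ancestor it should roll up into (assumed table).
const SEGMENT_PARENT = new Map([
  ['LI', ['UL', 'OL']],
  ['TD', ['TR']],
  ['TH', ['TR']],
]);
const SEMANTIC = new Set(['ARTICLE', 'NAV']);

// path: non-empty array of upper-case tag names, root first, the block last.
// Returns the index in `path` of the element the block rolls up into.
function segmentIndex(path) {
  const last = path.length - 1;
  // 1. Nearest HTML5 semantic ancestor, if any.
  for (let i = last - 1; i >= 0; i--) {
    if (SEMANTIC.has(path[i])) return i;
  }
  // 2. Tag-specific mapping such as LI -> UL/OL or TD -> TR.
  const targets = SEGMENT_PARENT.get(path[last]);
  if (targets) {
    for (let i = last - 1; i >= 0; i--) {
      if (targets.includes(path[i])) return i;
    }
  }
  // 3. Otherwise the immediate container (e.g. P -> its parent).
  return Math.max(last - 1, 0);
}

console.log(segmentIndex(['HTML', 'BODY', 'DIV', 'UL', 'LI'])); // 3 (the UL)
console.log(segmentIndex(['HTML', 'BODY', 'ARTICLE', 'DIV', 'P'])); // 2 (the ARTICLE)
```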

jfroelich commented 10 years ago

Aside from a few bugs and minor todos, (which should be created as separate issues), this is largely implemented in the calamine module.

jfroelich / rss-reader

Implement boilerplate scrubbing #190