Notes on Boilerplate Detection using Shallow Text Features:
Actually, we should generate the DOM here. Grouping should essentially involve building a new DOM from the blocks. We then score each DOM node, and apply scores to container-type DOM nodes (div/ul/ol/p/table/tr).
Then we find the best container. So we are not trying to find bounds of blocks in the list, but the best container in the hierarchy.
Each container's score should reflect its block scores, in aggregate.
We should allow for negative block scores in the scoring section so that blocks can negatively influence the container score.
We don't need to filter low scoring blocks when rebuilding the hierarchy. The filtering is a natural side effect of choosing the best container. It also avoids some of the need to reinsert intervening boilerplate.
We still probably need to extend each container to its siblings at the same depth, and recalculate the container's score after merging siblings. Not 100% sure what the criterion for merging is; it could be whether the merged score rises above or falls below some threshold.
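A minimal sketch of the container-selection and sibling-merge idea, assuming each candidate element already carries an aggregate `score` property attached by earlier scoring (the merge rule shown, keep absorbing siblings while the running total stays above a threshold, is just one candidate criterion):

```js
// Sketch: pick the best-scoring container, then try folding in siblings.
// Assumes each candidate element has a numeric `score` expando property
// holding the aggregate of its blocks' scores (which may be negative).
function findBestContainer(candidates) {
  let best = null;
  for (const node of candidates) {
    if (!best || node.score > best.score) {
      best = node;
    }
  }
  return best;
}

// Tentative merge rule: absorb adjacent siblings at the same depth only
// while the combined score stays at or above the threshold.
function mergeSiblings(best, threshold) {
  const merged = [best];
  let total = best.score;
  let sibling = best.nextElementSibling;
  while (sibling && typeof sibling.score === 'number') {
    if (total + sibling.score >= threshold) {
      merged.push(sibling);
      total += sibling.score;
      sibling = sibling.nextElementSibling;
    } else {
      break;
    }
  }
  return { merged, total };
}
```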
Note: we should refactor the biases to allow for negative scoring; we want signed score magnitudes. For example, bad link density should not merely fail to increase the score, it should actively decrease it.
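As a small illustration of a signed bias, a link-density penalty might look like this (the cutoffs and weights are placeholders, not values from the module):

```js
// Sketch: a bias that can push the score below zero.
// linkCharCount / charCount is the usual link-density ratio.
function linkDensityBias(charCount, linkCharCount) {
  if (charCount === 0) return 0;
  const density = linkCharCount / charCount;
  // High link density actively subtracts from the score instead of
  // merely contributing nothing.
  if (density > 0.5) return -20 * density;
  if (density < 0.05) return 5;
  return 0;
}
```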
Previous note: the extend method should consider hierarchical position, not simply block index distance. Maybe what we want is something analogous to agglomerative clustering. What we are doing is defining our own distance function where two blocks are similar if they share approximately the same depth. The path between two blocks is based on the number of traversals up the hierarchy and then back down from their common ancestor. Paths are analogous to axes in XPath. Maybe we can also think of them as vectors into the hierarchical structure?
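One hypothetical distance function along those lines: count the hops up to the lowest common ancestor plus the hops back down, so blocks that share a nearby ancestor come out "close":

```js
// Sketch: hierarchical distance between two nodes, measured as the number
// of parent traversals up to the lowest common ancestor plus the number
// of steps back down to the other node.
function hierarchicalDistance(a, b) {
  const ancestorsOfA = new Map();
  for (let node = a, depth = 0; node; node = node.parentNode, depth++) {
    ancestorsOfA.set(node, depth);
  }
  for (let node = b, depth = 0; node; node = node.parentNode, depth++) {
    if (ancestorsOfA.has(node)) {
      // depth = steps from b up to the common ancestor,
      // ancestorsOfA.get(node) = steps from a up to it.
      return depth + ancestorsOfA.get(node);
    }
  }
  return Infinity; // nodes in different trees
}
```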
Then there is the question of how to rebuild the hierarchy. What we could do is use the original input hierarchy and scrub it in place, augmenting it with content scores and filtering out low scores. That requires a rewrite of this entire module because it is fundamentally different; it is also top down. One of the negative factors is that the original hierarchy is under the control of the author, not calamine, which means we get a junk hierarchy full of useless containers: for example, div wrappers for CSS and other layout techniques, where the container exists purely for layout and not for designating main content. Furthermore, we don't even want to consider all of the nodes in the original hierarchy; there are certain nodes we can plainly ignore (meta, style, head, etc.). We also have to scrub the original attributes except for the ones we want.
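If we did scrub the original hierarchy in place, the skip list and attribute whitelist might look roughly like this (the element and attribute sets are illustrative, not the module's actual lists):

```js
// Sketch: elements we would never consider as content containers, and the
// only attributes we would keep when scrubbing the author's markup.
const IGNORED_ELEMENTS = new Set(['HEAD', 'META', 'STYLE', 'SCRIPT', 'LINK', 'NOSCRIPT']);
const RETAINED_ATTRIBUTES = new Set(['href', 'src', 'alt', 'title']);

function scrubElement(element) {
  if (IGNORED_ELEMENTS.has(element.nodeName)) {
    element.remove();
    return;
  }
  // getAttributeNames returns a static array, so it is safe to remove
  // attributes while iterating over it.
  for (const name of element.getAttributeNames()) {
    if (!RETAINED_ATTRIBUTES.has(name)) {
      element.removeAttribute(name);
    }
  }
}
```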
Another difficulty with the original hierarchy is that we need to be able to modify it in place. I solved this in sanitize, which does in-place mutation during iteration by looking ahead to the next node to visit based on the mutation operation (keep node, remove node, unwrap node, etc.). I am not sure I ever solved splitting, which I already know I need, because I want a BR element to result in two blocks.
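The look-ahead pattern, sketched in isolation: pick the next node before mutating the current one, so removal or unwrapping does not break the walk (the `classify` callback and its action names are placeholders, not the sanitize module's actual API):

```js
// Helper: next node in document order under root, optionally skipping the
// current node's children (used when the node is about to be removed).
function nextNode(node, root, skipChildren) {
  if (!skipChildren && node.firstChild) return node.firstChild;
  while (node && node !== root) {
    if (node.nextSibling) return node.nextSibling;
    node = node.parentNode;
  }
  return null;
}

// Sketch: in-place mutation during a DOM walk. The next node is chosen
// before the current node is removed or unwrapped.
function walkAndMutate(root, classify) {
  let node = root.firstChild;
  while (node) {
    const action = classify(node); // 'keep' | 'remove' | 'unwrap'
    let next;
    if (action === 'remove') {
      // Do not descend into a node that is about to go away.
      next = nextNode(node, root, true);
      node.remove();
    } else if (action === 'unwrap') {
      // The children stay in the document, so visit them next.
      next = node.firstChild || nextNode(node, root, true);
      while (node.firstChild) {
        node.parentNode.insertBefore(node.firstChild, node);
      }
      node.remove();
    } else {
      next = nextNode(node, root, false);
    }
    node = next;
  }
}
```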
One other difficulty is the liveness of the input HTMLDocument object. We need to be able to operate under the assumption that the original document no longer matters, because we are going to mutate it. This works if the doc is created via document.implementation, but I am not sure how it would work otherwise. Note that liveness is tempting because it partially solves the image-size issue, which is deferred until the document is attached and the images are loaded in the absence of explicit attributes. But on the other hand, if liveness is an issue, the main method of the API, transformDocument, needs to specify that the input document should not be live, and show a simple way of creating a non-live clone of a live HTMLDocument object.
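For that documentation, a non-live clone could be built with document.implementation along these lines (a sketch, not the module's actual helper):

```js
// Sketch: produce a detached, never-rendered copy of a document so it can
// be mutated freely without touching the displayed page.
function createInertCopy(liveDocument) {
  const copy = liveDocument.implementation.createHTMLDocument('');
  // importNode(..., true) deep-copies the tree into the inert document.
  const imported = copy.importNode(liveDocument.documentElement, true);
  copy.replaceChild(imported, copy.documentElement);
  return copy;
}
```

Note that images in such an inert document are never fetched, which is exactly the image-size tradeoff mentioned above.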
I think another factor is the difficulty and processing time required to build our own hierarchy. Finding common ancestors, measuring depth, and so forth seems like it is going to take a lot of time. Our generateBlocks method basically unfairly deprives us of the original hierarchical position of blocks, and now we have to use heuristics and wasted processing just to identify it again.
When scoring each level of the hierarchy, consider both the text nodes directly within a container and the text within its sub-nodes. Basically each container has two scores: a score based on its own text and a score based on its child scores. We can merge these together; we just need to make sure the child score sum does not overwrite or ignore the container's own score. Second, the score contributed by the children should affect the container's score differently than the score from the container's own text.
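One way to express that weighting, with an arbitrary child weight purely for illustration:

```js
// Sketch: a container's final score blends the score of its own direct
// text with the sum of its children's scores, weighted so that child
// contributions do not simply overwrite the container's own evidence.
const CHILD_WEIGHT = 0.5; // placeholder weight, not a tuned value

function scoreContainer(ownTextScore, childScores) {
  const childSum = childScores.reduce((sum, s) => sum + s, 0);
  return ownTextScore + CHILD_WEIGHT * childSum;
}
```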
Under the DOM method, we would want to preprocess for testing purposes, then score from the bottom up. We find the leaves and move upward once each leaf in a branch has been scored. We then accumulate the leaf scores into the branch, taking into consideration the branch's own score. Then we hunt for the best branch based on its score. Then we consider nearby branches and possibly merge them, basically building an in-order list of the branches we consider the 'main' content. Then we strip out all other branches, while still retaining the branches that intervene within the main section. Then maybe we do some cleanup, and we are done.
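The bottom-up pass might look roughly like this post-order walk, where `scoreOwnText` is a hypothetical per-element text scorer and the child weighting follows the previous sketch:

```js
// Sketch: post-order traversal so every child is scored before its parent;
// the aggregate is stored on the element for the later best-branch hunt.
const CHILD_WEIGHT = 0.5; // placeholder, same idea as the previous sketch

function scoreBottomUp(element, scoreOwnText) {
  let childSum = 0;
  for (const child of element.children) {
    childSum += scoreBottomUp(child, scoreOwnText);
  }
  const score = scoreOwnText(element) + CHILD_WEIGHT * childSum;
  element.score = score; // expando property read by the best-branch hunt
  return score;
}
```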
justext: https://justext.googlecode.com/svn/trunk/justext/core.py
Removing Boilerplate and Duplicate Content from Web Corpora. http://is.muni.cz/th/45523/fi_d/phdthesis.pdf.
calamine.findBody still needs tweaking. Listing some random thoughts:
What if I do a second type of 'is inline' pass where I focus on 'is segment'? In this case, map LI to UL or OL, map P to its container, map TD to TR, and so forth. This gives us more semantic elements to map toward.
We can also use the new HTML5 elements, e.g. if article is in the parent path, roll up into article; if nav is in the parent path, roll up into nav.
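A rough sketch of that 'is segment' mapping, combined with the HTML5 roll-up idea (the element lists are illustrative only):

```js
// Sketch: map an element to the nearest ancestor that acts as a semantic
// segment, e.g. LI -> UL/OL, TD/TH -> TR, and anything inside ARTICLE or
// NAV rolls up to that element.
const SEGMENT_PARENTS = {
  LI: new Set(['UL', 'OL']),
  TD: new Set(['TR']),
  TH: new Set(['TR'])
};
const SEMANTIC_SEGMENTS = new Set(['ARTICLE', 'NAV', 'ASIDE', 'SECTION']);

function findSegment(element) {
  for (let node = element.parentElement; node; node = node.parentElement) {
    if (SEMANTIC_SEGMENTS.has(node.nodeName)) return node;
    const wanted = SEGMENT_PARENTS[element.nodeName];
    if (wanted && wanted.has(node.nodeName)) return node;
  }
  return element; // no better segment found; the element is its own segment
}
```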
Aside from a few bugs and minor todos (which should be created as separate issues), this is largely implemented in the calamine module.
Distinguish between useful content and boilerplate.
Doing some prelim research: