similar_region previously used only the prefix/suffix token sequence (i.e. the tag names, e.g. ul, div, p) to calculate a score, but these tokens are sometimes not discriminative enough. Modern HTML, on the other hand, tends to use the same attributes for similar elements, so this change improves similar_region by taking that information into account as well:
- The prefix sequences are compared first to find the candidate start position.
- The class attributes are compared only when the prefix matches.
Other information could be used too, e.g. the data fragment of the HTML tag, which could help to match the label text content.
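The prefix-then-class comparison described above can be sketched roughly as follows. This is only an illustration of one reading of the idea, not the code in this change: `common_prefix_length`, `prefix_score`, and the list-based inputs are hypothetical names and shapes, and the scoring weights (one point per matching tag, one extra point per matching class) are assumptions.

```python
def common_prefix_length(a, b):
    """Count the leading positions at which the two sequences agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_score(template_tags, template_classes, page_tags, page_classes):
    """Hypothetical score: tag names first, class attributes as a bonus.

    The *_tags lists hold tag names ordered from the tag nearest the region
    outwards; the *_classes lists hold the class attribute (or None) of the
    tag at the same position.
    """
    matched = common_prefix_length(template_tags, page_tags)
    if matched == 0:
        # No tag-level match, so class attributes are never consulted.
        return 0
    # Classes are compared only at positions whose tag names already matched.
    class_bonus = sum(
        1
        for t_cls, p_cls in zip(template_classes[:matched], page_classes[:matched])
        if t_cls is not None and t_cls == p_cls
    )
    return matched + class_bonus
```

With this sketch, two candidates that share the same tag sequence are no longer tied: the one whose class attributes also agree with the template gets the higher score.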
Another improvement is to penalize suffix matches that are far away from the prefix match. Currently a hardcoded value is used, but it works well on both the nosetests and the as2 regression tests. Some thoughts on further improvement:
- Match the suffix in the same way as the prefix, using class attributes. This needs a bigger change, though, since close tags currently don't carry attributes.
- Penalize by the tree distance to prefix_index (e.g. the number of common ancestors). However, scrapely has its own HTML parsing, and there is no tree structure for the parsed tags.
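As a rough illustration of the distance penalty, the sketch below subtracts a fixed amount per token of distance from prefix_index before picking the best suffix candidate. The function name `best_suffix_index`, the `(index, raw_score)` candidate format, and the constant `SUFFIX_DISTANCE_PENALTY` are all hypothetical; in particular, the constant is not the hardcoded value used in this change.

```python
# Hypothetical penalty per token of distance; NOT the value used in the change.
SUFFIX_DISTANCE_PENALTY = 0.01


def best_suffix_index(candidates, prefix_index, penalty=SUFFIX_DISTANCE_PENALTY):
    """Pick the suffix position with the best distance-adjusted score.

    candidates is an iterable of (index, raw_score) pairs. Positions before
    prefix_index are skipped. Returns None if no candidate qualifies.
    """
    best_index = None
    best_score = float("-inf")
    for index, raw_score in candidates:
        if index < prefix_index:
            continue
        # The further the suffix is from the prefix, the more it is penalized.
        adjusted = raw_score - penalty * (index - prefix_index)
        if adjusted > best_score:
            best_index, best_score = index, adjusted
    return best_index
```

A linear penalty is only one possible choice; it is used here because it needs nothing but token positions, which fits the fact that no tree structure is available for the parsed tags.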
Performance evaluation on as2 regression tests:
Before:
# test cases in total 914
# average precision: 0.978571428571
# average accuracy: 0.916947368421
# average recall: 0.970987218045
After:
# test cases in total 914
# average precision: 0.978571428571
# average accuracy: 0.900947368421
# average recall: 0.965553884712
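There are no significant drawbacks: precision is unchanged, while accuracy and recall drop slightly (from about 0.917 to 0.901 and from about 0.971 to 0.966, respectively).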