scrapy / scrapely

A pure-python HTML screen-scraping library

Allow using HTML tag attributes in similar_region #48

Closed tpeng closed 10 years ago

tpeng commented 10 years ago

similar_region only used the prefix/suffix token sequence (i.e. the tag names, e.g. ul, div, p) to calculate a score, but sometimes these tokens are not discriminative enough. On the other hand, HTML nowadays tends to use the same attributes for similar elements, so this change improves similar_region by taking that information into account too.

Other information could be used too, e.g. the HTML tag data fragment, which could help match the label text content.
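To illustrate the idea of letting attributes contribute to the match score, here is a minimal sketch. It is not the code from this PR: `Token`, `token_score` and `prefix_score` are hypothetical names, and the "1 plus attribute overlap" weighting is an assumption chosen only for illustration.

```python
from typing import NamedTuple, Sequence


class Token(NamedTuple):
    """Hypothetical token: a tag name plus a set of (name, value) attribute pairs."""
    tag: str
    attrs: frozenset = frozenset()


def token_score(a: Token, b: Token) -> float:
    """Return 0 if the tag names differ, else 1 plus a bonus for shared attributes."""
    if a.tag != b.tag:
        return 0.0
    union = a.attrs | b.attrs
    if not union:
        # Neither token has attributes: fall back to the tag-only score.
        return 1.0
    # Assumed weighting: Jaccard overlap of the attribute sets, in [0, 1].
    return 1.0 + len(a.attrs & b.attrs) / len(union)


def prefix_score(template: Sequence[Token], page: Sequence[Token]) -> float:
    """Sum token scores over the common leading run of two token sequences."""
    total = 0.0
    for a, b in zip(template, page):
        score = token_score(a, b)
        if score == 0.0:
            break  # the common run ends at the first tag mismatch
        total += score
    return total


template = [Token("ul", frozenset({("class", "menu")})), Token("li")]
same_attrs = [Token("ul", frozenset({("class", "menu")})), Token("li")]
other_attrs = [Token("ul", frozenset({("class", "footer")})), Token("li")]
```

With tag names alone, both candidates would tie. With the attribute bonus, `prefix_score(template, same_attrs)` is higher than `prefix_score(template, other_attrs)`, which is the kind of extra discrimination described above.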

Another improvement is to penalize suffix matches that are far away. Currently a hardcoded value is used, but it works well on both nosetests and the as2 regression tests. Some thoughts on further improvement:

Performance evaluation on as2 regression tests:

before

# test cases in total 914
# average precision: 0.978571428571
# average accuracy: 0.916947368421
# average recall: 0.970987218045

after

# test cases in total 914
# average precision: 0.978571428571
# average accuracy: 0.900947368421
# average recall: 0.965553884712

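There are no significant drawbacks: precision is unchanged, while accuracy and recall drop slightly (by about 0.016 and 0.005 respectively).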
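For reference, the penalty on far-away suffix matches mentioned above could look roughly like the sketch below. The function names and the linear form are assumptions, and the `penalty` value of 0.1 is purely illustrative; it is not the hardcoded value used in this PR.

```python
def penalized_suffix_score(raw_score, prefix_end, suffix_start, penalty=0.1):
    """Reduce a suffix match score in proportion to its distance from the prefix.

    `penalty` is a hypothetical per-token cost, not the value used in the PR.
    """
    distance = max(0, suffix_start - prefix_end)
    return raw_score - penalty * distance


def best_suffix(candidates, prefix_end, penalty=0.1):
    """Pick the (suffix_start, raw_score) pair with the highest penalized score."""
    return max(
        candidates,
        key=lambda c: penalized_suffix_score(c[1], prefix_end, c[0], penalty),
    )


# Two candidate suffix matches: a nearby one and a slightly stronger distant one.
candidates = [(12, 3.0), (60, 3.5)]
nearby = best_suffix(candidates, prefix_end=10)
unpenalized = best_suffix(candidates, prefix_end=10, penalty=0.0)
```

Here `nearby` is the match starting at token 12, while `unpenalized` is the distant match at token 60, which shows how the penalty changes the chosen region.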

tpeng commented 10 years ago

This will be covered by another PR. Closing it now.