matiskay / html-similarity

Compare html similarity using structural and style metrics
BSD 3-Clause "New" or "Revised" License

Feature request. TED algorithm #75

Closed. Sab0tag3d closed this issue 4 years ago.

Sab0tag3d commented 4 years ago

First of all, thanks for the amazing tool!

In their research, Thamme Gowda and Chris Mattmann use the Zhang-Shasha tree edit distance (TED) algorithm to compare HTML DOM trees. I've found a Python library that implements this algorithm: https://zhang-shasha.readthedocs.io/en/latest/#tree-format-and-usage

I think it could be more accurate than SequenceMatcher in difflib. But I ran into a problem creating the library's nodes from HTML.

Do you have any thoughts or ideas about how to create a node from HTML? Do you think it could be helpful for the comparison?
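To illustrate the accuracy concern, here is a minimal sketch of a sequence-based comparison. It assumes the structural metric flattens each document into its list of opening tags before calling SequenceMatcher; that is an assumption for illustration and not necessarily what html-similarity does internally. The names `TagCollector`, `tag_sequence` and `sequence_similarity` are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Flatten an HTML string into its list of opening tags."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def sequence_similarity(html_a, html_b):
    """Similarity ratio of the two flattened tag sequences."""
    return SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()


# Same tags in the same order, but different nesting.
nested = "<div><p><span>x</span></p></div>"
flat = "<div></div><p></p><span>x</span>"
print(sequence_similarity(nested, flat))
```

Under this flattening both documents reduce to the same tag list, so the nesting difference is invisible to the ratio. A tree edit distance works on the parent/child structure and can tell the two apart.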

matiskay commented 4 years ago

Hi @Sab0tag3d, you can use https://github.com/scrapy/parsel to parse the HTML, then use its methods to get the information you need.
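As a rough sketch of the node-building step, the code below turns an HTML string into a tree of tag names. To stay dependency-free it uses the standard library's `html.parser` instead of parsel, so it is a swapped-in technique rather than the approach suggested above. `TagNode`, `TreeBuilder`, `html_to_tree`, `to_nested` and `tree_distance` are hypothetical names, text and attributes are ignored, and the `zss` import assumes the Zhang-Shasha package is installed (`pip install zss`).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagNode:
    """Hypothetical minimal tree node: a tag name plus ordered children."""

    def __init__(self, label):
        self.label = label
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a TagNode tree from html.parser start/end tag events."""

    def __init__(self):
        super().__init__()
        self.root = TagNode("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = TagNode(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break


def html_to_tree(html):
    """Parse an HTML string and return the synthetic '#document' root."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def to_nested(node):
    """Render the tree as nested tuples, handy for inspection."""
    return (node.label, [to_nested(child) for child in node.children])


def tree_distance(html_a, html_b):
    """Zhang-Shasha distance between two documents (requires zss)."""
    from zss import simple_distance  # third-party, imported lazily
    return simple_distance(
        html_to_tree(html_a),
        html_to_tree(html_b),
        get_children=lambda node: node.children,
        get_label=lambda node: node.label,
    )
```

Passing `get_children` and `get_label` lets zss walk the custom nodes directly, so there is no need to convert them to `zss.Node`. With parsel, the same tree could presumably be built by recursing over each selector's child elements instead of tracking a stack.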

Sab0tag3d commented 4 years ago

Thanks!