Closed Sab0tag3d closed 4 years ago
First of all, thanks for the amazing tool!
In their research, Thamme Gowda and Chris Mattmann use the Zhang-Shasha tree edit distance (TED) algorithm to compare HTML DOM trees. I found a Python library that implements this algorithm: https://zhang-shasha.readthedocs.io/en/latest/#tree-format-and-usage
I think it could be more accurate than SequenceMatcher from difflib, but I ran into a problem: I don't know how to build the library's node tree from HTML.
Do you have any thoughts or ideas about how to create the nodes from HTML? Do you think this approach could be helpful for the comparison?
Hi @Sab0tag3d, you can parse the HTML with https://github.com/scrapy/parsel and then use its methods to get the information you need.
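As a rough illustration of turning HTML into a labeled tree, here is a minimal sketch. It swaps in the standard library's `html.parser` instead of parsel so that it runs without extra dependencies, and it keeps only element names (no text or attributes). `TagNode`, `TreeBuilder`, `html_to_tree`, and `to_tuple` are hypothetical names made up for this example; they are not part of parsel or the zhang-shasha library.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagNode:
    """Hypothetical minimal tree node: a tag name plus ordered children."""

    def __init__(self, label):
        self.label = label
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a TagNode tree of element names, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = TagNode("#document")  # synthetic root (an assumption)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = TagNode(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break


def html_to_tree(html):
    """Parse an HTML string and return the synthetic root TagNode."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def to_tuple(node):
    """Nested (label, children) view of the tree, handy for inspection."""
    return (node.label, [to_tuple(child) for child in node.children])
```

If I read the zhang-shasha docs correctly, `simple_distance` accepts custom `get_children` and `get_label` callables, so two trees built this way could probably be compared with something like `simple_distance(a, b, get_children=lambda n: n.children, get_label=lambda n: n.label)`. Please check that signature against the library's documentation before relying on it.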
Thanks!