trailofbits / graphtage

A semantic diff utility and library for tree-like files such as JSON, JSON5, XML, HTML, YAML, and CSV.
GNU Lesser General Public License v3.0

Text missing from HTML diff #80

Open kwyntes opened 10 months ago

kwyntes commented 10 months ago

When using the following test HTML files as input...

$ cat old.html
<html>
        <body>
                some <div>text and more</div> text
        </body>
</html>

$ cat new.html
<html>
        <body>
                some <div class='red'>text</div> and more <strong>text</strong>
        </body>
</html>

$ graphtage old.html new.html
<html>
        <body>
                some <̟d̟i̟v̟ ̟c̟l̟a̟s̟s̟=̟"̟r̟e̟d̟"̟>̟t̟e̟x̟t̟<̟/̟d̟i̟v̟>̟
        <̟s̟t̟r̟o̟n̟g̟>̟t̟e̟x̟t̟<̟/̟s̟t̟r̟o̟n̟g̟>̟
        <̶d̶i̶v̶>̶t̶e̶x̶t̶ ̶a̶n̶d̶ ̶m̶o̶r̶e̶<̶/̶d̶i̶v̶>̶
    </body>
</html>


..., as you can see, the text "and more" is missing from the diff generated by graphtage.

I've tried some other diff tools, and none of them processed these two files correctly either (I suppose many use the same core algorithm). Is there some kind of general issue with processing text that is not enclosed in a tag of its own? Here, "and more" sits between two elements and is enclosed only by the parent <body> tag.
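
To illustrate what "text not enclosed in tags" looks like to a parser, here is a minimal sketch using Python's standard xml.etree.ElementTree. Whether graphtage's HTML handling follows this same model is an assumption, not something confirmed in this issue. The helper name describe_mixed_content is made up for this example. In the ElementTree model, text that follows an element's closing tag is not a child node of the parent; it is stored on the preceding element as its tail, so a tool that only compares tags, attributes, and .text could plausibly drop it.

```python
import xml.etree.ElementTree as ET

# Contents of new.html from this report (it happens to be well-formed XML).
NEW_HTML = """<html>
        <body>
                some <div class='red'>text</div> and more <strong>text</strong>
        </body>
</html>"""


def describe_mixed_content(markup: str) -> dict:
    """Hypothetical helper: show where ElementTree stores each piece of text."""
    root = ET.fromstring(markup)
    body = root.find("body")
    div = body.find("div")
    return {
        # Text before the first child element belongs to the parent's .text.
        "body_text": body.text.strip(),
        # Text inside <div>...</div> belongs to the div's .text.
        "div_text": div.text,
        # Text after </div> is stored on the div itself, as its .tail.
        "div_tail": div.tail.strip(),
    }


info = describe_mixed_content(NEW_HTML)
print(info)
```

If a diff only walks elements and their .text, the "and more" held in div.tail is never visited. That would match the output above, but it remains a guess until checked against graphtage's source.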

I have also tried wrapping "and more" in a <p> tag in new.html, which resulted in this mess:

$ graphtage old.html new.html
<html>
        <body>
                some <̟d̟i̟v̟ ̟c̟l̟a̟s̟s̟=̟"̟r̟e̟d̟"̟>̟t̟e̟x̟t̟<̟/̟d̟i̟v̟>̟
        <p̟d̶i̶v̶>t̶e̶x̶t̶ ̶and more</p̟d̶i̶v̶>
        <̟s̟t̟r̟o̟n̟g̟>̟t̟e̟x̟t̟<̟/̟s̟t̟r̟o̟n̟g̟>̟
    </body>
</html>


What's happening?