python / cpython

The Python programming language
https://www.python.org

difflib adding an additional html column when making tables using difflib.HtmlDiff() #117943

Open needadiff opened 6 months ago

needadiff commented 6 months ago

Bug report

Bug description:

I am working with the Python library difflib, specifically the class HtmlDiff. For some reason, the make_table method adds a blank column when generating an HTML table, which throws off the difference highlights and defeats the purpose of the diff altogether.

Input python function:

# diff_tool.py

import argparse
import difflib
import sys

from pathlib import Path

def create_diff(output_file: Path = None):

    # short list of strings to compare, nonproblematic (see expected output)
    thing1 = ["this", "that"]
    thing2 = ["thiis", "that2"]

    # longer list of strings to compare, mirroring the problematic data (see actual output)
    file_2 = ['Ab', 
    'my_id: ID1234, attribute: abcd fghijklmn py stuvwx zabc efg ijklmnopy st vwxy abcd f , name: abcd fghijklmn py stuvwx zabc efg ijklmnopy st vwxy abcd f ', 
    'my_id: ID5678, attribute: abcde ghijklm opyrst vwxyz bc efghij lmnopyr tuv xyzab defabcde ghijklmno yrstuvwxyza cdefghijklmnop rstuvw yz b defab defgh jklmnop rstuvwxyz bcd fghijkl nopyr tuvwxyza cdefabcd fghijkl no yrs uvwxyzabcd fghij l n, name: ab defghi klmnopy stuvwxy abcde ghijklm op']

    file_2_new = ['Abcdefghijklmn', 
    'my_id: ID1234, attribute: abcde ghijklm opyrst vwxyz bc efghij lmnopyr tuv xyzab defabcde ghijklmno yrstuvwxyza cdefghijklmnop rstuvw yz b defab defgh jklmnop rstuvwxyz bcd fghijkl nopyr tuvwxyza cdefabcd fghijkl no yrs uvwxyzabcd fghij l n, name: ab defghi klmnopy stuvwxy abcde ghijklm op', 
    'my_id: ID5678, attribute: abcd fghijklmn py stuvwx zabc efg ijklmnopy st vwxy abcd f , name: abcd fghijklmn py stuvwx zabc efg ijklmnopy st vwxy abcd f ']

    # if output HTML file option at command line
    if output_file:

        # create HtmlDiff Object
        my_html = difflib.HtmlDiff()

        # make table with problematic data, delta1 is a string
        delta1 = my_html.make_table(
            file_2, file_2_new, "table1A" , "table1B"
        )     

        # replace nowrap tags and weird space characters
        delta1 = delta1.replace("&nbsp;", " ")
        delta1 = delta1.replace("nowrap=\"nowrap\"", "")

        # html header which goes at beginning of html file
        html_header = '''

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html>
<body style="background-color:white;">
<head>
    <meta http-equiv="Content-Type"
          content="text/html; charset=utf-8" />
    <title></title>
    <style type="text/css">
        table.diff {font-family:Courier; color: black; border:medium;table-layout: auto; width: 96%; word-wrap: break-word; margin-left: auto; margin-right: auto;}
        td.diff_header {text-align:left}
        .diff_header {background-color:#e0e0e0}
        .diff_next {background-color:#c0c0c0}
        .diff_add {background-color:#aaffaa}
        .diff_chg {background-color:#ffff77}
        .diff_sub {background-color:#ffaaaa}
        .widthA{width:1%}
        .widthB{width:7%}
        .widthC{width:40%}
        .element_title{color: black; text-align:center}
        .element_diff{color: black; width: 96%; margin-left: auto; margin-right: auto}
    </style>
</head>

'''
        # html footer which goes at the end of html file
        html_footer = '''
        </body>

</html>
        '''

        # open output file path and write header, table and footer
        with open(output_file, "w") as f:
            f.write(html_header)
            f.write(delta1)
            f.write(html_footer)

def main():

    # parse command line option for output HTML file
    parser = argparse.ArgumentParser()
    parser.add_argument("--html", help="specify html to write to")
    args = parser.parse_args()

    if args.html:
        output_file = Path(args.html)
    else:
        output_file = None

    # call create_diff function which compares two lists of strings
    # and returns a formatted HTML table comparing them
    create_diff(output_file)

if __name__ == "__main__":
    main()

Output HTML file with problematic data:


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html>
<body style="background-color:white;">
<head>
    <meta http-equiv="Content-Type"
          content="text/html; charset=utf-8" />
    <title></title>
    <style type="text/css">
        table.diff {font-family:Courier; color: black; border:medium;table-layout: auto; width: 96%; word-wrap: break-word; margin-left: auto; margin-right: auto;}
        td.diff_header {text-align:left}
        .diff_header {background-color:#e0e0e0}
        .diff_next {background-color:#c0c0c0}
        .diff_add {background-color:#aaffaa}
        .diff_chg {background-color:#ffff77}
        .diff_sub {background-color:#ffaaaa}
        .widthA{width:1%}
        .widthB{width:7%}
        .widthC{width:40%}
        .element_title{color: black; text-align:center}
        .element_diff{color: black; width: 96%; margin-left: auto; margin-right: auto}
    </style>
</head>

    <table class="diff" id="difflib_chg_to0__top"
           cellspacing="0" cellpadding="0" rules="groups" >
        <colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
        <colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
        <thead><tr><th class="diff_next"><br /></th><th colspan="2" class="diff_header">table1A</th><th class="diff_next"><br /></th><th colspan="2" class="diff_header">table1B</th></tr></thead>
        <tbody>
            <tr><td class="diff_next" id="difflib_chg_to0__0"><a href="#difflib_chg_to0__top">t</a></td><td class="diff_header" id="from0_1">1</td><td ><span class="diff_sub">Ab</span></td><td class="diff_next"><a href="#difflib_chg_to0__top">t</a></td><td class="diff_header" id="to0_1">1</td><td ><span class="diff_add">Abcdefghijklmn</span></td></tr>
            <tr><td class="diff_next"></td><td class="diff_header" id="from0_2">2</td><td ><span class="diff_sub">my_id: ID1234, attribute: abcd fghijklmn py stuvwx zabc efg ijklmnopy st vwxy abcd f , name: abcd fghijklmn py stuvwx zabc efg ijklmnopy st vwxy abcd f </span></td><td class="diff_next"></td><td class="diff_header"></td><td ></td></tr>
            <tr><td class="diff_next"></td><td class="diff_header" id="from0_3">3</td><td >my_id: ID<span class="diff_chg">5678</span>, attribute: abcde ghijklm opyrst vwxyz bc efghij lmnopyr tuv xyzab defabcde ghijklmno yrstuvwxyza cdefghijklmnop rstuvw yz b defab defgh jklmnop rstuvwxyz bcd fghijkl nopyr tuvwxyza cdefabcd fghijkl no yrs uvwxyzabcd fghij l n, name: ab defghi klmnopy stuvwxy abcde ghijklm op</td><td class="diff_next"></td><td class="diff_header" id="to0_2">2</td><td >my_id: ID<span class="diff_chg">1234</span>, attribute: abcde ghijklm opyrst vwxyz bc efghij lmnopyr tuv xyzab defabcde ghijklmno yrstuvwxyza cdefghijklmnop rstuvw yz b defab defgh jklmnop rstuvwxyz bcd fghijkl nopyr tuvwxyza cdefabcd fghijkl no yrs uvwxyzabcd fghij l n, name: ab defghi klmnopy stuvwxy abcde ghijklm op</td></tr>
            <tr><td class="diff_next"></td><td class="diff_header"></td><td ></td><td class="diff_next"></td><td class="diff_header" id="to0_3">3</td><td ><span class="diff_add">my_id: ID5678, attribute: abcd fghijklmn py stuvwx zabc efg ijklmnopy st vwxy abcd f , name: abcd fghijklmn py stuvwx zabc efg ijklmnopy st vwxy abcd f </span></td></tr>
        </tbody>
    </table>
        </body>

</html>

The problematic line in the HTML is:

<tr><td class="diff_next"></td><td class="diff_header" id="from0_2">2</td><td ><span class="diff_sub">my_id: ID1234, attribute: abcd fghijklmn py stuvwx zabc efg ijklmnopy st vwxy abcd f , name: abcd fghijklmn py stuvwx zabc efg ijklmnopy st vwxy abcd f </span></td><td class="diff_next"></td><td class="diff_header"></td><td ></td></tr>

Where there is an extra <td ></td> at the end, adding an extra column.
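One way to test this claim is to count the `<td>` cells in each row of the generated table. The sketch below is an illustration only: the `CellCounter` class and the sample lines are hypothetical stand-ins, not the data from this report. If every body row reports the same number of cells, the table has no extra column, and any misalignment comes from how difflib paired the lines.

```python
import difflib
from html.parser import HTMLParser


class CellCounter(HTMLParser):
    """Hypothetical helper: counts <td> cells inside each <tr>."""

    def __init__(self):
        super().__init__()
        self.cells_per_row = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # Start counting cells for a new row.
            self.cells_per_row.append(0)
        elif tag == "td" and self.cells_per_row:
            self.cells_per_row[-1] += 1


# Made-up sample data: near-duplicate lines in swapped order.
old = ["id: 1, short", "id: 2, a much longer line of text here"]
new = ["id: 1, a much longer line of text here", "id: 2, short"]

table = difflib.HtmlDiff().make_table(old, new, "old", "new")

counter = CellCounter()
counter.feed(table)

# The header row uses <th>, so it has zero <td> cells; skip it.
body_rows = [n for n in counter.cells_per_row if n > 0]
print(body_rows)
```

The same counter can be fed the HTML file written by diff_tool.py to inspect the rows from this report.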

This is what the table looks like with the unwanted column

This is what the table is supposed to look like when the columns are aligned correctly

I have tried changing the length of the string, checking for invisible characters, removing colons. There is something wrong with the strings that I am providing as input which throws off the make_table function. I have provided make_table with longer strings and the output was just fine. The behavior is very inconsistent.

Thanks for the help.

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux

tomasr8 commented 6 months ago

The output looks correct to me. In your case the third line in table 'A' is almost the same as the second line in table 'B' (except for the ID), so the diff is probably smaller when displayed this way, e.g.:

- line 2
- line 3
+ line 2
+ line 3

vs

- line 2
my_id: ID5678, attribute: abcde ...
+ line 3

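The pairing described above can be reproduced on a smaller scale. The following is a minimal sketch with made-up lines (not the data from the report), assuming difflib.Differ with its default settings. Two near-identical long lines appear in swapped positions, so Differ should pair them with each other and report the remaining short lines as a pure deletion and a pure insertion.

```python
from difflib import Differ

# Made-up sample data: the long line moves from position 2 to position 1,
# and only its ID changes.
old = ["id: 1, short", "id: 2, a much longer line of text here"]
new = ["id: 1, a much longer line of text here", "id: 2, short"]

result = list(Differ().compare(old, new))

# The two long lines are paired and get '?' hint lines marking the changed
# character; the two short lines are not paired with each other.
for line in result:
    print(line.rstrip("\n"))
```

HtmlDiff builds its table from the same kind of pairing, which is why an unpaired line sits opposite an empty cell.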
needadiff commented 6 months ago

> The output looks correct to me. In your case the third line in table 'A' is almost the same as the second line in table B (except for the ID) so displaying it this way the diff is probably smaller e.g:

I understand what you are saying, but the output is not correct. Here is a better example to demonstrate: [image]

The first table is displayed correctly. There are two rows for each table, one for each string. The string in row 1, table 1A is "abcf", while the string in row 1, table 1B is "abcef". The difference between the two is the character "e", which is highlighted in green to demonstrate that this character was added.

The second table is not displayed correctly. In this case, we are comparing two strings, "abcd" and "abcde". Similar to the first table, this should have two rows, where "abcd" and "abcde" are lined up together on row 1. The character "e" should be highlighted in green to show that a single character was added between the two strings. Instead, a blank column is added to table 2A, and the entire string "abcde" in table 2B is highlighted in green, as if the difference between a blank string and "abcde" were the whole string "abcde".

I have been looking through the source code of difflib to find a solution, without luck. I think this is a bug in the library that appears when the lists of strings provided mirror each other too closely. Thoughts?

rhit-parsonjc commented 6 months ago

I think this is because difflib does a line-by-line comparison to find matching lines first before determining how to handle each line. For example, in this code snippet:

from difflib import Differ
lines1 = ['abcdf', 'abc', 'abcde']
lines2 = ['abcde', 'abx', 'abcdg']
print(list(Differ().compare(lines1, lines2)))

The output is:

['- abcdf', '- abc', '  abcde', '+ abx', '+ abcdg']

Within the compare method of the Differ class, a SequenceMatcher object called cruncher is used to find common lines. Because 'abcde' is present in both lines1 and lines2, cruncher suggests deleting the first two lines of lines1 and adding the last two lines of lines2.

There is, however, no guarantee that the difference generated will have the fewest number of characters.
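The line-level step can be inspected directly. This is a small sketch that runs SequenceMatcher on the same two lists, treating each line as one element; it approximates what cruncher computes inside Differ.compare rather than reproducing the internal code exactly.

```python
from difflib import SequenceMatcher

lines1 = ['abcdf', 'abc', 'abcde']
lines2 = ['abcde', 'abx', 'abcdg']

# Compare whole lines as sequence elements (no junk filter).
matcher = SequenceMatcher(None, lines1, lines2)
opcodes = matcher.get_opcodes()

for tag, i1, i2, j1, j2 in opcodes:
    print(tag, lines1[i1:i2], lines2[j1:j2])
# delete ['abcdf', 'abc'] []
# equal ['abcde'] ['abcde']
# insert [] ['abx', 'abcdg']
```

The exact match on 'abcde' anchors the alignment, so 'abcdf' and 'abcdg' are never compared character by character, even though each differs from 'abcde' by a single character.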

bombs-kim commented 1 day ago

@needadiff Any widely used diff algorithm first tries to find exactly matching lines, rather than finding intraline differences. Maybe trying online diff tools could convince you.

https://www.diffchecker.com/text-compare/

Or give git a try.

Please close this issue if you find this agreeable.