simonw / csv-diff

Python CLI tool and library for diffing CSV and JSON files
Apache License 2.0

ERROR: CSV parse error #14

Open bernd-wechner opened 3 years ago

bernd-wechner commented 3 years ago

I'm trying to diff two CSV files and csv-diff just responds with:

ERROR: CSV parse error on line 2

So I do the same thing using it as a Python package (that is, I write a Python script that loads my two files and runs csv-diff on them as per the README) and I get a different error:

KeyError: 'my_key'

I double-check the key and it is there, as column 1 in both files, which load fine in LibreOffice Calc and in Excel and look fine in a text editor.

So I look at the file encoding, and Python's magic library tells me:

'UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators'

So if I open the file with an encoding of "utf-8-sig", all works fine.
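A likely explanation for the KeyError, sketched with the standard library only: when a file with a BOM is decoded as plain "utf-8", the BOM survives as the character U+FEFF glued to the front of the first header, so the column is no longer called 'my_key'. The sample data and the first_header helper below are made up for illustration; this is not csv-diff's own code.

```python
import csv
import io

# Hypothetical sample: a tiny CSV saved as UTF-8 with a BOM and CRLF line endings.
raw = "my_key,name\r\n1,Alice\r\n".encode("utf-8-sig")


def first_header(data: bytes, encoding: str) -> str:
    """Decode the bytes with the given encoding and return the first column name."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return csv.DictReader(text).fieldnames[0]


plain = first_header(raw, "utf-8")      # BOM is kept as part of the header
sig = first_header(raw, "utf-8-sig")    # BOM is stripped by the codec

print(repr(plain))
print(repr(sig))
print(plain == "my_key", sig == "my_key")
```

The "utf-8-sig" codec also decodes files that have no BOM, so it can be used for both flavours of UTF-8.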

Seems to me to be a file encoding issue, and one I have encountered in Python a lot, so I wrote this:

import magic  # third-party: python-magic

def file_encoding(filepath):
    '''
    Text encoding is a bit of a schmozzle in Python. Alas.

    A quick summary:

    1. I come across CSV files with a UTF-8 or UTF-16 encoding regularly enough.
    2. Python wants to know the encoding when we open the file
    3. UTF-16 is fine, but UTF-8 comes in two flavours, with and without a BOM
    4. The BOM (byte order mark) is optional and irrelevant in UTF-8
    5. In fact Unicode standards recommend against including a BOM with UTF-8
        https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
    6. Python's "utf-8" codec assumes it's not there
    7. Some CSV sources though write with a BOM
    8. The encoding must therefore be specified as:
        utf-16     for UTF-16 files
        utf-8       for UTF-8 files with no BOM
        utf-8-sig   for UTF-8 files with a BOM
    9. The "magic" library reliably determines the encoding efficiently by looking
       at the magic numbers at the start of a file
    10. Alas it returns a rich string describing the encoding.
    11. It contains either UTF-16 or UTF-8
    12. It contains "(with BOM)" if a BOM is detected
    13. Because of this schmozzle a quick function to translate "magic" output
        to standard encoding names is here.

    :param filepath: The path to a file
    '''
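    m = magic.from_file(filepath)
    utf16 = "UTF-16" in m
    utf8 = "UTF-8" in m
    bom = "(with BOM)" in m

    if utf16:
        return "utf-16"
    elif utf8:
        # utf-8-sig strips a leading BOM; plain utf-8 would keep it as U+FEFF
        return "utf-8-sig" if bom else "utf-8"

    # Neither detected: None lets open() fall back to the platform default
    return None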

and then if I run:

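from csv_diff import load_csv, compare

# File1 and File2 are the paths to the two CSV files; key is the key column name
with open(File1, "r", encoding=file_encoding(File1), newline='') as f1:
    csv1 = load_csv(f1, key=key)

with open(File2, "r", encoding=file_encoding(File2), newline='') as f2:
    csv2 = load_csv(f2, key=key)

diff = compare(csv1, csv2)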

all is good and I get a reliable diff.

I can't work out how to debug the CLI in PyDev, alas. I'm a tad green in this space, it seems. Running setup.py build just creates a build folder containing a lib folder with __init__.py and cli.py in it. Yet my Windows box (man I hate Windows, but I'm stuck there right now) runs a csvdiff.exe, which was presumably installed by pip when I installed csv-diff (pip install csv-diff). I can't see how to run the CLI from the source. I guess I could do some reading on click and setuptools, but for the moment I have it working via the Python package interface and can run with that.

If the CLI error is in fact related to this encoding issue (hard to know for sure), then it could be fixed by including an encoding check as above and opening the files with the appropriate encoding. Frankly, it'd be nice if Python's open() could guess the encoding better (the way magic can).
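For what it's worth, an encoding check along those lines does not strictly need the magic dependency. The sketch below looks only at a file's leading bytes and compares them with the BOM constants in the standard codecs module. The helper name sniff_bom_encoding and its default argument are made up for illustration; this is not what csv-diff or any PR does, and unlike magic it cannot recognise an encoding when no BOM is present.

```python
import codecs


def sniff_bom_encoding(path, default="utf-8"):
    """Hypothetical helper: choose a codec name from a file's byte order mark."""
    with open(path, "rb") as f:
        head = f.read(4)

    # Check UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"

    # No BOM found: fall back to the caller's default guess.
    return default
```

The result can be passed straight to open(), for example open(path, encoding=sniff_bom_encoding(path), newline=''), in place of file_encoding() above.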

patric-r commented 3 years ago

Having this feature would be awesome.

mikecoop83 commented 3 years ago

Having this feature would be awesome.

If you get a chance, could you try out my PR to see if it solves your problem?

rene-schwabe commented 3 years ago

Any chance the PR from @mikecoop83 gets merged?