Dunedan commented 2 years ago

Bug report

csv.Sniffer().sniff() detects the wrong field delimiter if the possible valid delimiters contain \t and the provided data contains one line starting with the combination of double quotes followed by a tab and has the same combination of double quotes and a tab at least once more afterwards in the data. In that case the delimiter is always reported as \t instead of the correct one.

Configuring the preferred separator as suggested in https://github.com/python/cpython/issues/80678#issuecomment-1093819685 doesn't have any effect on this behavior.

Here is an example to reproduce it:

import csv
from pprint import pprint

data = """
Surname;First Name;Year of birth
"\tDoe;Jane"\t;1971
"""

sniffer = csv.Sniffer()
dialect = sniffer.sniff(data, delimiters=",;\t")
pprint(dict(dialect.__dict__))

Expected output:

{'__doc__': None,
 '__module__': 'csv',
 '_name': 'sniffed',
 'delimiter': ';',
 'doublequote': False,
 'lineterminator': '\r\n',
 'quotechar': '"',
 'quoting': 0,
 'skipinitialspace': False}

Actual output:

{'__doc__': None,
 '__module__': 'csv',
 '_name': 'sniffed',
 'delimiter': '\t',
 'doublequote': False,
 'lineterminator': '\r\n',
 'quotechar': '"',
 'quoting': 0,
 'skipinitialspace': False}

Your environment

CPython versions tested on: 3.7.14, 3.8.14, 3.9.14, 3.10.7
Operating system and architecture: Debian/sid on amd64

rhettinger commented 2 years ago

I don't think this can be considered a bug. The sample fragment offers insufficient information for the sniffer heuristics to deduce the result you expected.

The sniffer reasonably makes a first attempt to deduce the quote character and then infer the delimiter:

    def _guess_quote_and_delimiter(self, data, delimiters):
        """
        Looks for text enclosed between two identical quotes
        (the probable quotechar) which are preceded and followed
        by the same character (the probable delimiter).
        For example:
                         ,'some text',
        The quote with the most wins, same with the delimiter.
        If there is no quotechar the delimiter can't be determined
        this way.
        """

The sample snippet doesn't follow this pattern because it only uses the intended quoting characters once and at the beginning of a line, and is inexplicably followed by a tab character outside of the quoted text and before the intended delimiter. If there were even a single example of quoting outside the first field, the sniffer would deduce your intended outcome:

data = """
Surname;First Name;Year of birth
"\tDoe;Jane"\t;1971
"Le Trec";"Mary Ann";1486
"""

Likewise, if typical quoting patterns were followed (the entire field quoted or not), it would work as intended:

data = """
Surname;First Name;Year of birth
"\tDoe";"Jane\t";1971
"""

Since the snippet doesn't follow any of the norms expected by _guess_quote_and_delimiter(), the logic falls back _guess_delimiter() which uses reasonably uses this heuristic:

        The delimiter /should/ occur the same number of times on
        each row. However, due to malformed data, it may not. We don't want
        an all or nothing approach, so we allow for small variations in this
        number.
          1) build a table of the frequency of each character on every line.
          2) build a table of frequencies of this frequency (meta-frequency?),
             e.g.  'x occurred 5 times in 10 rows, 6 times in 1000 rows,
             7 times in 2 rows'
          3) use the mode of the meta-frequency to determine the /expected/
             frequency for that character
          4) find out how often the character actually meets that goal
          5) the character that best meets its goal is the delimiter
        For performance reasons, the data is evaluated in chunks, so it can
        try and evaluate the smallest portion of the data possible, evaluating
        additional chunks as necessary

The provided sample is insufficient for this as well.

Dunedan commented 2 years ago

Sorry, apparently my initial example wasn't a good one. Here is a better one, illustrating the issue with the entire field quoted:

data = """
Surname;First Name;Year of birth
"\t";"\t";1971
"Le Trec";"Mary Ann";1486
"""

If I'm not mistaken your explanation doesn't explain why the field delimiter is still detected as \t with this example, as fields are properly quoted now and there are also quoted fields with different content.

I noticed this behavior because of how Microsoft Excel handles CSV files: When opening a CSV file in Excel which contains fields with a single (unquoted) tab character and saving the file again, Excel adds double quote around it like shown above. If a single line contains at least two of these tab-only fields and one of them is in the first row, the delimiter is wrongly detected as \t.

rhettinger commented 2 years ago

The sniffer logic picks between possible alternatives by selecting the highest number of matches. This means that the sample needs to contain more of the expected case than the unusual case. For example, just repeating the last line will give the desired result because the desired case wins the election:

Surname;First Name;Year of birth
"\t";"\t";1971
"Le Trec";"Mary Ann";1486
"Le Trec";"Mary Ann";1486
"Le Trec";"Mary Ann";1486

I'm adding other developers who worked on this code. Perhaps, they can add some insight.

Here are a couple of other misc thoughts I wanted to jot down so they won't get lost:

If you know the output is from Excel, why use the Sniffer at all, just the the excel dialect?
Changing the sniffer logic would be precarious. If a different dialect is deduced from the same data, then deployed code that has been working can fail.

Dunedan commented 2 years ago

The sniffer logic picks between possible alternatives by selecting the highest number of matches. This means that the sample needs to contain more of the expected case than the unusual case. For example, just repeating the last line will give the desired result because the desired case wins the election:

From my perspective that's already the case: The sample contains way more semicolons than tabs. No human would consider tabs to be the field delimiter looking at the data.

What you're getting at are probably the number of double quotes and their enclosed characters though. In our case that didn't help, as other lines didn't have the need for and therefore didn't contain double quotes.

If you know the output is from Excel, why use the Sniffer at all, just the the excel dialect?

In this case, we need to be able to process CSV-files no matter how they got generated and which dialect they use. Excel-generated CSV-files are just one of the possibilities. That's why we use the sniffer there. I also did double check that this includes being able to detect CSV-files which actually use tabs as field delimiter. If that wouldn't have been the case we could've simply removed tabs from the possible valid delimiters.

Changing the sniffer logic would be precarious. If a different dialect is deduced from the same data, then deployed code that has been working can fail.

Maybe it'd help to keep the default behavior, but let it detect the correct delimiter when specifying the Sniffer().preferred attribute.

serhiy-storchaka commented 2 years ago

It does not fall back to _guess_delimiter(). _guess_quote_and_delimiter() counts not only the quoted text surrounded by delimiters, but the quoted text at the beginning or at the ending of the line, followed or preceded by the delimiter. And in this example it founds quoted "\tDoe;Jane" followed by \t. There are no other quoted fields, so it wins.

lizdeika commented 1 year ago

Hi, Got the same problem with some amazon selling partner api GET_AMAZON_FULFILLED_SHIPMENTS_DATA_GENERAL reports. The file is "\t" delimited. But python 3.11.3 csv Sniffer detects it as " " (space). I have opened the same report file with macos Numbers and LibreOffice. Both programs detect delimiter correctly and data is presented as it should

python / cpython

CSV sniffing detects wrong field delimiter #97611

Bug report

Your environment