python / cpython

The Python programming language
https://www.python.org
Other
63.13k stars 30.22k forks source link

csv.Sniffer does not detect lineterminator #75008

Open 1cf21dd1-a46e-4b77-bcc2-5ab1583fc183 opened 7 years ago

1cf21dd1-a46e-4b77-bcc2-5ab1583fc183 commented 7 years ago
BPO 30825
Nosy @smontanaro, @nascheme, @terryjreedy, @vmax, @GjjvdBurg
PRs
  • python/cpython#2529
  • python/cpython#18336
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['3.8', 'type-feature', 'library'] title = 'csv.Sniffer does not detect lineterminator' updated_at = user = 'https://github.com/vmax' ``` bugs.python.org fields: ```python activity = actor = 'vmax' assignee = 'none' closed = False closed_date = None closer = None components = ['Library (Lib)'] creation = creator = 'vmax' dependencies = [] files = [] hgrepos = [] issue_num = 30825 keywords = ['patch'] message_count = 7.0 messages = ['297497', '311804', '311805', '311806', '327094', '329482', '329487'] nosy_count = 5.0 nosy_names = ['skip.montanaro', 'nascheme', 'terry.reedy', 'vmax', 'Gertjan van den Burg'] pr_nums = ['2529', '18336'] priority = 'normal' resolution = None stage = 'patch review' status = 'open' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue30825' versions = ['Python 3.8'] ```

    1cf21dd1-a46e-4b77-bcc2-5ab1583fc183 commented 7 years ago

    Line terminator defaults to '\r\n' while detecting dialect in csv.Sniffer

    terryjreedy commented 6 years ago

    The csv expert listed in https://devguide.python.org/experts/ is marked as inactive, and I have never used the module. So you might need to ask for help on core-mentorship list.

    The csv doc for Sniffer.sniff says "Analyze the given sample and return a Dialect subclass reflecting the parameters found." It is not clear to me whether 'the parameters found' is meant to be all possible parameters or just those found. So, to be conservative, I will initially treat this an a feature addition for the the next version, rather than a bug to also be fixed in current versions. It does seem like a reasonable request.

    terryjreedy commented 6 years ago

    Looking at the code and docstring, lineterminator was intentionally (knowingly) not sniffed, making this a feature addition.

    terryjreedy commented 6 years ago

    While Sniffer *returns a dialect with lineterminator = '\r\n', it *uses '\n' for splitting. This is slightly odd, as it leaves lines terminated by '\r' while detecting within-line parameters, but it does not affect such detection.

    Are there csv files in the wild that use \r as line terminator. If so, they will not currently get split.

    nascheme commented 6 years ago

    There is another issue related to this. If you use codecs to get a reader, it uses str.splitlines() internally, which treats a bunch of different characters as line terminators. See issue bpo-18291 and:

    https://docs.python.org/3.8/library/stdtypes.html#str.splitlines

    I was thinking about different ways to fix this. First, the csv module suggests you pass newline='' to the file object. I suspect most people don't know to do that. So, I thought maybe the csv module should inspect the file object that gets passed in and then warn if newline='' has not been used or if the file is a codecs reader object.

    However, that seems fairly complicated. Would it be better if we changed the 'csv' module to do its own line splitting? I think that would be better although I'm not sure about backwards compatibly. Currently, the reader expects to call iter() on the input file. Would it be okay if it used the 'read' method of it in preference to using iter()? It could still fallback to iter() if there was no read method.

    a604722a-64b3-42f8-ba7b-e9ee11877eb6 commented 5 years ago

    Note that the current CSV parser in _csv.c doesn't require the line terminator, it eats up \r and \n where necessary. See:

    https://github.com/python/cpython/blob/fd512d76456b65c529a5bc58d8cfe73e4a10de7a/Modules/_csv.c#L752

    This is why the line terminator isn't detected and doesn't need to be detected.

    Also, files that use the \r line terminator exist and are parsed correctly at the moment. See for example: https://raw.githubusercontent.com/hadley/data-fuel-economy/master/1998-2008/2008.csv

    smontanaro commented 5 years ago

    A couple comments.

    1. Terry Reedy wrote:

    The csv expert listed in https://devguide.python.org/experts/ is marked as inactive

    That would be me. I am indeed inactive w.r.t. fixing broken stuff, and don't want to feel obligated to jump in with both feet when a CSV ticket is raised. Still, I keep half an eye on things. If people are actually interested in my opinion on such stuff, drop me a line.

    1. Regarding the csv.Sniffer class... I've personally never found it useful, and would be happy to see it deprecated. I occasionally define a delimiter other than comma, and never specify the quotechar. (I've never seen anything other than quotation marks used anyway.) As others have indicated, the line terminator is kind of unnecessary with Python 3 (unless you need something really weird). If you actually need to specify a delimiter, I think giving a set of candidate delimiters would be sufficient. The first one encountered wins.

    Maybe I'm just getting old and cranky, but deprecation is the fork in the road I'd take, given the choice. Second choice would be to simplify the delimiter sniffing logic and get rid of anything to do with line terminators.

    Skip