Open 1cf21dd1-a46e-4b77-bcc2-5ab1583fc183 opened 7 years ago
Line terminator defaults to '\r\n' while detecting dialect in csv.Sniffer
The csv expert listed in https://devguide.python.org/experts/ is marked as inactive, and I have never used the module. So you might need to ask for help on core-mentorship list.
The csv doc for Sniffer.sniff says "Analyze the given sample and return a Dialect subclass reflecting the parameters found." It is not clear to me whether 'the parameters found' is meant to be all possible parameters or just those found. So, to be conservative, I will initially treat this an a feature addition for the the next version, rather than a bug to also be fixed in current versions. It does seem like a reasonable request.
Looking at the code and docstring, lineterminator was intentionally (knowingly) not sniffed, making this a feature addition.
While Sniffer *returns a dialect with lineterminator = '\r\n', it *uses '\n' for splitting. This is slightly odd, as it leaves lines terminated by '\r' while detecting within-line parameters, but it does not affect such detection.
Are there csv files in the wild that use \r as line terminator. If so, they will not currently get split.
There is another issue related to this. If you use codecs to get a reader, it uses str.splitlines() internally, which treats a bunch of different characters as line terminators. See issue bpo-18291 and:
https://docs.python.org/3.8/library/stdtypes.html#str.splitlines
I was thinking about different ways to fix this. First, the csv module suggests you pass newline='' to the file object. I suspect most people don't know to do that. So, I thought maybe the csv module should inspect the file object that gets passed in and then warn if newline='' has not been used or if the file is a codecs reader object.
However, that seems fairly complicated. Would it be better if we changed the 'csv' module to do its own line splitting? I think that would be better although I'm not sure about backwards compatibly. Currently, the reader expects to call iter() on the input file. Would it be okay if it used the 'read' method of it in preference to using iter()? It could still fallback to iter() if there was no read method.
Note that the current CSV parser in _csv.c doesn't require the line terminator, it eats up \r and \n where necessary. See:
https://github.com/python/cpython/blob/fd512d76456b65c529a5bc58d8cfe73e4a10de7a/Modules/_csv.c#L752
This is why the line terminator isn't detected and doesn't need to be detected.
Also, files that use the \r line terminator exist and are parsed correctly at the moment. See for example: https://raw.githubusercontent.com/hadley/data-fuel-economy/master/1998-2008/2008.csv
A couple comments.
The csv expert listed in https://devguide.python.org/experts/ is marked as inactive
That would be me. I am indeed inactive w.r.t. fixing broken stuff, and don't want to feel obligated to jump in with both feet when a CSV ticket is raised. Still, I keep half an eye on things. If people are actually interested in my opinion on such stuff, drop me a line.
Maybe I'm just getting old and cranky, but deprecation is the fork in the road I'd take, given the choice. Second choice would be to simplify the delimiter sniffing logic and get rid of anything to do with line terminators.
Skip
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = None closed_at = None created_at =
labels = ['3.8', 'type-feature', 'library']
title = 'csv.Sniffer does not detect lineterminator'
updated_at =
user = 'https://github.com/vmax'
```
bugs.python.org fields:
```python
activity =
actor = 'vmax'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'vmax'
dependencies = []
files = []
hgrepos = []
issue_num = 30825
keywords = ['patch']
message_count = 7.0
messages = ['297497', '311804', '311805', '311806', '327094', '329482', '329487']
nosy_count = 5.0
nosy_names = ['skip.montanaro', 'nascheme', 'terry.reedy', 'vmax', 'Gertjan van den Burg']
pr_nums = ['2529', '18336']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue30825'
versions = ['Python 3.8']
```