python / cpython

The Python programming language
https://www.python.org

csv.reader split error #84643

Open 9cf60f49-bd9a-40e1-977a-fbd2c323ccc1 opened 4 years ago

9cf60f49-bd9a-40e1-977a-fbd2c323ccc1 commented 4 years ago
BPO 40463
Nosy @ericvsmith
Files
  • 01_test_code.py: use csv.reader split a sentence, return columns
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:
    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['3.7', '3.8', 'type-bug', 'library']
    title = 'csv.reader split error'
    updated_at =
    user = 'https://bugs.python.org/wy7305e'
    ```
    bugs.python.org fields:
    ```python
    activity =
    actor = 'eric.smith'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation =
    creator = 'wy7305e'
    dependencies = []
    files = ['49104']
    hgrepos = []
    issue_num = 40463
    keywords = []
    message_count = 2.0
    messages = ['367823', '367825']
    nosy_count = 2.0
    nosy_names = ['eric.smith', 'wy7305e']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue40463'
    versions = ['Python 3.7', 'Python 3.8']
    ```

    9cf60f49-bd9a-40e1-977a-fbd2c323ccc1 commented 4 years ago

    python 3.6 or python 3.8

    csv.reader

    delimiter=','
    quotechar='"'

    split this sentence:

    "A word of encouragement and explanation, of pity for my childish ignorance, of welcome home, of reassurance to me that it was home, might have made me dutiful to him in my heart henceforth, instead of in my hypocritical<eword w=\"hypocritical\"></eword> outside, and might have made me respect instead of hate him. ","Part 1/CHAPTER 4. I FALL INTO DISGRACE/","David Copperfield"

    It returns 4 columns, but it should return 3 columns.

    ericvsmith commented 4 years ago

    You should tell us what you're seeing, and what you're expecting.

    I'm adding the rest of this not because it solves your problem, but because it might help you or someone else troubleshoot this further.

    Here's a simpler reproducer:

    import csv
    lst = ['"A,"h"e, ","E","DC"']
    
    csv_list = csv.reader(lst)
    for idx, col in enumerate(next(csv_list)):
        print(idx, repr(col))

    Which produces:

    0 'A,h"e'
    1 ' "'
    2 'E'
    3 'DC'

    Note that this uses the default dialect='excel'. True to that name, my version of Excel gives these same 4 columns, including the space starting the second column.

    Dropping the space after the "e," gives 3 columns:

    lst = ['"A,"h"e,","E","DC"']

    Produces:

    0 'A,h"e'
    1 ',E"'
    2 'DC'

    Again, this is exactly what Excel gives, as odd as it seems.

    It might be worth playing around with the dialect parameters to see if you can achieve what you want. In your example: delimiter=',', quotechar='"' are the default values for the "excel" dialect, which is why I dropped them above.
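As a quick check of the point about defaults, the sketch below reads the parameters of the registered "excel" dialect and confirms that passing delimiter=',' and quotechar='"' explicitly changes nothing. The variable names (`excel`, `defaults`, `explicit`, `implicit`) are illustrative only.

```python
import csv

# Look up the dialect that csv.reader uses when none is given.
excel = csv.get_dialect("excel")
defaults = {
    name: getattr(excel, name)
    for name in ("delimiter", "quotechar", "escapechar",
                 "doublequote", "skipinitialspace", "strict")
}
print(defaults)

# The simplified reproducer row from this comment.
row = '"A,"h"e, ","E","DC"'

# Spelling out the default delimiter and quotechar gives the same parse.
explicit = next(csv.reader([row], delimiter=",", quotechar='"'))
implicit = next(csv.reader([row]))
print(explicit == implicit)
```

The parameters most likely to matter for input like this are escapechar, skipinitialspace and strict.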

    serhiy-storchaka commented 8 months ago

    As for the original problem, you should use escapechar='\\'.

    >>> row = r'''"A word of encouragement and explanation, of pity for my childish ignorance, of welcome home, of reassurance to me that it was home, might have made me dutiful to him in my heart henceforth, instead of in my hypocritical<eword w=\"hypocritical\"></eword> outside, and might have made me respect instead of hate him. ","Part 1/CHAPTER 4. I FALL INTO DISGRACE/","David Copperfield"'''
    >>> next(csv.reader([row]))
    ['A word of encouragement and explanation, of pity for my childish ignorance, of welcome home, of reassurance to me that it was home, might have made me dutiful to him in my heart henceforth, instead of in my hypocritical<eword w=\\hypocritical\\"></eword> outside', ' and might have made me respect instead of hate him. "', 'Part 1/CHAPTER 4. I FALL INTO DISGRACE/', 'David Copperfield']
    >>> next(csv.reader([row], escapechar='\\'))
    ['A word of encouragement and explanation, of pity for my childish ignorance, of welcome home, of reassurance to me that it was home, might have made me dutiful to him in my heart henceforth, instead of in my hypocritical<eword w="hypocritical"></eword> outside, and might have made me respect instead of hate him. ', 'Part 1/CHAPTER 4. I FALL INTO DISGRACE/', 'David Copperfield']
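The same effect can be seen on a much shorter row. This is a minimal sketch with a made-up input (`short_row` is not from the original report) that, like the original sentence, contains backslash-escaped quotes inside a quoted field:

```python
import csv

# Made-up short row: the first field contains \" sequences, as in the report.
short_row = r'"a \"b\", c","d"'

# Default dialect: the backslash is an ordinary character, so the first
# inner quote closes the quoted field and the row is split too often.
without_escape = next(csv.reader([short_row]))
print(without_escape)

# With escapechar set, \" is read as a literal quote inside the field.
with_escape = next(csv.reader([short_row], escapechar="\\"))
print(with_escape)
```

Only the second call yields the two fields the data was meant to contain.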

    @ericvsmith's examples are a different case. They do not contain escaping.

    >>> row1 = '"A,"h"e, ","E","DC"'
    >>> row2 = '"A,"h"e,","E","DC"'
    >>> next(csv.reader([row1]))
    ['A,h"e', ' "', 'E', 'DC']
    >>> next(csv.reader([row2]))
    ['A,h"e', ',E"', 'DC']

    A quote before h is not escaped and is not doubled, so it closes the "quoted field" mode, and the rest of the field is parsed in the "unquoted field" mode. The first field ends at the first unquoted comma. Then the second field starts.

    In the first example, it starts with a space, and since it is not a quote, the second field is parsed in the "unquoted field" mode. It includes a space and a quote character: ' "'.

    In the second example, it starts with a quote, which starts the "quoted field" mode. The comma between quotes is now quoted, so it is included in the second field. The quote before E ends the "quoted field" mode, so E and the following quote are parsed in the "unquoted field" mode and are included in the second field: ',E"'.
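The two modes described above can be modelled with a small toy splitter. This is only a sketch of the behaviour as explained in this comment, not the actual `_csv` implementation: `split_row` is a hypothetical helper, and it ignores escapechar, newlines, skipinitialspace and strict.

```python
import csv

def split_row(line, delimiter=",", quotechar='"'):
    """Toy model of the quoted/unquoted field modes (hypothetical helper)."""
    fields, buf = [], []
    at_field_start = True   # a quote only opens quoted mode at field start
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == quotechar:
                if i + 1 < len(line) and line[i + 1] == quotechar:
                    buf.append(quotechar)   # doubled quote: literal quote
                    i += 1
                else:
                    in_quotes = False       # lone quote: leave quoted mode
            else:
                buf.append(ch)              # delimiters are literal here
        elif ch == delimiter:
            fields.append("".join(buf))     # unquoted delimiter ends the field
            buf = []
            at_field_start = True
            i += 1
            continue
        elif ch == quotechar and at_field_start:
            in_quotes = True                # field starts in quoted mode
        else:
            buf.append(ch)                  # includes quotes in unquoted mode
        at_field_start = False
        i += 1
    fields.append("".join(buf))
    return fields

row1 = '"A,"h"e, ","E","DC"'
row2 = '"A,"h"e,","E","DC"'
for row in (row1, row2):
    print(split_row(row), split_row(row) == next(csv.reader([row])))
```

For both rows the toy model agrees with csv.reader, which supports the reading that a lone quote inside a quoted field simply switches the parser back to unquoted mode.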

    If you use the skipinitialspace=True option, you get the same result in both examples:

    >>> next(csv.reader([row1], skipinitialspace=True))
    ['A,h"e', ',E"', 'DC']

    So all this is logical and consistent. But it depends on one non-intuitive assumption: that a quote character in the "quoted field" mode simply ends that mode even if it is not followed by the delimiter or EOL. Is that how Excel and other CSV parsers behave?

    serhiy-storchaka commented 8 months ago

    With strict=True, all examples raise an exception:

    >>> next(csv.reader([row], strict=True))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
        next(csv.reader([row], strict=True))
        ~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    _csv.Error: ',' expected after '"'
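One way to make use of this is to treat a strict-mode failure as a signal that a row is malformed instead of silently accepting a surprising split. The sketch below assumes that approach; `parse_strict` is a hypothetical helper, not part of the csv module.

```python
import csv

def parse_strict(line):
    """Hypothetical helper: return the parsed fields, or None if malformed."""
    try:
        return next(csv.reader([line], strict=True))
    except csv.Error:
        # Raised, for example, when a closing quote is not followed
        # by the delimiter or the end of the line.
        return None

malformed = '"A,"h"e, ","E","DC"'      # stray quote before h
well_formed = '"A,""h""e","E","DC"'    # inner quotes are doubled

print(parse_strict(malformed))
print(parse_strict(well_formed))
```

Rows that come back as None can then be logged or repaired, for example by re-parsing them with escapechar set if the data is known to use backslash escapes.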