python / cpython

The Python programming language
https://www.python.org

csv.Sniffer.sniff doesn't set up the dialect properly for a csv created with dialect=csv.excel_tab and containing a quote (") char #62029

Open b54bdd18-0ffa-412a-9e55-4a92cef9e562 opened 11 years ago

b54bdd18-0ffa-412a-9e55-4a92cef9e562 commented 11 years ago
BPO 17829
Files
  • csv_sniffing_excel_tab.py: Example of sniffing a csv written with dialect=csv.excel_tab and with a quote in the data

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:
    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['type-bug']
    title = 'csv.Sniffer.snif doesn\'t set up the dialect properly for a csv created with dialect=csv.excel_tab and containing quote (") char'
    updated_at =
    user = 'https://bugs.python.org/GhislainHivon'
    ```

    bugs.python.org fields:
    ```python
    activity =
    actor = 'Antoon.Pardon'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = []
    creation =
    creator = 'GhislainHivon'
    dependencies = []
    files = ['30001']
    hgrepos = []
    issue_num = 17829
    keywords = []
    message_count = 3.0
    messages = ['187709', '214800', '215031']
    nosy_count = 3.0
    nosy_names = ['GhislainHivon', 'Antoon.Pardon', 'dmi.baranov']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue17829'
    versions = ['Python 2.7', 'Python 3.2']
    ```

    b54bdd18-0ffa-412a-9e55-4a92cef9e562 commented 11 years ago

    When sniffing the dialect of a file that was created by the csv module with dialect=csv.excel_tab, and one of the rows contains a quote ("), the delimiter is set to ' ' instead of '\t'.
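The report can be approximated with the sketch below. It is not the attached csv_sniffing_excel_tab.py, which is not reproduced in this thread; the rows and the helper names `write_excel_tab` and `sniff_delimiter` are made up for illustration. It writes to an in-memory text buffer (Python 3 style) and prints whatever delimiter the sniffer reports, which may differ between Python versions.

```python
import csv
import io

# Hypothetical sample rows; the last field of the first row contains quote characters.
ROWS = [
    ["273", "MVREGR1", "ByEuPo", 'Baryton "Euphonium" populaire'],
    ["274", "MVREGR2", "ByEuCl", "Baryton classique"],
]


def write_excel_tab(rows):
    """Serialize rows with the excel_tab dialect and return the text."""
    buf = io.StringIO(newline="")  # no newline translation, as the csv docs advise
    csv.writer(buf, dialect=csv.excel_tab).writerows(rows)
    return buf.getvalue()


def sniff_delimiter(sample):
    """Return the delimiter reported by csv.Sniffer, or None if sniffing fails."""
    try:
        return csv.Sniffer().sniff(sample).delimiter
    except csv.Error:
        return None


sample = write_excel_tab(ROWS)
print(repr(sample))
print("sniffed delimiter:", repr(sniff_delimiter(sample)))
```

If the sniffed delimiter prints as `' '` rather than `'\t'`, the behaviour described in this issue is present in the interpreter being used.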

    d8900210-577a-44e0-8708-ae85a5913450 commented 10 years ago

    I had a look at this and have the following remarks.

    1) The file csv_sniffing_excel_tab.py no longer works with Python 3.3. It now produces the following traceback:

    Traceback (most recent call last):
      File "csv_sniffing_excel_tab.py", line 36, in <module>
        create_file()
      File "csv_sniffing_excel_tab.py", line 23, in create_file
        writer.writerows(test_data)
    TypeError: 'str' does not support the buffer interface

    2) The problem seems to be in the _guess_quote_and_delimiter method. If you always call _guess_delimiter, the sniffer gives the correct result.

    3) As far as I understand, the problem is the first regular expression: (?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)

    Now suppose we have a line such as the following:

    273:MVREGR1:ByEuPo:"Baryton ""Euphonium"" populaire"

    The delim group will match the space after Baryton, the space group will match nothing, and the quote group will match ". The ungrouped .*? will then match "Euphonium" (including its surrounding quotes), followed by the quote backreference matching " again and the delim backreference matching the next space.

    And so we get the wrong delimiter.
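The match described in remark 3 can be checked directly with the `re` module. This is a minimal sketch that applies the quoted pattern to the quoted line; the names `FIRST_PATTERN` and `LINE` are illustrative, and the `re.DOTALL | re.MULTILINE` flags are an assumption about how the csv module compiles the pattern.

```python
import re

# The first regular expression quoted in the remark above.
FIRST_PATTERN = r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)'

# The example line from the remark above.
LINE = '273:MVREGR1:ByEuPo:"Baryton ""Euphonium"" populaire"'

# Flags assumed to mirror the ones used by the sniffer.
match = re.search(FIRST_PATTERN, LINE, re.DOTALL | re.MULTILINE)

print(repr(match.group(0)))   # the span that matched
print(match.groupdict())      # what each named group captured
```

On this line the only match sits inside the quoted field, and the delim group captures a space, which is how a space ends up being counted as the delimiter candidate.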

    d8900210-577a-44e0-8708-ae85a5913450 commented 10 years ago

    I included a patch (against 2.7) that seems to make the test work.

    The patch prevents the delim group from matching a space.
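The patch itself is not shown in this thread. One way to read "the delim group may not match a space" is to add a space to the characters excluded by the delim group's character class. The sketch below compares the original pattern with that hypothetical variant; `PATCHED` and `delim_found` are illustrative names, not the contents of the actual patch.

```python
import re

# Pattern quoted earlier in this thread.
ORIGINAL = r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)'

# Hypothetical variant (an assumption, not the real patch):
# a space is added to the characters the delim group refuses to match.
PATCHED = r'(?P<delim>[^\w\n"\' ])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)'

LINE = '273:MVREGR1:ByEuPo:"Baryton ""Euphonium"" populaire"'


def delim_found(pattern, text):
    """Return the text captured by the delim group, or None if there is no match."""
    m = re.search(pattern, text, re.DOTALL | re.MULTILINE)
    return m.group("delim") if m else None


print("original:", repr(delim_found(ORIGINAL, LINE)))
print("patched: ", repr(delim_found(PATCHED, LINE)))
```

With the variant, this pattern no longer finds a match inside the quoted field, so it cannot propose a space as the delimiter for this line.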
