python / cpython

The Python programming language
https://www.python.org

Csv sniffer doesn't attempt to determine and set escape character. #83273

Open a32c5283-a3ab-48d8-be08-8f801e65cfc7 opened 4 years ago

a32c5283-a3ab-48d8-be08-8f801e65cfc7 commented 4 years ago
BPO 39092
Nosy @evanw2

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-bug', 'library', '3.9']
title = "Csv sniffer doesn't attempt to determine and set escape character."
updated_at =
user = 'https://github.com/evanw2'
```

bugs.python.org fields:
```python
activity =
actor = 'evan.whitfield'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'evan.whitfield'
dependencies = []
files = []
hgrepos = []
issue_num = 39092
keywords = []
message_count = 1.0
messages = ['358645']
nosy_count = 1.0
nosy_names = ['evan.whitfield']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue39092'
versions = ['Python 3.9']
```

a32c5283-a3ab-48d8-be08-8f801e65cfc7 commented 4 years ago

I observed a false positive from the csv sniffer's has_header method: it reported a header when there was none. has_header determines the csv dialect by sniffing it, and the sniffer failed to detect that the file I was using has an escape character of '\'. Because the escape character was not set, the first line of the file was split into columns incorrectly: the parser encountered an escaped quote within a quoted column and treated it as the end of that column. (The sniffer correctly determined that the dialect wasn't doublequote, but the escape character apparently still needs to be set to handle an escaped quotechar.)
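To make the failure mode concrete, here is a minimal sketch that parses one row twice: once with the escape character unset (the state the report describes after sniffing) and once with it set by hand. The sample row and the `split_row` helper are made up for illustration; they are not taken from the original file.

```python
import csv
import io

# Hypothetical sample row: the first column contains backslash-escaped quotes
# and a comma inside the quoted text.
LINE = '"he said \\"hi, there\\"",1,2\n'


def split_row(line, escapechar):
    """Parse one line with an explicit dialect and return its fields."""
    reader = csv.reader(
        io.StringIO(line),
        delimiter=",",
        quotechar='"',
        doublequote=False,      # matches what the report says was detected
        escapechar=escapechar,  # None reproduces the missing escape character
    )
    return next(reader)


without_escape = split_row(LINE, None)
with_escape = split_row(LINE, "\\")

# The escaped quote ends the quoted column early, so the comma inside the
# text is treated as a delimiter and an extra column appears.
print(len(without_escape), without_escape)
print(len(with_escape), with_escape)
```

With the escape character unset, the row comes back with one more column than it really has, which is the kind of mis-split that can mislead has_header's column comparison.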

I think one (or both) of these things should be done here to avoid this false positive:

1. Allow a dialect to be passed to has_header, so that someone could specify the escape character of the dialect if it were known.
2. Allow the sniff method of the Sniffer class to detect and set the escapechar.
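Until either change exists, a possible workaround when the escape character is known is to override it on a dialect before building the reader. The sketch below is only an illustration: `with_escapechar` is a hypothetical helper, and `csv.excel` stands in for the dialect class that `csv.Sniffer().sniff()` would return. It does not change what has_header does internally.

```python
import csv
import io


def with_escapechar(base_dialect, escapechar="\\"):
    """Return a subclass of base_dialect with the escape character set.

    Hypothetical helper; assumes base_dialect is a csv.Dialect subclass.
    """
    overrides = {"escapechar": escapechar, "doublequote": False}
    return type("patched", (base_dialect,), overrides)


# Made-up sample data with backslash-escaped quotes in the first column.
sample = '"he said \\"hi, there\\"",1,2\n"plain",3,4\n'

# csv.excel is a stand-in here for a sniffed dialect class.
dialect = with_escapechar(csv.excel)
rows = list(csv.reader(io.StringIO(sample), dialect))

for row in rows:
    print(row)
```

With real data, `base_dialect` would be the sniffed dialect, and any header check would have to be run against rows read this way rather than through has_header itself.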