simonw / sqlite-utils

Python CLI utility and library for manipulating SQLite databases
https://sqlite-utils.datasette.io
Apache License 2.0

CSV files with too many values in a row cause errors #440

Closed frafra closed 2 years ago

frafra commented 2 years ago

Original title: csv.DictReader can have None as key

In some cases csv.DictReader can produce rows that have None as a key (for values with no matching column heading) and a list of the extra values as the value. sqlite_utils.utils.rows_from_file cannot handle that:

url="https://artsdatabanken.no/Fab2018/api/export/csv"
db = sqlite_utils.Database(":memory")

with urlopen(url) as fab:
    reader, _ = sqlite_utils.utils.rows_from_file(fab, encoding="utf-16le")   
    db["fab2018"].insert_all(reader, pk="Id")

Result:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/home/user/.local/pipx/venvs/sqlite-utils/lib/python3.8/site-packages/sqlite_utils/db.py", line 2924, in insert_all
    chunk = list(chunk)
  File "/home/user/.local/pipx/venvs/sqlite-utils/lib/python3.8/site-packages/sqlite_utils/db.py", line 3454, in fix_square_braces
    if any("[" in key or "]" in key for key in record.keys()):
  File "/home/user/.local/pipx/venvs/sqlite-utils/lib/python3.8/site-packages/sqlite_utils/db.py", line 3454, in <genexpr>
    if any("[" in key or "]" in key for key in record.keys()):
TypeError: argument of type 'NoneType' is not iterable

Code: https://github.com/simonw/sqlite-utils/blob/59be60c471fd7a2c4be7f75e8911163e618ff5ca/sqlite_utils/db.py#L3454

sqlite-utils insert from command line is not affected by this issue.

simonw commented 2 years ago

rows_from_file() isn't part of the documented API but maybe it should be!

simonw commented 2 years ago

Steps to demonstrate that sqlite-utils insert is not affected:

curl -o artsdatabanken.csv https://artsdatabanken.no/Fab2018/api/export/csv
sqlite-utils insert arts.db artsdatabanken artsdatabanken.csv --sniff --csv --encoding utf-16le

simonw commented 2 years ago

I don't understand why that works but calling insert_all() does not.

simonw commented 2 years ago

Fixing that key thing (to ignore any key that is None) revealed a new bug:

File ~/Dropbox/Development/sqlite-utils/sqlite_utils/utils.py:376, in hash_record(record, keys)
    373 if keys is not None:
    374     to_hash = {key: record[key] for key in keys}
    375 return hashlib.sha1(
--> 376     json.dumps(to_hash, separators=(",", ":"), sort_keys=True, default=repr).encode(
    377         "utf8"
    378     )
    379 ).hexdigest()

File ~/.pyenv/versions/3.8.2/lib/python3.8/json/__init__.py:234, in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    232 if cls is None:
    233     cls = JSONEncoder
--> 234 return cls(
    235     skipkeys=skipkeys, ensure_ascii=ensure_ascii,
    236     check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    237     separators=separators, default=default, sort_keys=sort_keys,
    238     **kw).encode(obj)

File ~/.pyenv/versions/3.8.2/lib/python3.8/json/encoder.py:199, in JSONEncoder.encode(self, o)
    195         return encode_basestring(o)
    196 # This doesn't pass the iterator directly to ''.join() because the
    197 # exceptions aren't as detailed.  The list call should be roughly
    198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
    200 if not isinstance(chunks, (list, tuple)):
    201     chunks = list(chunks)

File ~/.pyenv/versions/3.8.2/lib/python3.8/json/encoder.py:257, in JSONEncoder.iterencode(self, o, _one_shot)
    252 else:
    253     _iterencode = _make_iterencode(
    254         markers, self.default, _encoder, self.indent, floatstr,
    255         self.key_separator, self.item_separator, self.sort_keys,
    256         self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)

TypeError: '<' not supported between instances of 'NoneType' and 'str'
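That's json.dumps(sort_keys=True) choking on the None key: sorting the keys means comparing None against the string keys. A minimal reproduction:

>>> import json
>>> json.dumps({None: 1, "name": "Cleo"}, sort_keys=True)
Traceback (most recent call last):
  ...
TypeError: '<' not supported between instances of 'NoneType' and 'str'
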
simonw commented 2 years ago

Here are full steps to replicate the bug:

from urllib.request import urlopen
import sqlite_utils
db = sqlite_utils.Database(memory=True)
with urlopen("https://artsdatabanken.no/Fab2018/api/export/csv") as fab:
    reader, other = sqlite_utils.utils.rows_from_file(fab, encoding="utf-16le")
    db["fab2018"].insert_all(reader, pk="Id")
simonw commented 2 years ago

Aha! I think I see what's happening here. Here's what DictReader does if one of the lines has too many items in it:

>>> import csv, io
>>> list(csv.DictReader(io.StringIO("id,name\n1,Cleo,nohead\n2,Barry")))
[{'id': '1', 'name': 'Cleo', None: ['nohead']}, {'id': '2', 'name': 'Barry'}]

See how the row with too many items gets this: {'id': '1', 'name': 'Cleo', None: ['nohead']}

That's a None for the key and (weirdly) a list containing the single item for the value!

simonw commented 2 years ago

That weird behaviour is documented here: https://docs.python.org/3/library/csv.html#csv.DictReader

If a row has more fields than fieldnames, the remaining data is put in a list and stored with the fieldname specified by restkey (which defaults to None). If a non-blank row has fewer fields than fieldnames, the missing values are filled-in with the value of restval (which defaults to None).
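Both parameters are easy to demonstrate by passing them explicitly:

>>> import csv, io
>>> list(csv.DictReader(io.StringIO("id,name\n1,Cleo,nohead\n2"), restkey="extras", restval=""))
[{'id': '1', 'name': 'Cleo', 'extras': ['nohead']}, {'id': '2', 'name': ''}]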

simonw commented 2 years ago

So I need to make a design decision here: what should sqlite-utils do with CSV files that have rows with more values than there are headings?

Some options:

- Raise an exception, so the user knows their file has rows with too many values in them
- Silently ignore any extra values
- Capture the extra values in a list, stored under a designated key (like csv.DictReader's restkey)

simonw commented 2 years ago

Whatever I decide, I can implement it in rows_from_file(), maybe as an optional parameter - then decide how to call it from the sqlite-utils insert CLI (perhaps with a new option there too).
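A rough sketch of that idea (hypothetical helper and parameter names, not the actual implementation) - the CSV branch could normalize each DictReader row before yielding it:

def fix_extras(rows, ignore_extras=False, restkey=None):
    # Hypothetical helper: handle the None key that csv.DictReader
    # uses for values that had no matching heading.
    for row in rows:
        if None in row:
            extras = row.pop(None)
            if restkey is not None:
                # Stash the extra values in a list under the chosen key
                row[restkey] = extras
            elif not ignore_extras:
                raise ValueError("Row has more values than headings: {!r}".format(extras))
        yield row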

simonw commented 2 years ago

Here's the current function signature for rows_from_file():

https://github.com/simonw/sqlite-utils/blob/26e6d2622c57460a24ffdd0128bbaac051d51a5f/sqlite_utils/utils.py#L174-L179

simonw commented 2 years ago

Decision: I'm going to default to raising an exception if a row has too many values in it.

You'll be able to pass ignore_extras=True to ignore those extra values, or pass restkey="the_rest" to stick them in a list in the restkey column.

simonw commented 2 years ago

The exception will be called RowError.
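So iterating over the rows from a CSV file containing a bad row would raise it. A sketch, assuming RowError can be imported from sqlite_utils.utils and that passing format=Format.CSV skips delimiter sniffing:

import io
from sqlite_utils.utils import Format, RowError, rows_from_file

rows, _ = rows_from_file(io.BytesIO(b"id,name\r\n1,Cleo,oops"), format=Format.CSV)
try:
    list(rows)  # the reader is lazy, so the error should surface on iteration
except RowError as ex:
    print(ex)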

simonw commented 2 years ago

Interesting challenge in writing tests for this: if you give csv.Sniffer a short example with an invalid row in it sometimes it picks the wrong delimiter!

id,name\r\n1,Cleo,oops

It decided the delimiter there was e.
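That's easy to confirm directly:

>>> import csv
>>> csv.Sniffer().sniff("id,name\r\n1,Cleo,oops").delimiter
'e'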

simonw commented 2 years ago

I think that's unavoidable: it looks like csv.Sniffer only works if you feed it a CSV file with an equal number of values in each row, which is understandable.

simonw commented 2 years ago

That broke mypy:

sqlite_utils/utils.py:229: error: Incompatible types in assignment (expression has type "Iterable[Dict[Any, Any]]", variable has type "DictReader[str]")
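One common idiom for this (a sketch of the pattern, not the actual sqlite-utils code) is to declare the broader type up front, so the same name can later be rebound to a generator:

from typing import Any, Dict, Iterable
import csv, io

rows: Iterable[Dict[Any, Any]] = csv.DictReader(io.StringIO("id,name\n1,Cleo"))
rows = (dict(row) for row in rows)  # rebinding now type-checks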

simonw commented 2 years ago

Getting this past mypy is really hard!

% mypy sqlite_utils
sqlite_utils/utils.py:189: error: No overload variant of "pop" of "MutableMapping" matches argument type "None"
sqlite_utils/utils.py:189: note: Possible overload variants:
sqlite_utils/utils.py:189: note:     def pop(self, key: str) -> str
sqlite_utils/utils.py:189: note:     def [_T] pop(self, key: str, default: Union[str, _T] = ...) -> Union[str, _T]

That's because of this line:

row.pop(None)

Which is legit here - we have a dictionary where one of the keys is None and we want to remove that key. But the baked in type is apparently def pop(self, key: str) -> str.
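The usual escape hatches are a "# type: ignore" comment on that line, or typing the row so that a None key is legal. A sketch of the latter:

from typing import Any, Dict

# With Dict[Any, Any], mypy accepts None as a key
row: Dict[Any, Any] = {"id": "1", "name": "Cleo", None: ["oops"]}
extras = row.pop(None)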

simonw commented 2 years ago

Filed an issue against python/typeshed.

simonw commented 2 years ago

I'm going to rename restkey to extras_key for consistency with ignore_extras.

simonw commented 2 years ago

Documentation: https://sqlite-utils.datasette.io/en/latest/python-api.html#reading-rows-from-a-file
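A quick sketch of the documented API in use, based on the page linked above (extras_key= captures extra values in a list; ignore_extras=True would drop them instead):

import io
from sqlite_utils.utils import Format, rows_from_file

rows, format_used = rows_from_file(
    io.BytesIO(b"id,name\r\n1,Cleo,oops"),
    format=Format.CSV,
    extras_key="rest",
)
print(list(rows))
# [{'id': '1', 'name': 'Cleo', 'rest': ['oops']}]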

simonw commented 2 years ago

I forgot to add equivalents of extras_key= and ignore_extras= to the CLI tool - will do that in a separate issue.