`rows_from_file()` isn't part of the documented API but maybe it should be!
Steps to demonstrate that `sqlite-utils insert` is not affected:

```bash
curl -o artsdatabanken.csv https://artsdatabanken.no/Fab2018/api/export/csv
sqlite-utils insert arts.db artsdatabanken artsdatabanken.csv --sniff --csv --encoding utf-16le
```
I don't understand why that works but calling `insert_all()` does not.
Fixing that key thing (to ignore any key that is `None`) revealed a new bug:
```
File ~/Dropbox/Development/sqlite-utils/sqlite_utils/utils.py:376, in hash_record(record, keys)
    373 if keys is not None:
    374     to_hash = {key: record[key] for key in keys}
    375 return hashlib.sha1(
--> 376     json.dumps(to_hash, separators=(",", ":"), sort_keys=True, default=repr).encode(
    377         "utf8"
    378     )
    379 ).hexdigest()

File ~/.pyenv/versions/3.8.2/lib/python3.8/json/__init__.py:234, in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    232 if cls is None:
    233     cls = JSONEncoder
--> 234 return cls(
    235     skipkeys=skipkeys, ensure_ascii=ensure_ascii,
    236     check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    237     separators=separators, default=default, sort_keys=sort_keys,
    238     **kw).encode(obj)

File ~/.pyenv/versions/3.8.2/lib/python3.8/json/encoder.py:199, in JSONEncoder.encode(self, o)
    195     return encode_basestring(o)
    196 # This doesn't pass the iterator directly to ''.join() because the
    197 # exceptions aren't as detailed. The list call should be roughly
    198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
    200 if not isinstance(chunks, (list, tuple)):
    201     chunks = list(chunks)

File ~/.pyenv/versions/3.8.2/lib/python3.8/json/encoder.py:257, in JSONEncoder.iterencode(self, o, _one_shot)
    252 else:
    253     _iterencode = _make_iterencode(
    254         markers, self.default, _encoder, self.indent, floatstr,
    255         self.key_separator, self.item_separator, self.sort_keys,
    256         self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)

TypeError: '<' not supported between instances of 'NoneType' and 'str'
```
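The root cause is visible in that last frame: `hash_record()` calls `json.dumps(..., sort_keys=True)`, which has to sort the dictionary keys, and Python 3 refuses to compare `None` against a `str`. A minimal repro (not from the original report, just an illustration):

```python
import json

# A record shaped like what DictReader produces for a too-long row:
# string keys plus a None key holding the overflow values.
record = {"id": "1", "name": "Cleo", None: ["nohead"]}

# sort_keys=True tries to order None against the str keys, raising
# TypeError: '<' not supported between instances of 'NoneType' and 'str'
json.dumps(record, separators=(",", ":"), sort_keys=True, default=repr)
```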
Here are full steps to replicate the bug:
```python
from urllib.request import urlopen
import sqlite_utils

db = sqlite_utils.Database(memory=True)

with urlopen("https://artsdatabanken.no/Fab2018/api/export/csv") as fab:
    reader, other = sqlite_utils.utils.rows_from_file(fab, encoding="utf-16le")
    db["fab2018"].insert_all(reader, pk="Id")
```
Aha! I think I see what's happening here. Here's what `DictReader` does if one of the lines has too many items in it:

```python
>>> import csv, io
>>> list(csv.DictReader(io.StringIO("id,name\n1,Cleo,nohead\n2,Barry")))
[{'id': '1', 'name': 'Cleo', None: ['nohead']}, {'id': '2', 'name': 'Barry'}]
```
See how that row with too many items gets this:

```python
{'id': '1', 'name': 'Cleo', None: ['nohead']}
```

That's a `None` for the key and (weirdly) a list containing the single item for the value!
That weird behaviour is documented here: https://docs.python.org/3/library/csv.html#csv.DictReader

> If a row has more fields than fieldnames, the remaining data is put in a list and stored with the fieldname specified by restkey (which defaults to `None`). If a non-blank row has fewer fields than fieldnames, the missing values are filled-in with the value of restval (which defaults to `None`).
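That `restkey` parameter suggests one escape hatch. A quick illustration (the column name `extras` here is made up):

```python
import csv, io

# restkey gives the overflow list a real column name instead of None
rows = csv.DictReader(io.StringIO("id,name\n1,Cleo,nohead"), restkey="extras")
print(list(rows))
# [{'id': '1', 'name': 'Cleo', 'extras': ['nohead']}]
```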
So I need to make a design decision here: what should `sqlite-utils` do with CSV files that have rows with more values than there are headings?

Some options:

- Raise an error when a row has too many values in it.
- Silently ignore any extra values.
- Stick the extra values in a list in a special column (`None` is not a valid column name, so not that). This could be something like `_restkey` or `_values_with_no_heading`. This feels like a better option, but I'd need to carefully pick a name for it - and come up with an answer for the question of what to do if the CSV file being imported already uses that heading name for something else.

Whatever I decide, I can implement it in `rows_from_file()`, maybe as an optional parameter - then decide how to call it from the `sqlite-utils insert` CLI (perhaps with a new option there too).
Here's the current function signature for `rows_from_file()`:
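Approximately, it looks like this (a sketch reconstructed from the library around that time, assuming the `Format` enum from `sqlite_utils.utils`; exact details may differ):

```python
import csv
from typing import BinaryIO, Iterable, Optional, Tuple, Type

def rows_from_file(
    fp: BinaryIO,
    format: Optional["Format"] = None,      # sqlite_utils.utils.Format enum
    dialect: Optional[Type[csv.Dialect]] = None,
    encoding: Optional[str] = None,
) -> Tuple[Iterable[dict], "Format"]:
    ...
```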
Decision: I'm going to default to raising an exception if a row has too many values in it. You'll be able to pass `ignore_extras=True` to ignore those extra values, or pass `restkey="the_rest"` to stick them in a list in the `restkey` column.

The exception will be called `RowError`.
Interesting challenge in writing tests for this: if you give `csv.Sniffer` a short example with an invalid row in it, sometimes it picks the wrong delimiter!

```
id,name\r\n1,Cleo,oops
```

It decided the delimiter there was `e`.

I think that's unavoidable: it looks like `csv.Sniffer` only works if you feed it a CSV file with an equal number of values in each row, which is understandable.
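That mis-detection is easy to try for yourself (a quick illustration; the exact delimiter picked could in principle vary across Python versions):

```python
import csv

# Sniff a tiny sample whose second row has an extra value
dialect = csv.Sniffer().sniff("id,name\r\n1,Cleo,oops")
print(repr(dialect.delimiter))  # observed in this issue: 'e', not ','
```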
That broke `mypy`:

```
sqlite_utils/utils.py:229: error: Incompatible types in assignment (expression has type "Iterable[Dict[Any, Any]]", variable has type "DictReader[str]")
```
Getting this past `mypy` is really hard!

```
% mypy sqlite_utils
sqlite_utils/utils.py:189: error: No overload variant of "pop" of "MutableMapping" matches argument type "None"
sqlite_utils/utils.py:189: note: Possible overload variants:
sqlite_utils/utils.py:189: note:     def pop(self, key: str) -> str
sqlite_utils/utils.py:189: note:     def [_T] pop(self, key: str, default: Union[str, _T] = ...) -> Union[str, _T]
```
That's because of this line:

```python
row.pop(None)
```

Which is legit here - we have a dictionary where one of the keys is `None` and we want to remove that key. But the baked-in type is apparently `def pop(self, key: str) -> str`.
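One way around that kind of mismatch while the stubs catch up is to cast before popping (a sketch of a possible workaround, not necessarily the fix that shipped; a `# type: ignore` comment on the line would work too):

```python
from typing import Any, Dict, cast

def drop_none_key(row: Dict[str, Any]) -> Dict[str, Any]:
    # At runtime the dict really can hold a None key, even though the
    # declared key type is str; casting lets mypy accept the pop() call.
    cast(Dict[Any, Any], row).pop(None, None)
    return row
```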
Filed an issue against `python/typeshed`.
I'm going to rename `restkey` to `extras_key` for consistency with `ignore_extras`.
I forgot to add equivalents of `extras_key=` and `ignore_extras=` to the CLI tool - will do that in a separate issue.
Original title: csv.DictReader can have None as key

In some cases, `csv.DictReader` can have `None` as a key for unnamed columns, and a list of values as the value. `sqlite_utils.utils.rows_from_file` cannot handle that (see the replication steps and traceback above).
Code: https://github.com/simonw/sqlite-utils/blob/59be60c471fd7a2c4be7f75e8911163e618ff5ca/sqlite_utils/db.py#L3454
`sqlite-utils insert` from the command line is not affected by this issue.