simonw / sqlite-utils

Python CLI utility and library for manipulating SQLite databases
https://sqlite-utils.datasette.io
Apache License 2.0
1.58k stars 106 forks source link

Handling CSV/file input that contains NUL bytes #582

Open betatim opened 11 months ago

betatim commented 11 months ago

I was using sqlite-utils to create a DB from a CSV and it turns out the CSV contains a NUL byte.

When the processing reaches the line that contains the NUL an exception is raised.

I'm wondering if there is something that can be done in sqlite-utils to say "skip lines with encoding errors" or some such. I think it isn't super straightforward though as the exception comes from inside the csv module that does all the parsing.

Concretely the file is the KernelVersions.csv from https://www.kaggle.com/datasets/kaggle/meta-kaggle

This is the command and output:

$ sqlite-utils insert --csv kaggle.db kaggle KernelVersions.csv
  [------------------------------------]    0%
  [#####################---------------]   60%  00:04:24Traceback (most recent call last):
  File "/home/foobar/miniconda/envs/meta-kaggle/bin/sqlite-utils", line 10, in <module>
    sys.exit(cli())
  File "/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/sqlite_utils/cli.py", line 1223, in insert
    insert_upsert_implementation(
  File "/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/sqlite_utils/cli.py", line 1085, in insert_upsert_implementation
    db[table].insert_all(
  File "/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/sqlite_utils/db.py", line 3198, in insert_all
    chunk = list(chunk)
  File "/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/sqlite_utils/db.py", line 3742, in fix_square_braces
    for record in records:
  File "/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/sqlite_utils/cli.py", line 1071, in <genexpr>
    docs = (decode_base64_values(doc) for doc in docs)
  File "/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/sqlite_utils/cli.py", line 1068, in <genexpr>
    docs = (verify_is_dict(doc) for doc in docs)
  File "/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/sqlite_utils/cli.py", line 1003, in <genexpr>
    docs = (dict(zip(headers, row)) for row in reader)
_csv.Error: line contains NUL