Closed: simonw closed this 3 years ago
I could combine this with #131 to allow types to be specified in addition to column names.
Probably need an option that means "ignore the existing heading row and use this one instead".
For the moment, a workaround is to cat an additional row onto the start of the file:
echo "name,url,description" | cat - missing_headings.csv | sqlite-utils insert blah.db table - --csv
I'm going to detach this from the #131 column types idea.
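The three things I need to handle here are:
- The CSV file has no header row, so I need to specify what the column names should be
- The CSV file does have a header row, but I want to ignore it and use alternative column names
- The CSV file has no header row and I want to automatically use unknown1,unknown2... so I can start exploring it as quickly as possible.
Here's a potential design that covers the first two: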
--replace-header="foo,bar,baz" - ignore whatever is in the first row and pretend it was this instead
--add-header="foo,bar,baz" - add a first row with these details, to use as the header
It doesn't cover the "give me unknown column names" case though.
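To make the difference between the two proposed options concrete, here is a minimal sketch using only the standard library csv module. The function names rows_replacing_header and rows_adding_header are hypothetical, invented for illustration; they are not part of sqlite-utils and this is not how the tool is implemented.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List


def rows_replacing_header(
    lines: Iterable[str], header: List[str]
) -> Iterator[Dict[str, str]]:
    """Hypothetical --replace-header: drop the file's first row, use `header`."""
    reader = csv.reader(lines)
    next(reader, None)  # discard whatever the file's own first row was
    for row in reader:
        yield dict(zip(header, row))


def rows_adding_header(
    lines: Iterable[str], header: List[str]
) -> Iterator[Dict[str, str]]:
    """Hypothetical --add-header: every row is data, `header` supplies names."""
    for row in csv.reader(lines):
        yield dict(zip(header, row))


names = ["foo", "bar", "baz"]
# A file that has a header row we want to ignore
replaced = list(rows_replacing_header(io.StringIO("a,b,c\n1,2,3\n"), names))
# A file with no header row at all
added = list(rows_adding_header(io.StringIO("1,2,3\n4,5,6\n"), names))
```

The only difference is whether the first row is thrown away or kept as data, which is why a single option that takes the names plus a flag for "the file already has a header" might also work.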
Another pattern that might be useful is to generate a header that is just "unknown1,unknown2,unknown3" for each of the columns in the rest of the file. This makes it easy to e.g. facet-explore within Datasette to figure out the correct names, then use sqlite-utils transform --rename to rename the columns.
I needed to do that for the https://bl.iro.bl.uk/work/ns/3037474a-761c-456d-a00c-9ef3c6773f4c example.
I just spotted that csv.Sniffer in the Python standard library has a .has_header(sample) method which detects if the first row appears to be a header or not, which is interesting. https://docs.python.org/3/library/csv.html#csv.Sniffer
Implementation tip: I have code that reads the first row and uses it as headers here: https://github.com/simonw/sqlite-utils/blob/8f042ae1fd323995d966a94e8e6df85cc843b938/sqlite_utils/cli.py#L689-L691
So if I want to use unknown1,unknown2... I can do that by reading the first row, counting the number of columns, generating headers based on that range and then continuing to build that generator (maybe with itertools.chain() to replay the record we already read).
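The approach described above could look something like this. It is a rough sketch rather than the actual cli.py code: the function name rows_with_unknown_headers is made up for this example, and it assumes the input is an iterable of CSV lines.

```python
import csv
import io
import itertools
from typing import Dict, Iterable, Iterator


def rows_with_unknown_headers(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield dicts keyed unknown1, unknown2, ... for a headerless CSV."""
    reader = csv.reader(lines)
    first_row = next(reader, None)
    if first_row is None:
        return  # empty input, nothing to yield
    # Count the columns in the first row and generate that many names
    headers = ["unknown{}".format(i + 1) for i in range(len(first_row))]
    # Replay the row we already consumed, then continue with the rest
    for row in itertools.chain([first_row], reader):
        yield dict(zip(headers, row))


rows = list(rows_with_unknown_headers(io.StringIO("a,1\nb,2\n")))
```

Because the first row is chained back in front of the reader, no data is lost and the file is still streamed rather than loaded into memory.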
I'm not convinced the .has_header() rules are useful for the kind of CSV files I work with: https://github.com/python/cpython/blob/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Lib/csv.py#L383
def has_header(self, sample):
    # Creates a dictionary of types of data in each column. If any
    # column is of a single type (say, integers), *except* for the first
    # row, then the first row is presumed to be labels. If the type
    # can't be determined, it is assumed to be a string in which case
    # the length of the string is the determining factor: if all of the
    # rows except for the first are the same length, it's a header.
    # Finally, a 'vote' is taken at the end for each column, adding or
    # subtracting from the likelihood of the first row being a header.
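For anyone who wants to try the heuristic against their own files, a small wrapper like the following is enough. The helper name first_row_looks_like_header is hypothetical, and treating a csv.Error (raised when the sniffer cannot determine the delimiter) as "no header" is an assumption made for this sketch, not documented behaviour.

```python
import csv


def first_row_looks_like_header(sample: str) -> bool:
    """Ask csv.Sniffer whether the first row of `sample` looks like a header."""
    try:
        return csv.Sniffer().has_header(sample)
    except csv.Error:
        # Assumption for this sketch: if the delimiter cannot be sniffed,
        # report "no header" instead of propagating the error.
        return False


# The second column is numeric in every row except the first
with_header = first_row_looks_like_header("name,age\nalice,30\nbob,25\ncarol,41\n")
# The first row has the same shape as the data rows
without_header = first_row_looks_like_header("a,1\nb,2\n")
```

Since the vote depends on column types and string lengths, a file whose columns are all variable-length text gives the heuristic nothing to work with, which matches the doubt expressed above.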
For the moment I think just adding --no-header - which causes column names "unknown1,unknown2,..." to be used - should be enough.
Users can import with that option, then use sqlite-utils transform --rename to rename them.
I called this --no-headers for consistency with the existing output option: https://github.com/simonw/sqlite-utils/blob/427dace184c7da57f4a04df07b1e84cdae3261e8/sqlite_utils/cli.py#L61-L64
--no-headers does not work?
$ echo 'a,1\nb,2' | sqlite-utils memory --no-headers -t - 'select * from stdin'
a 1
--- ---
b 2
https://bl.iro.bl.uk/work/ns/3037474a-761c-456d-a00c-9ef3c6773f4c has a fascinating CSV file that doesn't have a header row - it starts like this:
It would be useful if sqlite-utils insert ... --csv had a mechanism for importing files like this one.