--no-headers option for CSV and TSV

simonw commented 3 years ago

https://bl.iro.bl.uk/work/ns/3037474a-761c-456d-a00c-9ef3c6773f4c has a fascinating CSV file that doesn't have a header row - it starts like this:

Computation and measurement of turbulent flow through idealized turbine blade passages,,"Loizou, Panos A.",https://isni.org/isni/0000000136122593,,University of Manchester,https://isni.org/isni/0000000121662407,1989,Thesis (Ph.D.),,Physical Sciences,,,https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.232781,
"Prolactin and growth hormone secretion in normal, hyperprolactinaemic and acromegalic man",,"Prescott, R. W. G.",https://isni.org/isni/0000000134992122,,University of Newcastle upon Tyne,https://isni.org/isni/0000000104627212,1983,Thesis (Ph.D.),,Biological Sciences,,,https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.232784,

It would be useful if sqlite-utils insert ... --csv had a mechanism for importing files like this one.

simonw commented 3 years ago

I could combine this with #131 to allow types to be specified in addition to column names.

Probably need an option that means "ignore the existing heading row and use this one instead".

simonw commented 3 years ago

For the moment, a workaround can be to cat an additional row onto the start of the file.

echo "name,url,description" | cat - missing_headings.csv | sqlite-utils insert blah.db table - --csv

simonw commented 3 years ago

I'm going to detach this from the #131 column types idea.

The three things I need to handle here are:

The CSV file doesn't have a header row at all, so I need to specify what the column names should be
The CSV file DOES have a header row but I want to ignore it and use alternative column names
The CSV doesn't have a header row at all and I want to automatically use unknown1,unknown2... so I can start exploring it as quickly as possible.

Here's a potential design that covers the first two:

--replace-header="foo,bar,baz" - ignore whatever is in the first row and pretend it was this instead --add-header="foo,bar,baz" - add a first row with these details, to use as the header

It doesn't cover the "give me unknown column names" case though.

simonw commented 3 years ago

Another pattern that might be useful is to generate a header that is just "unknown1,unknown2,unknown3" for each of the columns in the rest of the file. This makes it easy to e.g. facet-explore within Datasette to figure out the correct names, then use sqlite-utils transform --rename to rename the columns.

I needed to do that for the https://bl.iro.bl.uk/work/ns/3037474a-761c-456d-a00c-9ef3c6773f4c example.

simonw commented 3 years ago

I just spotted that csv.Sniffer in the Python standard library has a .has_header(sample) method which detects if the first row appears to be a header or not, which is interesting. https://docs.python.org/3/library/csv.html#csv.Sniffer

simonw commented 3 years ago

Implementation tip: I have code that reads the first row and uses it as headers here: https://github.com/simonw/sqlite-utils/blob/8f042ae1fd323995d966a94e8e6df85cc843b938/sqlite_utils/cli.py#L689-L691

So If I want to use unknown1,unknown2... I can do that by reading the first row, counting the number of columns, generating headers based on that range and then continuing to build that generator (maybe with itertools.chain() to replay the record we already read).

simonw commented 3 years ago

I'm not convinced the .has_header() rules are useful for the kind of CSV files I work with: https://github.com/python/cpython/blob/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Lib/csv.py#L383

    def has_header(self, sample):
        # Creates a dictionary of types of data in each column. If any
        # column is of a single type (say, integers), *except* for the first
        # row, then the first row is presumed to be labels. If the type
        # can't be determined, it is assumed to be a string in which case
        # the length of the string is the determining factor: if all of the
        # rows except for the first are the same length, it's a header.
        # Finally, a 'vote' is taken at the end for each column, adding or
        # subtracting from the likelihood of the first row being a header.

simonw commented 3 years ago

For the moment I think just adding --no-header - which causes column names "unknown1,unknown2,..." to be used - should be enough.

Users can import with that option, then use sqlite-utils transform --rename to rename them.

simonw commented 3 years ago

I called this --no-headers for consistency with the existing output option: https://github.com/simonw/sqlite-utils/blob/427dace184c7da57f4a04df07b1e84cdae3261e8/sqlite_utils/cli.py#L61-L64

agguser commented 2 years ago

--no-headers does not work?

$ echo 'a,1\nb,2' | sqlite-utils memory --no-headers -t - 'select * from stdin'
a      1                                                                                                                             
---  ---                                                                                                                             
b      2

simonw / sqlite-utils

--no-headers option for CSV and TSV #228