harelba / q

q - Run SQL directly on delimited files and multi-file sqlite databases
http://harelba.github.io/q/
GNU General Public License v3.0

ER: allow null header values, at least for right-most columns #252

Open pkoppstein opened 3 years ago

pkoppstein commented 3 years ago

I wish to use q with some very large TSV files, the header line of each of which has trailing tabs corresponding to certain deliberately unnamed columns.

It would of course be possible to copy these files and add dummy headers, but that would be a hassle in various ways. I would therefore like to request an enhancement: some reasonable default behavior that allows q to be used directly on such files.

One obvious possibility, which other similar tools use, is to provide default names (e.g. cN for the unnamed column N). This would presumably entail few if any implementation complications, so I won't enumerate other possibilities here.

Thanks!

harelba commented 3 years ago

This is supposed to be supported by using the -c N parameter:

$ cat example-file
a,b,c
10,20,30,40,50
10,20,30,40,50
$ q -c 5 -H -d , "select a,b,c,c4,c5 from example-file" -A
Table for file: example-file
  `a` - int
  `b` - int
  `c` - int
  `c4` - int
  `c5` - int
$ q -c 5 -H -d , "select a,b,c,c4,c5 from example-file" -O
a,b,c,c4,c5
10,20,30,40,50
10,20,30,40,50
$ q -c 5 -H -d , "select a,b,c,c4,c5 from example-file" -O -m strict
Strict mode. Header row contains less columns than expected column count(3 vs 5)

The default parsing mode is relaxed, so this should work without any extra flags. As you can see above, strict mode raises an error instead.

If this doesn't work properly, can you please provide an example file with an identical structure (few lines of data would suffice)?

pkoppstein commented 3 years ago

Thanks for your response. Even if the -c option resolved the issue, it would in my opinion still be better to adhere to "Postel's law".

As it happens, though:

$ sed $'s/\t/X/g' q.tsv
aXbX
1X2
10X20

$ q -H -c 3 -t 'select count(*) from q.tsv'
Bad header row: Header must contain only strings and not numbers or empty strings: 'a,b,'
'': Column name must be a string
$ 

I've also tried several variations of the -m option...

harelba commented 3 years ago

This is actually a different use-case. It's not that the header row has fewer columns than the data rows; it's that the actual name of a header column is an empty string (because the header row ends in a \t).
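
To tell the two cases apart on a given file, a quick check of the header row can help. The following is only a sketch: `header_report` is a hypothetical helper written for this thread, not something q provides. It prints how many fields the first line has and how many of them are empty strings.

```shell
# Hypothetical helper (not part of q): inspect the header row of a file.
# $1 = field delimiter as awk understands it (e.g. ',' or '\t'), $2 = file
header_report() {
  head -n 1 "$2" | awk -F "$1" '{
    empty = 0
    # Count header fields whose name is the empty string.
    for (i = 1; i <= NF; i++) if ($i == "") empty++
    printf "fields=%d empty=%d\n", NF, empty
  }'
}
```

On the first example file (header `a,b,c`) this reports three fields and no empty names, so the header is simply shorter than the data. On a header ending in a tab, it reports a non-zero `empty` count, which is the case described above.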

This is indeed not supported currently, and q could benefit from having such an option. I'll take a look.
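
One possible workaround that needs no change to q is to rewrite only the header line on the fly, giving each empty name the default cN suggested earlier in this thread. This is a minimal sketch, assuming a tab-delimited file; `fill_header` is a hypothetical name, and the cN scheme is just the convention proposed above, not confirmed q behavior for this case.

```shell
# Hypothetical helper (not part of q): replace empty header names with cN,
# where N is the 1-based column position. Data rows pass through unchanged.
fill_header() {
  awk 'BEGIN { FS = OFS = "\t" }
       # Only the first line is treated as the header.
       NR == 1 { for (i = 1; i <= NF; i++) if ($i == "") $i = "c" i }
       { print }' "$@"
}
```

For the q.tsv example above, `fill_header q.tsv` would turn the header `a<TAB>b<TAB>` into `a<TAB>b<TAB>c3` and leave the data rows as they are. The result can be redirected to a new file or, if your q version reads from standard input, piped to it without making a copy.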