harelba / q

q - Run SQL directly on delimited files and multi-file sqlite databases
http://harelba.github.io/q/
GNU General Public License v3.0
10.21k stars 423 forks source link

Smarter field delimiter detection and handling of column names #167

Open dclong opened 6 years ago

dclong commented 6 years ago

The tool q is useful but has limitations. I think the following features can be really useful additions to it.

  1. Smart auto detection of (best) field delimiters. It's appealing to use q to query output of Linux commands but unfortunately output of Linux commands are often not well structured. It can be a tab or space or multiple spaces. Also, some program output results as ASCII tables like the one below. All these non standard formatting makes it inconvenient to use q.
    |----|----|
    | c1 | c2 | 
    |----|----|
    | v1 | v2 | 
    |----|----|
  2. Support better handling of column names. I suggest to normalize column names, e.g., replace special characters such spaces, commas, etc. with underscores. Also, no matter there's a header or not, it's appealing to be able to refer to a column using index. _n is probably better than cn (where n is an integer) as users might have text files using c1, c2, etc. as column names. It's even better if we can reference a column using first k characters of the column names if it can already unique identify a column. For example, if a column is named repository, allow users to use rep to reference the column is there's no other columns starting with rep.

Many people use shell because it's quick and dirty. If q is not "quick", there's not much advantage over paring results using awk or even Python.

bitti commented 6 years ago
  1. q follows UNIX philosophy, therefore it's supposed to be predictable rather than trying to be smart. That also means it works great together with other UNIX tools. If a tool breaks that contract by outputting superfluous ornaments you need to write a special filter for them to massage the output into a proper format. But even such tools often have a flag to produce a machine-readable output, therefore you should study their documentation carefully before investing time in writing a wrapper.

  2. Why would that be an advantage? You could as well have a table with _1, _2 etc as column names. And if you have column names you might as well read them in with -H.

harelba commented 6 years ago

Hi @dclong , I'm glad you find q useful.

I agree with @bitti regarding fitting the unix philosophy, I believe that data cleanup/formatting should be done by using other, more suitable, linux tools. q doesn't attempt to replace the entire linux command-set, quite the contrary, at least from my viewpoint.

Regarding the automatic column name, I don't really believe that there's a significant difference between choosing cN and _N, both are short enough and anyone can get used to either without any implications.

As for "auto-completing" column names, I believe that something like that would be less predictable and stable. For example, think about a query which uses a column prefix (e.g. phone) that runs on a csv file. If the user will run the same query on a newer version of the csv which contains another column with the same prefix (e.g. phone2), then the query would either break or provide incorrect results. Such behavior is not something I would expect from a tool such as q, and the benefits of such auto-completion are marginal in my view.

The main advantages of q from my point of view are:

  1. It can auto-detect column types and handle encodings properly, treating text as actual data and not just as "characters"
  2. It provides a very familiar interface - sqlite3's SQL is very powerful and yet very mature
  3. It integrates well with the inherent composability of linux tools.

I'm not expecting q's traits to be suitable for every use case. All I can hope is that some users will find it useful for them.

Harel