harelba / q

q - Run SQL directly on delimited files and multi-file sqlite databases
http://harelba.github.io/q/
GNU General Public License v3.0
10.19k stars 422 forks source link

Auto-detection of commas in column names #118

Open llagerlof opened 8 years ago

llagerlof commented 8 years ago

I have this "people.csv" file:

id,name,age,gender
1,Lawrence,39,M
2,Oliver,25,M
3,Roberta,17,F
4,Bell,70,F
5,Dalek,12,M
6,Andrews,10,M
7,Marty,35,M
8,"Anna Lammas",40,F
9,"Marcos Vinicius",16,M
10,Lara,32,F

If I try to run the command below, an error is raised.

C:\test\q -H "select * from ./people.csv where gender = 'F'"

Bad header row: Header must contain only strings and not numbers or empty strings: 'id,name,age,gender'
'id,name,age,gender': Column name cannot contain commas

However, this command works:

C:\test>q -H -d , "select * from ./people.csv where gender = 'F'"
3,Roberta,17,F
4,Bell,70,F
8,Anna Lammas,40,F
10,Lara,32,F

Well, if a CSV file is, by default, comma separated, q should identify the first row as comma separated too, I think.

Thanks!

bitti commented 8 years ago

Even though the term "CSV" comes from "Comma Separated Values" the file extension "csv" doesn't define the delimiter. From https://en.wikipedia.org/wiki/Comma-separated_values:

In popular usage, however, the term CSV may denote some closely related delimiter-separated formats, which use a variety of different field delimiters.

That's also the reason you have to specify the delimiter when you import a .csv into a spreadsheet application.

llagerlof commented 8 years ago

Understood, thanks.

But then the -d parameter should be always mandatory. Maybe q could raise an error or warning if the user not specify it.

bitti commented 8 years ago

That being said: there is already a feature request for automatic delimiter detection (https://github.com/harelba/q/issues/93). Even though I argued against it I guess it would be fine to have an explicit flag to have at least a detection for the most simple cases.

llagerlof commented 8 years ago

Ok. I think I can live with the always-present -d option.

And I will wait this automatic delimiter detection.

@harelba, you can close this issue, if you wish. Thanks.

harelba commented 8 years ago

Hi all sorry for the late reply, I'm on a long vacation in south east Asia and Internet here is intermittent :)

Anyway, I'll probably implement the automatic delimiter detection right after I come back from the vacation, shouldn't be too hard. The original reason for space being the default delimiter was actually because many Linux tools output their table-like data using spaces, and I didn't want to break compatibility by changing the default.

Thank you all for the input. Hares

JiFish commented 3 years ago

I'm not sure if this project is still in active development, but I found it very useful so thank you. But the default delimiter did catch me out.

If auto-detection is far off, I would also +1 changing the default. I know, people should RTFM, but the other posters are right that this is going to be 90% of people's use cases.

Anyway, thanks for the software.

harelba commented 3 years ago

Hi, thanks @llagerlof about the post, and everyone for responding.

The reason for space being the default delimiter is because of very old backward compatibility reasons.

I'm just finishing a very large change in q (v3), which will provide a lot more functionality, but will break full-blown backward compatibility for some minor things.

Hopefully, I'll be able to implement the delimiter autodetection as part of this release, or at least changing the default to ,, which is probably what is used most of the times.