harelba / q

q - Run SQL directly on delimited files and multi-file sqlite databases
http://harelba.github.io/q/
GNU General Public License v3.0

Inconsistent results depending on the contents of the datafile. #141

Open barrycforever opened 7 years ago

barrycforever commented 7 years ago

Hi,

I think this is a great idea! I tried to use "q", though I got inconsistent results with different text files:

C:\Users\barryc\Desktop>"C:\Program Files (x86)\q-TextAsData\q.exe" -H -d, "select Timestamp,Entity,max(Value) from .\textfile1.csv where Unit='%' and Entity='CHBARRYCAPP1' group by Entity, Timestamp"
1/10/2017,CHBARRYCAPP1,15.52

C:\Users\barryc\Desktop>"C:\Program Files (x86)\q-TextAsData\q.exe" -H -d, "select Timestamp,Entity,max(Value) from .\textfile2.csv where Unit='%' and Entity='CHBARRYCAPP1' group by Entity, Timestamp"
1/10/2017,CHBARRYCAPP1,7.33

C:\Users\barryc\Desktop>

The maximum Value in the file is 15.52, but q returns 7.33 after one more line is added. Please see the second csv file.

Thanks for your help on this! Barry

PS I've uploaded the files. I had to add a "txt" extension because this page wouldn't take csv files:

textfile1.csv.txt textfile2.csv.txt

bitti commented 7 years ago

It seems q gets confused when integers and floats are mixed in the same column, and it falls back to string interpretation in that case ('7' is the largest first character in your example, hence 7.33 is the "maximum" string). To force numeric interpretation, you can modify your query slightly:

select Timestamp,Entity,max(Value+0) from .\textfile2.csv where Unit='%' and Entity='CHBARRYCAPP1' group by Entity, Timestamp
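Since q loads the data into SQLite, the behavior is easy to reproduce with plain sqlite3 (this is an illustrative sketch, not q's own code): once a column is stored as TEXT, max() compares lexicographically, and adding +0 coerces each value to a number before the comparison.

```python
import sqlite3

# When a column ends up stored as TEXT, SQLite's max() compares strings
# lexicographically, so '7.33' beats '15.52'. Adding +0 forces a numeric
# conversion before the comparison -- the same trick as max(Value+0) above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (Value TEXT)")
con.executemany("INSERT INTO t VALUES (?)", [("15.52",), ("7.33",), ("12",)])

text_max = con.execute("SELECT max(Value) FROM t").fetchone()[0]
num_max = con.execute("SELECT max(Value + 0) FROM t").fetchone()[0]
print(text_max)  # '7.33' -- lexicographic maximum
print(num_max)   # 15.52  -- numeric maximum
```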

barrycforever commented 7 years ago

A little disappointing, but the workaround gave the right result in both the small csv and the original csv. Thanks!

harelba commented 7 years ago

Hi, sorry for the late reply.

As @bitti wrote (thanks!), this is indeed an issue with the column type not being autodetected as float. It relates to an old issue I opened a long time ago but never got around to solving (#57).

I will take a deeper look to see how this can be fixed asap.

Thanks, Harel

Btw, obviously not a good enough workaround, but writing 205.0 in the additional row would also have avoided the problem.
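That detail is consistent with per-value type inference that degrades the whole column to text when the inferred types disagree. The following is only an illustrative sketch of such a detector (not q's actual detection code), reproducing the reported behavior: mixing "205" (int) and "15.52" (float) degrades the column to text, while "205.0" keeps it float.

```python
# Illustrative only -- not q's real detection code.
def guess_value_type(s):
    """Guess the type of a single field: int, float, or str."""
    try:
        int(s)
        return int
    except ValueError:
        pass
    try:
        float(s)
        return float
    except ValueError:
        return str

def guess_column_type(values):
    """A naive detector that requires every value to parse as the SAME type;
    any mixture falls back to text, matching the behavior seen in this issue."""
    types = {guess_value_type(v) for v in values}
    return types.pop() if len(types) == 1 else str

print(guess_column_type(["15.52", "7.33"]))   # float
print(guess_column_type(["15.52", "205"]))    # str  (int/float mix -> text)
print(guess_column_type(["15.52", "205.0"]))  # float
```

A more forgiving detector would promote an int/float mixture to float instead of falling back to text, which is essentially what fixing #57 would require.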

In general, checking which types q detected on a certain file can be done by adding -A to the command line, as follows:

$ q -d , -H -A "select * from textfile1.csv.txt"
Table for file: textfile1.csv.txt
  `Value` - float
  `Timestamp` - text
  `MetricId` - text
  `Unit` - text
  `Description` - text
  `Entity` - text
  `EntityId` - text
  `IntervalSecs` - int
  `Instance` - text
$ q -d , -H -A "select * from textfile2.csv.txt"
Table for file: textfile2.csv.txt
  `Value` - text
  `Timestamp` - text
  `MetricId` - text
  `Unit` - text
  `Description` - text
  `Entity` - text
  `EntityId` - text
  `IntervalSecs` - int
  `Instance` - text
$
barrycforever commented 7 years ago

Thanks for the hint on previewing how the data is interpreted. I should have realized that nothing comes for free. :)

Is there any way to provide a hint of the datatype (or force the interpretation of a datatype)? Thanks!

harelba commented 7 years ago

Sorry, it's not possible to provide a hint for the datatype.
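While q takes no type hints, the coercion can live in the SQL itself: besides the `Value+0` trick above, standard SQL `CAST` works and makes the numeric intent explicit. A sketch using plain sqlite3 to illustrate (again, not q's own code):

```python
import sqlite3

# CAST(... AS REAL) converts each TEXT value to a number before max(),
# equivalent in effect to the Value+0 workaround shown earlier.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (Value TEXT)")
con.executemany("INSERT INTO t VALUES (?)", [("15.52",), ("7.33",)])

row = con.execute("SELECT max(CAST(Value AS REAL)) FROM t").fetchone()
print(row[0])  # 15.52
```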

barrycforever commented 7 years ago

Okay, thanks.