executeQuery - column type inference

jbguerraz commented 3 years ago

Rework the column type detect method. Eventually query Druid for columns data types.

hardikbajaj commented 1 year ago

@jbguerraz Can you please assign this issue to me

jbguerraz commented 1 year ago

WIth pleasure @hardikbajaj :) Here to help as much as needed

hardikbajaj commented 1 year ago

Hey @jbguerraz , I'm thinking of inferring column data type from druid itself. Through Druid sql API, I can request for typesHeader and sqlTypesHeader (For Ref)

With the help of sqlTypeHeader, I'm thinking of inferring data type as per following algorithm

switch sqlDataType {
case "TIMESTAMP" , "DATE":
    frame.DataType = []time.Time
case "CHAR" , "VARCHAR":
    frame.DataType = []string
case "BOOLEAN":
    frame.DataType = []bool
default:
    frame.DataType = []float64
}

I'm not sure about Complex data type though. Also, all type of Numbers would be handled as float64, Grafana will visualise them correctly I would like to have your inputs on this, and should I go ahead?

jbguerraz commented 1 year ago

Hello @hardikbajaj and thank you!! that enhancement rocks :)

Druid Web UI itselfs uses such query to display columns info in datasource details pane:

SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS
          WHERE TABLE_SCHEMA = 'druid' AND TABLE_NAME = '{{ datasourceName }}'

also, as per https://druid.apache.org/docs/latest/querying/segmentmetadataquery.html :

{
  "queryType":"segmentMetadata",
  "dataSource":"{{ datasourceName }}",
  "merge": true,
}

Interesting bits:

All columns contain a typeSignature that Druid uses to represent the column type information internally. The typeSignature is typically the same value used to identify the JSON type information at query or ingest time. One of: STRING, FLOAT, DOUBLE, LONG, or COMPLEX\<typeName>, e.g. COMPLEX\<hyperUnique>.

That would probably makes it better to use it to map that way:

STRING:string
FLOAT:float32
DOUBLE:float64
LONG: int64
COMPLEX<typeName>: Ignore for now

Additionally, JSON queries are a bit faster than SQL ones as far as I've seen in past (probably less parsing time).

Probably great to cache those datasources definitions for a period of time at least (like let's say 30 minutes by default ? could later be changed at datasource settings level ?) to avoid querying Druid for each request ? That would also be usable later on for providing autocomplete informations.

WDYT?

m-ghazanfar commented 1 year ago

@hardikbajaj Thanks for working on this. Minor comment: I think that the default case should return string type. This is because most data can be represented as a string. In case we've missed a case here or if there's a new type in the future string will continue to work. Converting to float might not always work and will crash the plugin.

hardikbajaj commented 1 year ago

Hey @jbguerraz ! Thanks for your feedback! This looks great but If I'm not wrong, the SegmentMetaData is also a native query type and It just provide sort of a Schema for existing data types, I think It would be difficult to assert data type for groupBy, scan, virtual columns and other aggregation queries, for which columns are specified by user. I think this is a great solution that would work completely on both SQL and Native queries, but should we first implement the frontend with this and provide autocomplete suggestion, and maybe find a way to assert all datatypes from frontend? Just a thought.

For now I think, the sql api provides us a complete solution, where we can get the data type in Headers from Druid response only. I am thinking of implementing column assertion for SQL queries from Druid, as It would be a great enhancement for inferring columns, and would be using SQL Types for the column, as they are giving us a more detailed knowledge on data type like TIMESTAMP and Float or integers.

@Ghazanfar-CFLT I'll make sure to refine the number types, and make default type as String

hardikbajaj commented 1 year ago

@jbguerraz , This PR change would be required for making SQL query type asserting column type completely from druid. Can you please release the latest druid-go Client?

hardikbajaj commented 1 year ago

Hey @jbguerraz , in case you missed it, can we release latest go-druid client to be used by plugin?

grafadruid / druid-grafana

executeQuery - column type inference #20