ing-bank / scruid

Scala + Druid: Scruid. A library that allows you to compose queries in Scala, and parse the result back into typesafe classes.
Apache License 2.0

Support for Select, Scan and Search queries. #83

Closed anskarl closed 4 years ago

anskarl commented 4 years ago

Scruid currently supports aggregation queries (timeseries, group-by and top-n). It would also be useful to extend the library to support Select, Scan and Search queries.

While it is straightforward to implement such queries in Scruid, the format of the resulting data is different and cannot be handled by the current implementation.

Specifically, the resulting data for timeseries and group-by queries has the following format:

[
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": { ... }
  },
  {
    "timestamp": "2012-01-02T00:00:00.000Z",
    "result": { ... }
  }
]

It is an array of JSON structures, each composed of a timestamp and a result, which is itself a JSON structure.

The format of top-n query results is slightly different. Each timestamped row contains a result that is an array of JSON structures:

[
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": [{ ... }, { ... } ... ]
  },
  {
    "timestamp": "2012-01-02T00:00:00.000Z",
    "result": [{ ... }, { ... } ... ]
  }
]

The resulting data (array of JSON structures) of any aggregation query (timeseries, group-by and top-n) is handled by the class ing.wbaa.druid.DruidResponse and each result (array or not) is represented by the class ing.wbaa.druid.DruidResult.
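
The two aggregation shapes described above can be sketched as a small Scala model. This is only an illustration using the standard library. The names AggregationShapes, ResultPayload, Single, Multiple and TimestampedResult are hypothetical and are not Scruid's actual classes, and rows are simplified to Map[String, String] instead of parsed JSON.

```scala
// Hypothetical, simplified model of the aggregation response shapes.
// These are NOT the real ing.wbaa.druid classes.
object AggregationShapes {
  // Assumption: a row is flattened to string key/value pairs for brevity.
  type Row = Map[String, String]

  sealed trait ResultPayload
  // timeseries and group-by: "result" is a single JSON structure
  final case class Single(row: Row) extends ResultPayload
  // top-n: "result" is an array of JSON structures
  final case class Multiple(rows: List[Row]) extends ResultPayload

  final case class TimestampedResult(timestamp: String, result: ResultPayload)

  // Flattens both shapes into (timestamp, row) pairs.
  def rows(results: List[TimestampedResult]): List[(String, Row)] =
    results.flatMap { r =>
      r.result match {
        case Single(row)  => List(r.timestamp -> row)
        case Multiple(rs) => rs.map(r.timestamp -> _)
      }
    }
}
```

Treating "array or not" as two cases of one payload type keeps a single response type for all three aggregation queries, which is roughly the role the text assigns to DruidResult.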

Select queries return raw Druid rows and support pagination. The format of the resulting data is close to that of the aggregation queries, namely an array of JSON objects, each with a timestamp and a result that is a JSON structure:

[{
  "timestamp" : "2013-01-01T00:00:00.000Z",
  "result" : {
    "pagingIdentifiers" : {
      "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9" : 4
    },
    "events" : [ {
      "segmentId" : "wikipedia_editstream_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
      "offset" : 0,
      "event" : { ... }
    }, ... ]
  }
}, ... ]

The only difference is that the result structure contains an array of events, therefore it requires a different implementation of ing.wbaa.druid.DruidResponse.
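
As a rough sketch of why Select needs its own response type, the structure above could be modelled as follows. All names here (SelectShapes, SelectEvent, SelectResult, SelectRow, allEvents) are hypothetical, and the event payload is simplified to Map[String, String].

```scala
// Hypothetical model of a Select response; not Scruid's actual API.
object SelectShapes {
  // One entry of the "events" array: segment, offset and the raw row.
  final case class SelectEvent(segmentId: String, offset: Int, event: Map[String, String])

  // The "result" object: paging state plus the events of this page.
  final case class SelectResult(pagingIdentifiers: Map[String, Int], events: List[SelectEvent])

  final case class SelectRow(timestamp: String, result: SelectResult)

  // Collects the raw rows of all timestamped results, in order.
  def allEvents(rows: List[SelectRow]): List[Map[String, String]] =
    rows.flatMap(_.result.events.map(_.event))
}
```

Unlike the aggregation case, the rows of interest sit one level deeper (result.events[i].event), next to the pagingIdentifiers that a caller would need to request the following page.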

Scan queries do not support pagination like Select queries, but they are more efficient and return rows in streaming mode. Compared to aggregation queries, each entry of the result carries a segmentId instead of a timestamp. The timestamp, however, can be retrieved from the inner event structures. Below is an example fragment of the resulting data of a scan query:

[ {
    "segmentId" : "wikipedia_editstream_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
    "columns" : [ "timestamp", "dim1", "dim2", ... ],
    "events" : [ { "timestamp" : "2013-01-01T00:00:00.000Z", "dim1": "some_value", "dim2": "some_other_value", ... }, { ... }, ... ]
    }, ...
]
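
The scan fragment above could be modelled along these lines. This is a sketch with hypothetical names (ScanShapes, ScanResult, timestamped), with events simplified to Map[String, String]. The time column name is left as a parameter, because which key holds the timestamp is an assumption that depends on the query settings.

```scala
// Hypothetical model of a Scan response in list format; not Scruid's actual API.
object ScanShapes {
  type Event = Map[String, String]

  // One entry per segment: no top-level timestamp, only segmentId, columns and events.
  final case class ScanResult(segmentId: String, columns: List[String], events: List[Event])

  // Pairs each event with the timestamp read from the event itself.
  // The result is None when the event has no such column.
  def timestamped(
      results: List[ScanResult],
      timeColumn: String = "timestamp" // assumed default, taken from the example above
  ): List[(Option[String], Event)] =
    results.flatMap(_.events.map(e => e.get(timeColumn) -> e))
}
```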

Furthermore, scan queries can return data in a different format (compacted list) and also have a legacy mode for the timestamp dimension, in which timestamp is replaced by the __time dimension. For details see the official documentation.

Search queries return dimension values that match the search specification. The format is close to that of top-n queries: each entry has a timestamp field and a result that is an array of JSON structures.

[
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": [
      {
        "dimension": "dim1",
        "value": "some_value",
        "count": 3
      },
      {
        "dimension": "dim2",
        "value": "some_value",
        "count": 1
      }, ...
    ]
  }, ...
]

The main difference here is that the JSON structures in result always have the same fields: dimension, its value and the corresponding count. As a consequence, the list[T] and series[T] functions of ing.wbaa.druid.DruidResponse can only be applied to classes that have exactly those three fields. I think, however, that for practical reasons it is better to have list and series functions without type parameters that return a predefined class with those fields.
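
The proposal above, list and series without type parameters, might look roughly like this. The names SearchShapes, DimensionValueCount, SearchRow and SearchResponse are hypothetical and only illustrate the idea. They are not the classes introduced by the actual change.

```scala
import java.time.ZonedDateTime

// Hypothetical sketch of a Search response with a fixed row type.
object SearchShapes {
  // The three fields every search result entry carries.
  final case class DimensionValueCount(dimension: String, value: String, count: Long)

  final case class SearchRow(timestamp: ZonedDateTime, result: List[DimensionValueCount])

  final case class SearchResponse(rows: List[SearchRow]) {
    // All matches across timestamps, no type parameter needed.
    def list: List[DimensionValueCount] = rows.flatMap(_.result)

    // Matches grouped by timestamp.
    def series: Map[ZonedDateTime, List[DimensionValueCount]] =
      rows.groupBy(_.timestamp).map { case (ts, rs) => ts -> rs.flatMap(_.result) }
  }
}
```

Because the row type is fixed, callers do not have to supply their own case class and decoder, at the cost of a response type that is specific to Search queries.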

To support Select, Scan and Search queries, ing.wbaa.druid.DruidResponse and ing.wbaa.druid.DruidResult therefore have to be adapted, and minor changes have to be applied to ing.wbaa.druid.client.DruidClient and ing.wbaa.druid.client.DruidResponseHandler.

anskarl commented 4 years ago

Commit 7c0f737 implements the aforementioned changes. An outline of the changes is given below: