opensearch-project / opensearch-spark

Spark Accelerator framework ; It enables secondary indices to remote data stores.
Apache License 2.0
14 stars 23 forks source link

[FEATURE] Tumble function doesn't support expression #626

Open dai-chen opened 3 weeks ago

dai-chen commented 3 weeks ago

Is your feature request related to a problem?

A ClassCastException when using the TUMBLE function with expressions in a CREATE MATERIALIZED VIEW statement.

For example:

CREATE MATERIALIZED VIEW test_day AS
SELECT
  COUNT(1),
  window.start
FROM
  test
GROUP BY
  TUMBLE(CAST(FROM_UNIXTIME(time) AS TIMESTAMP), '1 Hour')
ORDER BY
  window.start;
...

java.lang.ClassCastException: class org.apache.spark.sql.catalyst.expressions.Cast cannot be cast to class org.apache.spark.sql.catalyst.expressions.Attribute (org.apache.spark.sql.catalyst.expressions.Cast and org.apache.spark.sql.catalyst.expressions.Attribute are in unnamed module of loader 'app')
    at org.opensearch.flint.spark.mv.FlintSparkMaterializedView$WindowingAggregate$.unapply(FlintSparkMaterializedView.scala:132)
    at org.opensearch.flint.spark.mv.FlintSparkMaterializedView$$anonfun$1.applyOrElse(FlintSparkMaterializedView.scala:87)
    at org.opensearch.flint.spark.mv.FlintSparkMaterializedView$$anonfun$1.applyOrElse(FlintSparkMaterializedView.scala:86)
...

What solution would you like?

Support expression in TUMBLE function. This is especially useful when time column in the source dataset is not timestamp type.

What alternatives have you considered?

Alternatively, using subquery can be a workaround:

CREATE MATERIALIZED VIEW test_day AS
SELECT
  COUNT(1),
  window.start
FROM (
    SELECT CAST(FROM_UNIXTIME(start) AS TIMESTAMP) AS startTime
    FROM test
)
GROUP BY
  TUMBLE(startTime, '1 Hour')
ORDER BY
  window.start
...

Do you have any additional context?

The first thing is to confirm if Spark can support event time defined by an expression.