opensearch-project / opensearch-spark

Spark Accelerator framework ; It enables secondary indices to remote data stores.
Apache License 2.0
22 stars 33 forks source link

[BUG] Filter on timestamp fields compares date only instead of datetime #560

Open engechas opened 3 months ago

engechas commented 3 months ago

What is the bug? This is a bug with direct query + iceberg.

When adding a filter to a query based on a timestamp field compared to a datetime, the filter is only considering the date from the datetime.

Example:

SELECT * FROM validation.amazon_security_lake_glue_db_us_west_2.amazon_security_lake_table_us_west_2_route53_2_0 WHERE time_dt > '2024-08-09T00:00:00Z' ORDER BY time_dt ASC LIMIT 10

The above query is expected to return all results from 2024-08-09T00:00:01Z to the current time. Instead it only returns results for entries 2024-08-10T00:00:00Z to the current time.

Changing the operator from > to >= returns the expected results.

It looks like Iceberg/Spark are comparing only the dates from the datetime filter.

Explained query for the above example looks like it has converted the datetime to an epoch microsecond correctly, but the query results do not align with that.

== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[time_dt#2099 ASC NULLS FIRST], output=[metadata#2095,cloud#2096,src_endpoint#2097,time#2098L,time_dt#2099,query#2100,answers#2101,connection_info#2102,dst_endpoint#2103,firewall_rule#2104,severity_id#2105,severity#2106,class_name#2107,class_uid#2108,category_name#2109,category_uid#2110,activity_id#2111,activity_name#2112,type_uid#2113L,type_name#2114,rcode_id#2115,rcode#2116,disposition#2117,action#2118,... 6 more fields])
+- *(1) Project [metadata#2095, cloud#2096, src_endpoint#2097, time#2098L, time_dt#2099, query#2100, answers#2101, connection_info#2102, dst_endpoint#2103, firewall_rule#2104, severity_id#2105, severity#2106, class_name#2107, class_uid#2108, category_name#2109, category_uid#2110, activity_id#2111, activity_name#2112, type_uid#2113L, type_name#2114, rcode_id#2115, rcode#2116, disposition#2117, action#2118, ... 6 more fields]
   +- *(1) Filter (time_dt#2099 > 2024-08-09 00:00:00)
      +- BatchScan[metadata#2095, cloud#2096, src_endpoint#2097, time#2098L, time_dt#2099, query#2100, answers#2101, connection_info#2102, dst_endpoint#2103, firewall_rule#2104, severity_id#2105, severity#2106, class_name#2107, class_uid#2108, category_name#2109, category_uid#2110, activity_id#2111, activity_name#2112, type_uid#2113L, type_name#2114, rcode_id#2115, rcode#2116, disposition#2117, action#2118, ... 6 more fields] spark_catalog.amazon_security_lake_glue_db_us_west_2.amazon_security_lake_table_us_west_2_route53_2_0 (branch=null) [filters=time_dt IS NOT NULL, time_dt > 1723161600000000, groupedBy=] RuntimeFilters: []

How can one reproduce the bug? Steps to reproduce the behavior:

  1. Create an Iceberg table with a timestamp field
  2. Filter on this field using a datetime

What is the expected behavior? Filters based on datetimes should return results after the datetime rather than the date within the datetime.

What is your host/environment?

Do you have any screenshots? If applicable, add screenshots to help explain your problem.

Do you have any additional context? Add any other context about the problem.

dblock commented 2 months ago

[Weekly Catch All Triage - 1]