quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License
11.82k stars 657 forks source link

Add support for `interval` field for Date histogram #2037

Open fmassot opened 1 year ago

fmassot commented 1 year ago

As described in https://github.com/quickwit-oss/quickwit/issues/3250, we need to support the OpenSearch API too...

To reach that for the date histogram, we need to support the interval field in addition to the fixed_interval field (there is a calendar_interval field but we don't support it yet).

We discussed this issue with @PSeitz, we have 2 ways of doing it:

  1. having separate structs to deserialize the JSON formatted date histogram aggregation
  2. adding a new field interval on the DateHistogramAggregationReq and adding some validation/preprocessing so that we:
    • raise an error if both interval and fixed_interval or interval and calendar_interval are present
    • if interval is present, use it to populate fixed_interval as we only support fixed_interval for now.
fulmicoton commented 1 year ago

@PSeitz Can you take care of this?

We can also discuss long term plans to ditch the JSON representation of aggregations in tantivy, and just have agregation definition "objects" in there. We would then have a JSON -> Aggregation definition object translation layer in quickwit instead. However we need to support this for airmail.

PSeitz commented 1 year ago

Elastic search 7.2:

[7.2] Deprecated in 7.2. interval field is deprecated Historically both calendar and fixed intervals were configured in a single interval field, which led to confusing semantics. Specifying 1d would be assumed as a calendar-aware time, whereas 2d would be interpreted as fixed time. To get "one day" of fixed time, the user would need to specify the next smaller unit (in this case, 24h).

Mapping interval to fixed_interval is only an option for values that are fixed interval in es, otherwise that would break elastic search compatibility.

Elastic search and OpenSearch should not be mixed in the same API endpoint, if they are not compatible, which seems to be the the case for date histogram. So I think we need an explicit OpenSearch endpoint in quickwit.

We don't need separate structs, it could be just a union which then gets validated for elastic search or open search compatibility.

As of now, I don't think we gain something by ditching the JSON representation of aggregations in tantivy, but it would cause a lot of code duplication.

fmassot commented 1 year ago

oh yes sorry for my lack of accuracy, @PSeitz you indeed mentioned the solution of having a dedicated endpoint for OpenSearch.

Mapping interval to fixed_interval is only an option for values that are fixed interval in es, otherwise that would break elastic search compatibility.

Agreed. But we only need to handle fixed intervals in interval for airmail. We could support only that.

Elastic search and OpenSearch should not be mixed in the same API endpoint, if they are not compatible, which seems to be the the case for date histogram. So I think we need an explicit OpenSearch endpoint in quickwit.

Looking at the history of Elasticsearch, I don't see this as an incompatibility but as a support for older elasticsearch version. And when I look at this opensearch issue I would expect them to remove the interval support in the future (though it can take some time).

I think it's overkill to distinguish OpenSearch and Elasticsearch API just for this interval field. That being said, I acknowledge that it would be good to check all endpoints needed for airmail and check if we have other issues like this one. I'm going to do that tonight.

fmassot commented 1 year ago

@PSeitz Just to confirm that I did not find other queries with an API difference between Elasticsearch en OpenSearch for the airmail project.

I had a look at the following queries: query_string, exists, term, percentiles, multi_match.

fmassot commented 1 year ago

@PSeitz do you need more info on your side?

PSeitz commented 1 year ago

Looking at the OpenSearch docs, it's unclear if they handle calendar aware intervals or not. So should we implement the interval halfway, only with values higher than 1?

fmassot commented 1 year ago

For airmail, we need to support: \d+s, \d+m, \d+h. I think we can only support what we support currently for the fixed_interval. On OpenSearch, you can have a look at the code here.

PSeitz commented 1 year ago

1m is date-aware in elastic search, this is incompatible with the fixed interval handling we have. Most users are familiar with elastic search and they would all run into the same API issue (probably unknowingly for some time). 2m etc. is fixed interval on the interval field.

fmassot commented 1 year ago

you mean 1d was date-aware in elasticsearch? So to avoid issues, we could only support ms, s, m, h and avoid the problematic d.

PSeitz commented 1 year ago

1m, 1h, 1d, 1w, 1M, 1q, 1y. They are all date aware. All other values are not, e.g. 24h, 2w etc.

So we could implement the interval halfway, only for values higher than 1. It's a weird API, and will probably cause some user issues, but at least not returning wrong results.

I think it would be better if airmail could call the new API.

fmassot commented 1 year ago

Ok, I have just understood how this was broken before... I thought the issue was only on 1d when reading the comment you quoted previously. Moreover, I saw yesterday in the OpenSearch code that it should be already possible to use the fixed_interval parameter. It's just not documented. This old spec is crazier than I thought... Let's try to push back this on the client side.

(discussion ref)