[RFC] Pre Compute Aggregations with Star Tree index

opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.

https://opensearch.org/docs/latest/opensearch/index/

Apache License 2.0

9.47k stars 1.74k forks source link

[RFC] Pre Compute Aggregations with Star Tree index #12498

Open bharath-techie opened 6 months ago

bharath-techie commented 6 months ago

Is your feature request related to a problem? Please describe

Aggregations are the most used query type in observability use cases and the aggregation is typically on metrics, request logs, etc. and spread across different fields. Heavy aggregation queries doesn’t have an upper bound in terms of time taken and resource consumption. In general, the performance of an aggregation query is relative to the number of documents.

Aggregations are faster when they are run on rolled up indices and transforms as they help in reducing granularity and providing materialized views. But, they are generally done once ingestion is complete, similar constructs doesn’t exists for live/active data.

Existing solutions

Aggregation framework

Existing aggregation framework in OpenSearch is quite powerful with bucket, pipeline and metric aggregations. It does support wide range of aggregations since it operates on the non altered live documents of the indices. And as it operates on live original data, deletes and updates are also supported.

While it works great, it has few cons

Query latency scales linearly with the number of documents.
- For example, aggregation which measures count of ‘status’ in the HTTP logs workload in OSB, takes a lot longer for '200' status [~200M documents] compared to '400' status [~3000 documents]
If there are multiple aggregations on different fields, each field / aggregation is processed separately and in general it leads to higher query latencies.
- The following example captures query time as user adds more aggregations:
  - if user queries on status = 200 + aggregation of sum (size) = ~4s
  - if user queries on status = 200 + aggregations of sum (size) , avg(size) = ~6s
  - if user queries on status = 200 + aggregations of sum (size) , avg(size) , max(size) = ~7.5s

Index rollups / Index transforms

Index rollup jobs lets users reduce data granularity by rolling up old data into condensed indices, transform jobs lets users create a different, summarized view of users' data centered around certain fields. These solutions are generally used to save storage space and also the queries are faster as the number of documents are reduced.

Cons:

These solutions are generally used to reduce granularity of older data and not used on real time data.
Configuring the jobs and managing the indices is cumbersome.
- Cannot search a mix of rollup and non-rollup indexes with the same query
- Initial rollup job can become expensive depending on the size of the data

Describe the solution you'd like

Star tree index

I’m exploring the use case of pre-computing the aggregations using Star Tree index while indexing the data based on the configured fields (dimensions and metrics) during index creation. This is inspired from http://hanj.cs.illinois.edu/pdf/vldb03_starcube.pdf and Apache Pinot’s Star Tree index. Star Tree helps to enforce upper bound on the aggregation queries ensuring predictable latency and resource usage, it is also storage space efficient and configurable.

Star Tree index is a multi-field index, contrary to existing index structures in Lucene and OpenSearch which are on single field.

Improvements with Star Tree index

Users can perform faster aggregations with a constant upper bound on query latency.
Star tree natively supports multi field aggregations.
Star tree index will be created in real time as part of regular indexing, so the data in star tree will always be up to date with the live data.
Star tree index consolidates the data and is a storage efficient index, and hence brings following benefits
- Complex aggregation queries will cause fraction of IO utilization compared to present solutions.
- Paging also will be more efficient as less data gets paged for every star tree query

While it provides the above improvements, it does have its limitations, they are:

Star tree index will be created only on append-only indices, updates or deletes will not be supported.
Star tree index will be used for aggregation queries only if the query input is a subset of the star tree configuration of dimensions and metrics
Star tree index will support limited set of aggregations such as MIN, MAX, COUNT, SUM, AVG. Others are yet to be explored
The cardinality of the dimensions should not be very high (like _id fields), otherwise it leads to storage explosion and higher query latencies.
Changing Star Tree config (dimension or metrics) will require a re-index operation.

Given the above details, it does makes a case on why Star Tree is valuable to OpenSearch, I’m working on prototyping it. I’ll follow up on this issue with prototype details along with the benchmark results.

Related component

Search:Aggregations

Describe alternatives you've considered

Mentioned in the Existing Solutions.

Additional context

No response

bharath-techie commented 6 months ago

Prototype Details

For the prototype, ‘DocValuesFormat’ in Lucene is extended to create Star Tree index with support for ‘Numeric’ fields as the dimensions of the Star Tree. Timestamp field is rounded off to various granularity of epoch such as ‘minute’, ‘hour’, ‘day’ etc. and are also added as dimensions to the Star Tree.

Fig: Star Tree Index Structure

Star tree index structure

Star Tree indexing

Star tree index structure

Star Tree index structure as portrayed in the above figure, consists of mainly two parts: the Star Tree and sorted + aggregated documents backed by doc value indices.
Each node in the Star Tree points to a range of documents.
- A node is further split into child nodes based on maxLeafDocs configuration.
- The number of documents a leaf node points to, should be less than or equal to maxLeafDocs. This ensures the maximum number of documents that gets traversed to get to the aggregated value is at most maxLeafDocs, thus providing predictable latencies.
There are special nodes called ‘star nodes’ which helps in skipping non competitive nodes and also in fetching aggregated document wherever applicable during query time.
The figure contains three examples explaining the star tree traversal during query: i) Average request size for all requests containing hour 11 and status 200, ii) Count of requests for status = 200 (across all hours, hence hour is set to ) and iii) Average request size of all requests for hour 12 (status is set to ). The examples (ii) and (iii) uses star nodes.

Creation Flow

Star Tree index is created during Lucene ‘flush’ operation.

The dimension and metric values are fetched from the segment using the corresponding doc values indices. These records are then sorted by dimension split order and the metrics are aggregated.
An initial Star Tree is built over this sorted, summarized data. Further hierarchical splitting and aggregation creates specialized star nodes, enabling pruning and aggregation shortcuts for queries.
The Star Tree is then expanded further for dimensions based on ‘maxLeafDocs’ configuration set during index creation.

Merge Flow

Merge flow reuses the pre-aggregated Star Tree docs that were created during flush operation.
Post that, merge flow repeats the star tree construction process [from step 2 of creation flow]

Lucene index files

Star Tree index is stored under (new) extension ‘.sti’ part of Lucene segment.
Individual doc values indices for all the dimensions and metrics of the Star Tree and files are saved under (new) extensions ‘.std’ and ‘.stm

TODO: Create an issue with Lucene on the new formats.

Star Tree query

A new query and aggregation spec is introduced for Star Tree queries. With this new query, OpenSearch will traverse through the Star Tree index and corresponding (Star Tree) doc values to get the results.
In Lucene, ‘DocValuesReader’ is extended to read the Star Tree indices and a new method ‘getAggregatedValues’ is exposed in LeafReader to read the Star Tree indices in query time. (TODO: Open issue with Lucene).
For prototype, since only ‘Numeric’ fields are supported, actual term queries of ‘text’ which involves postings format will be compared against later when text fields support is introduced in Star Tree.
Star Tree query support can be extended for:
- Term aggregations and multi term aggregations
- Date histogram aggregations
- Range query aggregations
- Filter aggregations - Filters can be on any number of dimensions in a single query. But across dimensions, only ‘MUST’ will be supported, as filters are applied on one dimension after the next as per the dimensions split order.

POC code

https://github.com/bharath-techie/OpenSearch/commit/1d44e1a7b2d0fe1b98f648eefc5e70ff2de959c2 - OpenSearch code changes

https://github.com/bharath-techie/lucene/commit/c993b9a89ab43e553537c24b40704d392a0a968e - Lucene code changes

bharath-techie commented 6 months ago

Benchmark Results

OpenSearch Setup

Instance type: Amazon EC2 r5.4x.large [16 VCPU and 128 GB of RAM]
Disk: Amazon EBS volume with 300 GB storage, 125 MiBS throughput and 3000 IOPS
Number of nodes: 1
JVM heap: 32 GB

Benchmark Setup

Instance type: Amazon EC2 r5.4xlarge
Number of nodes generating load: 1

Workload

180 GB of HTTP logs indexed data [around 4 months of data]
Document distribution based on status: 140k documents of 400s compared to 1 billion documents of 200s.

Star Tree config

Dimensions

Status
Minute
Hour
Day
Year

Metric

Count of requests

Search (primarily Aggregations)

Here is a summary of how Star Tree queries fared against the aggregation queries on present day indices. [against Points index mainly ]	Query (aggregation)	Median throughput	50 percentile Service time
200s in range count	8x better performance	5x faster	3M matching documents
400s in range count	40% improvement	40%	100 matching documents
Total 200s count	1000x better performance (best case scenario)	1000x	Best case scenario [1 billion docs]
Total 400s count	1.8x to 2x improvement	2x improvement	100k matching docs
Hourly histogram	15x improvement	15x improvement

Observations

Queries in Star Tree have bigger performance gains compared to existing solutions when the number of docs matching the query is large.
In addition to query performance improvement, there was also significant reduction in page cache utilization for Star Tree queries (30x for 400 count query for example).

Indexing

Given that the aggregation now happens during the indexing, there are tradeoffs w.r.t. storage, indexing throughput, cumulative times during indexing etc. which are captured below.

Metric	Plain indexing (Baseline)	Star Tree indexing (Minutely)	Percentage
Store size	18.47 GB	18.55 GB	0.45%
Cumulative indexing time of primary shards	129.13 min	131.58 min	-1.82%
Median throughput [ Index append ]	348765.32 docs/s	342881.75 docs/s	-1.71%
50% latency [ Index append ]	98.18 ms	98.24 ms	+0.06%
90% latency [ Index append ]	131.40 ms	131.17 ms	-0.17%
99% latency [ Index append ]	513.44 ms	569.45 ms	-9%
Cumulative merge time of primary shards	34.18 min	35.59 min	-3.96%
Cumulative refresh time of primary shards	4.00 min	4.61 min	-13%
Cumulative flush time of primary shards	6.71 min	7.31 min	-8.25%

Observations

The write threads were occupied on flush flows more during indexing with Star Tree enabled. So there is some impact on the indexing throughput and latency (mainly tail latency) because of that.
The sort and aggregation part of segment records is quite heavy, hence its done off heap and it scales linearly to number of segment records during flush.
During merge, the pre-aggregated docs are reused to reconstruct the star tree instead of original docs, hence it is less expensive compared to flush.

Store size with respect to cardinality

Following captures store size with respect to cardinality of timestamp (as timestamp has one of the highest cardinalities and is present in all scenarios)

Store size without Star Tree - 16.6 GB
Star Tree index size (sti, std and stm combined)
- +2 MB - Hourly rounded timestamp [0.01%]
- +60 MB - Minutely rounded timestamp [0.4 % ]
- +800 MB - Timestamp ( second ) [ 5% ]

Hence super high cardinality fields must be avoided as part of Star Tree index.

andrross commented 6 months ago

Only a 1000x performance improvement? :) These are awesome results @bharath-techie! One minor note, I'd encourage you to link any prototype code you have on your user fork in case folks want to dive into the implementation.

Star tree index will be created only on append-only indices, updates or deletes will not be supported.

I'm sure there are lots of deep technical issues to dive into here, but I just want to highlight that to my knowledge we don't actually have a true "append-only index" abstraction in OpenSearch today. Data streams are partially that: only "create" operations are allowed when referencing the data stream entity. However, the underlying indexes that make up a data stream are still exposed to users and allow updates and deletes. I'm not sure if that is more of a side effect of the implementation, or if users truly need the option to update/delete documents on a limited basis even for a log analytics-style workload.

msfroh commented 6 months ago

I'm sure there are lots of deep technical issues to dive into here, but I just want to highlight that to my knowledge we don't actually have a true "append-only index" abstraction in OpenSearch today. Data streams are partially that: only "create" operations are allowed when referencing the data stream entity. However, the underlying indexes that make up a data stream are still exposed to users and allow updates and deletes. I'm not sure if that is more of a side effect of the implementation, or if users truly need the option to update/delete documents on a limited basis even for a log analytics-style workload.

We should be able to apply the optimization on a per-segment basis, and fall back to the classic behavior on segments with deletes, right @bharath-techie? Hopefully not all segments would have deletes. Hopefully unmerged deletes would be a transient issue (especially now that it's safe to expunge deletes).

Failing that, I kicked off a thread on lucene-dev to discuss the idea of making deletes cheaply iterable. In that case, we could use the star tree to get the aggregations ignoring deletions, then step through the deleted docs to decrement them from the relevant buckets.

bharath-techie commented 6 months ago

Hi @andrross @msfroh , Thanks for the comments. Once star tree has already pre-computed the aggregations , we go through the star tree index to fetch results for the relevant queries. So document deletions post that are not accounted. Hence results will be inaccurate and that's the current limitation.

We can consider both approaches mentioned by @msfroh to address this gap

Use traditional aggregations if segment has deletes and use star tree aggregations if otherwise.
- This will work but it might not be a trivial implementation , so we can probably take this up as a follow up after initial implementation.
Iterating through deleted docs and negating in the relevant buckets
- This too works well and only issue I see is cost of recomputing is directly relative to the number of deleted docs. If the number is quite high, then it'll cause slowdowns.
- We also need to check if certain type of aggregations can be derived with the negated data. Initial supported set ( sum, avg, count, max, min ) can be derived. Other to-be explored aggregations need to be checked on case by case basis.

Updated the POC code links in prototype section.

msfroh commented 6 months ago

Out of curiosity, @bharath-techie, how does this compare to using a multidimensional points field?

Your first dimension could be status code and your second dimension could be the timestamp. Traversing the BKD tree, I think you could also get some counts quickly. (Not as quickly as in the 1-D case, since you'd need to visit more "partial" leaves, but you could probably fix that by quantizing/rounding the timestamp dimension as you did here.).

backslasht commented 6 months ago

I'm sure there are lots of deep technical issues to dive into here, but I just want to highlight that to my knowledge we don't actually have a true "append-only index" abstraction in OpenSearch today. Data streams are partially that: only "create" operations are allowed when referencing the data stream entity. However, the underlying indexes that make up a data stream are still exposed to users and allow updates and deletes. I'm not sure if that is more of a side effect of the implementation, or if users truly need the option to update/delete documents on a limited basis even for a log analytics-style workload.

That's timely! @mgodwan and I were recently discussing about context based/aware indices where the restriction of no updates can be enforced. More details can be found at https://github.com/opensearch-project/OpenSearch/issues/12683.

bharath-techie commented 6 months ago

Out of curiosity, @bharath-techie, how does this compare to using a multidimensional points field?

Based on tests against 1d points field [ status aggregation for example ] , BKD traversal is quite fast , the high latency is always because of the number of resultset documents that OpenSearch needs to aggregate upon.

So I suspect that the same will be applicable for the 2d points as well.

gashutos commented 6 months ago

That's timely! @mgodwan and I were recently discussing about context based/aware indices where the restriction of no updates can be enforced. More details will be added in the RFC (TODO: link it here once created).

Segment merge for star tree index is light weight. Since it has to do summation/substraction mostly. For deleted documents, ca we create another deleted star tree index ? We can do frequent merged on that index every couple of minutes since it is light weight. @backslasht @mgodwan @bharath-techie what do you think on that ? I am strictly speaking form the view of star tree here.

bharath-techie commented 6 months ago

Segment merge for star tree index is light weight. Since it has to do summation/substraction mostly. For deleted documents, can we create another deleted star tree index ?

The cost depends on the cardinality of the dimensions and the number of deleted documents in the segment. In worst case scenarios, 'deleted star tree index' creation might take up a lot of time. Also I'm a bit unclear on where we'll create the new ST index , as we can't modify the existing segments ?

Deletes are also discussed here , one of the approaches too has similar cons , when the number of deleted docs is too high.

msfroh commented 5 months ago

Based on tests against 1d points field [ status aggregation for example ] , BKD traversal is quite fast , the high latency is always because of the number of resultset documents that OpenSearch needs to aggregate upon.

For 1D points, we can aggregate via Weight.count(), which is what we've done to improve performance of date histograms (see https://github.com/opensearch-project/OpenSearch/issues/9310). It doesn't need to collect the individual documents (except over the two "edge leaves"). So, it mostly does reduce to a BKD traversal (plus up to 2 leaf traversals, visiting at most 1024 docs).

So I suspect that the same will be applicable for the 2d points as well.

For 2D points, Weight.count() can't guarantee O(1) computation, there's no guarantee that the number of "edge leaves" is small (since you're not talking about leaves who intersect the endpoints of your range, but rather the number of rectangles that intersect the perimeter of your rectangle). Still, we might get good performance by summing up counts from the middle of the rectangle and then visiting the leaves that overlap the perimeter.

The difference with the star tree seems to be that the star tree is precomputed to define the boundaries of individual rectangles (e.g. quantizing by hour). I'm wondering if we could inject some quantization (i.e. potential additional levels) into the BKD structure to achieve the same thing.

bharath-techie commented 5 months ago

For 2D points, Weight.count() can't guarantee O(1) computation, there's no guarantee that the number of "edge leaves" is small (since you're not talking about leaves who intersect the endpoints of your range, but rather the number of rectangles that intersect the perimeter of your rectangle). Still, we might get good performance by summing up counts from the middle of the rectangle and then visiting the leaves that overlap the perimeter. I'm wondering if we could inject some quantization (i.e. potential additional levels) into the BKD structure to achieve the same thing.

Thanks @msfroh for detailed context. I'll do one round of benchmarks with 2d points with status and rounded off timestamp and capture how it fares.

bharath-techie commented 5 months ago

Hi @msfroh , Ran some benchmarks with 2d points and here are some observations :

In cases where it worked - ( count ) , it was indeed really fast. For 5 million docs , it took around 1 millisecond for hourly timestamp + status points field.
But if we need to further try any other aggregations than count, such as sum, max, avg etc , then query time scales linearly with the number of result documents.

Issues with count shortcut :

The code prevents from calculating count for anything other than 1d point fields. ( or ) if the min,max in query is same as the complete range of points field ( min to max )

So, I removed this restriction and tested the weight.count for 2d points field which combines status and timestamp.

For some of the min,max ranges I also got the following error. ( code ref ) This was random, not able to pinpoint what the issue was.

java.lang.UnsupportedOperationException: This IntersectVisitor does not perform any actions on a docID=115000 node being visited
at __randomizedtesting.SeedInfo.seed([F11BEF308C1F9A52:51840C6DDA7A6507]:0)
at org.apache.lucene.search.PointRangeQuery$1$5.visit(PointRangeQuery.java:425)
at org.apache.lucene.util.bkd.BKDReader$BKDPointTree.visitDocValuesWithCardinality(BKDReader.java:826)
at org.apache.lucene.util.bkd.BKDReader$BKDPointTree.visitDocValues(BKDReader.java:610)
at org.apache.lucene.util.bkd.BKDReader$BKDPointTree.visitLeavesOneByOne(BKDReader.java:595)
at org.apache.lucene.util.bkd.BKDReader$BKDPointTree.visitDocValues(BKDReader.java:589)

msfroh commented 5 months ago

Ahh... I was thinking of modifying the BKD tree a little in order to guarantee that the count could be computed without needing to traverse the documents in individual leaf nodes. It would be a little like your precomputed star-tree (including needing to decide ahead what values you want), but I think it could use an existing index data structure.

Basically, I think it should be possible to inject quantized (daily, hourly) levels into the BKD tree so that a node exactly spans one day/hour. Then counting wouldn't require visiting any leaf nodes. Similarly, the "other" dimensions could split children on value boundaries (i.e. create subtrees where all children have the same value).

bharath-techie commented 5 months ago

Hey @msfroh , Thanks for the inputs.

We were already doing the quantization in the previous experiment. When we add a document, we add a 2d point field rounding off the timestamp to the hour along with the status.
In order to avoid visiting doc values, we need to query the node containing the unique dimensions (so essentially query should be satisfied only by nodes with ‘CELL_INSIDE_QUERY’). I think this can be achieved by not limiting the recursion based on the ‘maxPointsInLeafNode’ parameter (512) and instead doing it based on the doc values containing the unique dimensions during indexing.

Let me know if I’m misunderstanding this ^

But what I notice is that, the splitting in BKD tree is done on dimension with the largest span (max - min). So when we introduce multiple dimensions into BKD, and user queries for the subset of the dimensions with smaller span, the number of nodes that gets visited becomes quite high. With new recursion logic, there will be high number of nodes depending on the cardinality of the dimensions. For example, if we consider 2D point of hour and status and if user queries for ‘200’ across all timestamps, then majority of the nodes get visited.

In the star tree, ‘star’ nodes helps in completely skipping the dimensions which are not present as part of the query, thus enabling efficient pruning. Also there are other advantages such as to support for term/multi term queries.

Let me know your thoughts.

msfroh commented 5 months ago

Ahh... thanks @bharath-techie! That makes sense.

In terms of deciding what values to precompute + store, I can see how it would be hard to predict before building out an index (or at least it would be extremely limiting if you need to know ahead of time).

As I commented over on the linked Lucene issue, would it make sense to approach this like a caching problem? In particular I'm wondering if we could identify dimensions from a set of requests, compute (and persist) a star tree for existing segments, and subsequently compute star trees on segment flush/merge. As a starting point, we (OpenSearch) could introduce an API that could take an explicit set of dimensions. What do you think?

bharath-techie commented 5 months ago

Thanks again for the comments @msfroh. Yeah I think its a really good idea to use star tree as cache / side car index for the original index based on explicit set of dimensions.

We are currently working on a POC ( RFC post that ) to define 'composite field' in OpenSearch which can take set of fields as dimensions along with metric / field combinations. The underlying backing index for the field could be star tree to start with and later any other such index depending on use case.

W.r.t supporting star tree for existing segments, since segments are immutable, how do we add any files post segment creation - is there any existing concept we can refer to ?

msfroh commented 5 months ago

W.r.t supporting star tree for existing segments, since segments are immutable, how do we add any files post segment creation - is there any existing concept we can refer to ?

My thinking was that we could compute and write a side-car star tree, where the segment's ID would be a "key" to identify the tree. So, e.g. if we have segment _5g, we could compute a star tree for it and name it _5g's star tree.

Of course, if we need to support deletes, it wouldn't necessarily be tied to a segment, but rather a SegmentCommitInfo, kind of like live docs (which are identified by a segment generation and a commit generation).

bharath-techie commented 5 months ago

Thanks that makes sense. Let me explore this a bit more and come up with pros and cons.

bharath-techie commented 4 months ago

Evaluated the idea of side car index (outside associated Lucene segment ) but maintaining two separate readers/writers etc will be an additional overhead. Sorting/aggregation while building star tree also will be quite resource intensive if performed on larger segments. So it'll be hard to maintain real-time sync with original index.

But, creating star tree index for existing segments is a pretty good use case, we can again reevaluate and think of better approaches in later cycles.

Meanwhile, I have updated the Lucene Github issue expanding the proposal for new format which can create multi-field indices based on existing fields (DocValues) as part of indexing flow.

bharath-techie commented 4 months ago

To have initial implementation as an experimental feature, and to use existing interfaces of Lucene without getting blocked on new changes , I propose following implementation in OpenSearch.

Custom format

We can extend PerFieldMappingPostingFormatCodec or introduce a similar codec with custom implementation for PerFieldDocValuesFormat

public class PerFieldMappingPostingAndDVFormatCodec extends Lucene99Codec {
    private final Logger logger;
    private final MapperService mapperService;
    private final DocValuesFormat starTreeFormat = new StarTreeDocValuesFormat();

    @Override
    public DocValuesFormat getDocValuesFormatForField(String field) {
        if ( this field is part of star tree configs ( one of dims / metrics of the fields )  )
                return new StarTreeDocValuesFormat();

        return super.getDocValuesFormatForField(field);
    }
}

public class StarTreeDocValuesFormat extends DocValuesFormat {

    public StarTreeDocValuesFormat() {
        super("startree");
    }

    @Override
    public DocValuesConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
        return new StarTreeDocValuesWriter(state.segmentInfo.getCodec().docValuesFormat().fieldsConsumer(state), state);
    }

    @Override
    public DocValuesProducer fieldsProducer(SegmentReadState state) throws IOException {
        return new StarTreeDocValuesReader(state.segmentInfo.getCodec().docValuesFormat().fieldsProducer(state), state);
    }
}

Writer changes

This includes changes for building star tree as part of ‘flush’ and ‘merge’ through custom doc values consumer.

Flush : The idea is to have a check as part of each doc values field’s flush , if we can build star tree using the indexed fields so far. If so we will build star tree.

We can optimize this further to call build star tree after all DocValues fields are indexed as well ( based on state.fieldInfos )

public class StarTreeDocValuesWriter extends DocValuesConsumer {

   Map<String, SortedNumericDocValues> numericDocValuesReaders = new HashMap();
   Map<String, SortedSetDocValues> sortedSetDocValuesReaders = new HashMap();

   public StarTreeDocValuesWriter(
        DocValuesConsumer delegate, SegmentWriteState segmentWriteState) {

        // Fetch star tree fields from segmentinfo attributes map
        List<StarTreeField> starFields = getStarTreeFields(segmentWriteState.segmentInfo.getAttributes());
   }

   /**
   * Flush related changes
   **/

   @Override
    public void addSortedNumericField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
        delegate.addSortedNumericField(field, valuesProducer);
        numericDocValuesReaders.put(field.name, valuesProducer.getSortedNumeric(field));
        for(StarTreeField starField : starFields) {
            if(!isMerge && canBuildStarTree(starField)) {
                buildStarTree(starField);
            }
        }
    } 

   @Override
    public void addSortedSetField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
        delegate.addSortedSetField(field, valuesProducer);
        textDocValuesReaders.put(field.name, valuesProducer.getSortedSet(field));
        for(StarTreeField starField : starFields) {
            if(!isMerge && canBuildStarTree(starField)) {
                buildStarTree(starField);
            }
        }
    }

    private void canBuildStarTree(StarTreeField starField) {
        // if all the dimensions fields and metrics fields of this star tree field 
       // are present as part of 'text/numericDocValuesReaders', then return 'true'
       // as we will be able to build star tree using these indexed values
    }

    private void buildStarTree(StarTreeField field) {
        // Build star tree
        // Mark this star tree field 'Indexed' for this field
    }

    /**
    * Merge related changes
    **/
    @Override
    public void merge(MergeState mergeState) throws IOException {
        // marking this flag as true as merge also calls "add..Field" methods
        // and we have a different way to buid star tree for merge
        isMerge = true;
        super.merge(mergeState);
        isMerge = false;
        createStarTreeMerge(mergeState);
    }

}

Reader changes

We can have an interface which gives the ‘ReadStarTree’ capability to the custom doc values readers.

/**
* DocValues readers containing custom logic for star tree can implement this interface
**/
public interface StarTreeReader {
    StarTreeAggregatedValues getStarTreeValues(String field) throws IOException;
}

/**
* Custom doc values reader for star tree also implementing StarTreeReader
**/
public class StarTreeDocValuesReader extends DocValuesProducer implements StarTreeReader {

    public StarTreeDocValuesReader() {
        // Read/Initialize all doc values fields
        // Read/Initialize all star tree fields 
    }
    .... all docValuesMethods ....

    /**
    * The problem of how to access the above star tree fields will be solved
    * via this method
    **/
    @Override
    public StarTreeAggregatedValues getStarTreeValues(String field)
        throws IOException {
        return getStarTreeValues(field);
    }
}

Query / Aggregation changes

We will get the ‘DocValuesReader’ from the segment and during ‘StarTreeQuery/Aggregation’ , we can check if its of type ‘StarTreeReader’ and if so, we will be able to read the associated data structures.

public class StarTreeQuery extends Query  {
    @Override
    public Scorer scorer(LeafReaderContext context) throws IOException {
        SegmentReader reader = Lucene.segmentReader(context.reader());

        // We get the 'StarTreeReader' instance so that we can get StarTreeValues

        if(!(reader.getDocValuesReader() instanceof StarTreeReader)) return emptyScorer();

        StarTreeReader starTreeDocValuesReader = (StarTreeReader) reader.getDocValuesReader();
        StarTreeAggregatedValues starTreeValues = starTreeDocValuesReader.getStarTreeValues(field);

        // Custom logic using above starTreeValues

    }   
}

public class StarTreeAggregator extends BucketsAggregator {

    @Override
    protected LeafBucketCollector getLeafCollector(LeafReaderContext ctx, LeafBucketCollector sub) throws IOException {
        SegmentReader reader = Lucene.segmentReader(ctx.reader());

        if(!(reader.getDocValuesReader() instanceof StarTreeReader)) return null;
        StarTreeReader starTreeDocValuesReader = (StarTreeReader) reader.getDocValuesReader();
        StarTreeAggregatedValues starTreeValues = starTreeDocValuesReader.getStarTreeValues(field);

        // Collect using starTree's doc values
    }
}

POC code

mgodwan commented 4 months ago

Thanks @bharath-techie for sharing this approach. I think this provides a clear way to achieve this within OpenSearch. In order to support aggregations and queries using the existing interfaces, we will need changes in both the query rewriting, and aggregations framework I guess. Can we come up with a high level idea of the changes required?

Tagging @shwetathareja @msfroh @backslasht @jainankitk as well to gather thoughts on this.

msfroh commented 4 months ago

@bharath-techie -- I need to dig into your POC code, but I really like this approach to let us fast-track the feature in OpenSearch while we work on adding it in Lucene.

jainankitk commented 4 months ago

@bharath-techie - Thank you for writing this detailed proposal. I really like the scale of potential improvements with this index, but also wondering if the use case is generic enough to warrant separate index. I might be wrong, since yet to go through the paper, but some considerations:

While not having current deleted documents limitation is okay, the star tree index cannot have any deletes over the lifetime. Because once a segment has deletion, the star tree is invalid, and there is no good merge path to fix it. Some of the other optimizations for date histogram / terms aggregation also don't work for segments with deleted documents, but if the deleted documents are expunged upon merge, it can again be leveraged. Isn't it possible to account for deleted document within the star tree index during segment merge? Also, are we planning to have setting for disabling any updates/deletes completely if the customer has star tree index enabled?
Opensearch has many different ways for slicing the buckets for aggregations (weekly, quarterly, auto date histogram where the boundary might not be minute, hourly granularity), hence I not sure if minute/hour/day/year suffice? Also, we have done lot of improvements with aggregation use cases already. For example - hourly aggregation is already 100x faster. So, would love to see the benchmark with latest version 2.14 is possible.
We already have 4 dimensions for date which is common field in most documents, the moment I add few other fields the size of tree blows up and gains start reducing. Also, if the top level is status code and I want to do aggregation on just year, I need to aggregate multiple tree branches, which still might be less than iteration over all the documents, but a lot. Probably the reason hourly histogram is only 15x (lol! :P), daily and yearly histogram would be even less performant.

That being said, this is really good stuff, given it is not default (customers need to explicitly enable it) and there would be some stats/multi-dimensional aggregation workloads.

bharath-techie commented 3 months ago

Thanks for the detailed feedback @jainankitk. Let me try to address some of them.

Isn't it possible to account for deleted document within the star tree index during segment merge?

We've discussed here about some possible approaches for this. It has a cost attached to it, we need to see which is better overall.

Also, are we planning to have setting for disabling any updates/deletes completely if the customer has star tree index enabled?

There is a RFC which discusses this.

I not sure if minute/hour/day/year suffice

This is a just a case for POC, we can extend it to all buckets search currently supports. For instance, Hour bucket in POC gives same buckets as hourly histogram, that was the intention which can be extended to all the other granularities.

We already have 4 dimensions for date which is common field in most documents, the moment I add few other fields the size of tree blows up and gains start reducing

Good point. We've not locked in the implementation for date - we could also go with just supporting say 'minute' and then deriving higher time aggregations during query time if its performant and also we can have multiple star trees for different time histograms. The size/performance of tree depends more on cardinality of fields more compared to number of fields. (more fields with less cardinality still worked well during our benchmarks ) But your point is valid , we will try to come up with approx. estimates of the resultant size of star tree and performance based on cardinality and number of fields. ( This is something I got as feedback during multiple discussions as well )

For example - hourly aggregation is already 100x faster. So, would love to see the benchmark with latest version 2.14 is possible.

I did another round of benchmarks on 2.14 - date histograms are indeed blazing fast 🔥 But when we couple other aggregations together - say with status terms or sum aggs etc, then its again slow as the performance is proportional to number of documents collected/traversed via DocValues. This is discussed to some extent previously in the same RFC (as part of weight.count optimization )

Benchmarks done on Amazon AWS EC2 instance of r5.4x large on 2.14

Plain query

Query (aggregation)	Median throughput	90th percentile latency
Hourly histogram with status terms	0.21 ops/s	4.75 seconds
Monthly histogram with status terms	0.28 ops/s	3.65 seconds
Yealy histogram with status terms	0.28 ops/s	3.65 seconds

Star tree query

Config : minute, hour, month, year, clientIp( 1.1 million cardinality ), status

Query (aggregation)	Median throughput	90th percentile latency
Hourly histogram with status terms	3.18 ops/s	329 ms
Monthly histogram with status terms	3.46 ops/s	306 ms
Yealy histogram with status terms	3.51 ops/s	299 ms

Our opensearch/lucene queries are already so optimized, but for most of the complex aggregations, efficiency will be capped based on the amount of docs to traverse/collect, which the star tree solves.

The RFC so far doesn't capture results for terms / multi-terms aggregations ( need to address some gaps like top n aggs, handling high cardinality terms etc ), the cases we found to be the most performant .

jainankitk commented 3 months ago

@bharath-techie - Thank you for patiently addressing my comments and sharing some numbers with multi-field aggregation.

Good point. We've not locked in the implementation for date - we could also go with just supporting say 'minute' and then deriving higher time aggregations during query time if its performant

We can't go about storing just 'minute' right? Because then it needs to be separate bucket for every minute for different hour, day, year. I might be wrong since my understanding of star tree is limited.

and also we can have multiple star trees for different time histograms.

Yeah, this is an option. But again, we need to be careful about how much are we asking the customers to configure. Ideally, we would like to have workload tailored indices for most optimal performance, but we choose BKD, term dictionaries because they work for most use cases and customer need not worry about them

Our opensearch/lucene queries are already so optimized, but for most of the complex aggregations, efficiency will be capped based on the amount of docs to traverse/collect, which the star tree solves.

+1. Iterating through each of the documents is indeed very slow! :(

This is a just a case for POC, we can extend it to all buckets search currently supports. For instance, Hour bucket in POC gives same buckets as hourly histogram, that was the intention which can be extended to all the other granularities.

Maybe we can try POC covering some or most of the date histogram aggregations.

bharath-techie commented 3 months ago

We can't go about storing just 'minute' right? Because then it needs to be separate bucket for every minute for different hour, day, year.

Not sure I fully understand but to elaborate on the idea - during query time, we can get docs of all the nodes of 'minute' dimension and during collection of the docs as part of 'aggregation' we bucketize it as per the histogram requested. The tradeoff is higher query latency but user can save storage space. ( as we need to go through all 'minute' dimension nodes and hence collect/traverse higher number of docs )

we need to be careful about how much are we asking the customers to configure

Ya +1 , we will integrate with templates initially and also other such mechanisms to suggest the best configuration to the users. And multiple star trees is something we will incrementally expose to customers with base functionalities already built to support them.

we can try POC covering some or most of the date histogram aggregations.

Do you see any challenges - If rounding off timestamp while building star tree index is of same logic as of query, we should be able to do it right ?

jainankitk commented 3 months ago

Not sure I fully understand but to elaborate on the idea - during query time, we can get docs of all the nodes of 'minute' dimension and during collection of the docs as part of 'aggregation' we bucketize it as per the histogram requested.

I am assuming you want to have, say only status and minute dimensions from above example. In that case, the buckets for minute dimension cannot be only 0-59 or 1-60. Because you need to further qualify each of those minutes into hours/day. Else, how can I answer query like minutely aggregation on range query from 2024-01-01 to 2024-01-02 without iterating over all the documents?

bharath-techie commented 3 months ago

In that case, the buckets for minute dimension cannot be only 0-59 or 1-60.

Ah I get your question , I'm not making the minute to 0-59 but rather I'm rounding off the epoch millis/seconds to epoch minute. We'll again be able to round off epoch minute to epoch hour or higher time dimensions during aggregation and bucketize as per the query.

Code reference in star tree

Rounding directly based on this in datehistogram . In fact we can try reusing this same class if possible.

Apart from this, I think there is a way to give fixed interval as well in histogram, which we'll need to evaluate.

jainankitk commented 3 months ago

Ah I get your question , I'm not making the minute to 0-59 but rather I'm rounding off the epoch millis/seconds to epoch minute.

But doesn't make the cardinality "too high". Just thinking about few months/yearly log data, the cardinality becomes ~100k for 2 months

Rounding directly based on this in datehistogram . In fact we can try reusing this same class if possible.

Rounding is slightly expensive operation, still much faster than the individual document iteration though. Maybe we can have multi range traversal logic on star tree similar to https://github.com/apache/lucene/issues/13335, assuming we can traverse the minutely rounded epochs in sorted order

sarthakaggarwal97 commented 3 months ago

But doesn't make the cardinality "too high". Just thinking about few months/yearly log data, the cardinality becomes ~100k for 2 months

A lot of users with timeseries type workload make a per day index, something like logs-24-11-2024. Since each index will have it's own star tree for the set of dimension/metrics, it will come down to the query part to handle these aggregation. Also, I feel with minutes is one option, but we can explore different ideas to handle queries at different time granularity.

I think from where @bharath-techie is coming from, we can still sort of control the cardinality if, in the dimension split order, we are keeping the timestamp at a higher level than other dimensions.

Something like this: timestamp -> status -> client_ip. If we do it the other way round (timestamp last), we would see a lot more branching.

bharath-techie commented 3 months ago

Yes +1 , choosing the right dimensionSplitOrder controls the storage size.

But most importantly, in a particular segment - what will be the timestamp's cardinality ? For log/time series use cases, as long as we have efficient merge policies ( example : context aware segments ) , most likely we will only have few hours of data in a segment. So even if an index spans several months, in each segment cardinality of the timestamp field will be quite less.

Maybe we can have multi range traversal logic on star tree similar to https://github.com/apache/lucene/issues/13335,

Right now 'star tree' will act as the lead iterator for the query - so far in the design , we have star tree + doc values. So once star tree gives a set of documents for the dimensions which are in tree, we further filter out in doc values if there are any other dimensions in filter. ( not part of star tree as per maxLeafDocs threshold )

Sometime back, we evaluated the idea of using point tree as well for range queries alongside star tree but point tree works well as a lead iterator rather right ?

jainankitk commented 3 months ago

Something like this: timestamp -> status -> client_ip. If we do it the other way round (timestamp last), we would see a lot more branching. Yes +1 , choosing the right dimensionSplitOrder controls the storage size.

I am not completely sure what the right branching is, especially between timestamp/status. For example, if we have uniform distribution of status across timestamps, won't we have equal number of buckets irrespective of the order of timestamp/status? Also, maybe status first is better idea because you can quickly answer the count queries on just status code? Else you need to go through all the timestamp values?

Sometime back, we evaluated the idea of using point tree as well for range queries alongside star tree but point tree works well as a lead iterator rather right ?

Did we evaluate for the case of query and aggregation being on same field or even different ones?

But most importantly, in a particular segment - what will be the timestamp's cardinality ? For log/time series use cases, as long as we have efficient merge policies ( https://github.com/opensearch-project/OpenSearch/issues/13183 ) , most likely we will only have few hours of data in a segment.

Sorry my bad, overlooked the generation being at segment level while doing my analysis.

bharath-techie commented 3 months ago

Also, maybe status first is better idea because you can quickly answer the count queries on just status code? Else you need to go through all the timestamp values?

The paper highly recommends that we order the dimensions from highest cardinality to lowest cardinality for better pruning during query.

That said, we did benchmarks with both during POC , with status first we noticed significantly higher storage size. ( > 70% if I remember correctly , since timestamp which has high cardinality is present across each status)

And from query perspective , status first indeed performed better for status specific aggs [without any range filter ] but for range queries and date histogram we either found timestamp first to be faster or within a 5-10% range off from status first performance.

Query	Status first throughput	Timestamp first throughput
200s in range	91.74 ops/s	134.45 ops/s
400s in range	154.12 ops/s	139.19 ops/s
400s-agg	186 ops/s	112.82 ops/s
200s-agg	177 ops/s	85.45 ops/s

This is something we will continue to tune based on benchmarks / load testing before we rollout to production.

Note : the OSB benchmarks provide out of order data , so timestamp has super high cardinality in all segments. Probably when timestamp has lower cardinality similar to production time-series workloads or when coupled with context aware segments for example, we ideally can put the user dimensions at top and timestamp columns below it.

Did we evaluate for the case of query and aggregation being on same field or even different ones?

We usually evaluate with query and aggs being on different fields, but I get where you're coming from based on histogram improvements. We just bounced off ideas, didn't get around to any implementation. Because we found star tree traversal to be decently fast - on a query on multiple dimensions especially - the latency again mainly came from doc values collection.

jainankitk commented 3 months ago

This is something we will continue to tune based on benchmarks / load testing before we rollout to production.

Agreed. Needs more experimentation.

Note : the OSB benchmarks provide out of order data , so timestamp has super high cardinality in all segments. Probably when timestamp has lower cardinality similar to production time-series workloads or when coupled with context aware segments for example, we ideally can put the user dimensions at top and timestamp columns below it.

Good point, I keep forgetting about that. I am assuming that is because the sorted input data is partitioned to multiple threads?

bharath-techie commented 3 months ago

Good point, I keep forgetting about that. I am assuming that is because the sorted input data is partitioned to multiple threads?

Yes that's right. But, looks like folks run multiple instances of benchmark with numclients:1 to bypass this limitation. But so far all benchmarks are done without such modifications.

bharath-techie commented 3 months ago

We are starting with implementation of this RFC and the meta issue tracks the list of issues and milestones planned for near future.

Please feel free to contribute / review / comment on this.

opensearch-project / OpenSearch

[RFC] Pre Compute Aggregations with Star Tree index #12498

Is your feature request related to a problem? Please describe

Existing solutions

Aggregation framework

Index rollups / Index transforms

Describe the solution you'd like

Star tree index

Improvements with Star Tree index

Related component

Describe alternatives you've considered

Additional context

Prototype Details

Star Tree indexing

Star tree index structure

Creation Flow

Merge Flow

Lucene index files

Star Tree query

POC code

Benchmark Results

OpenSearch Setup

Benchmark Setup

Workload

Star Tree config

Dimensions

Metric

Search (primarily Aggregations)

Observations

Indexing

Observations

Store size with respect to cardinality

Custom format

Writer changes

Reader changes

Query / Aggregation changes

Plain query

Star tree query