bharath-techie opened this issue 6 months ago
For the prototype, `DocValuesFormat` in Lucene is extended to create a Star Tree index with support for numeric fields as the dimensions of the Star Tree. The timestamp field is rounded off to various epoch granularities such as 'minute', 'hour', 'day', etc., and these are also added as dimensions to the Star Tree.
Fig: Star Tree Index Structure
The Star Tree index structure, as portrayed in the figure above, consists of two main parts: the Star Tree itself, and sorted plus aggregated documents backed by doc value indices.
Each node in the Star Tree points to a range of documents.
There are special nodes called 'star nodes' which help in skipping non-competitive nodes and in fetching aggregated documents, wherever applicable, during query time.
The figure contains three examples explaining star tree traversal during a query: i) average request size for all requests with hour 11 and status 200, ii) count of requests for status = 200 (across all hours, hence hour is set to *), and iii) average request size of all requests for hour 12 (status is set to *). Examples (ii) and (iii) use star nodes.
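The three traversal examples above can be sketched with a toy in-memory star tree. This is purely illustrative, not the actual OpenSearch/Lucene code: the `-1` star sentinel, the key encoding, and all the numbers are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Toy star tree lookup over two dimensions (hour, status).
// A star node (STAR) stands for "any value of this dimension" and is
// backed by a pre-aggregated document built at index time.
public class StarTreeToy {
    static final int STAR = -1; // hypothetical sentinel for a star node

    // (hour, status) -> {count, sumRequestSize}, pre-aggregated at index time
    static final Map<String, long[]> AGG = new HashMap<>();

    static String key(int hour, int status) { return hour + "|" + status; }

    static { // pre-aggregated documents, including star-node rows
        AGG.put(key(11, 200), new long[] {4, 400});
        AGG.put(key(12, 200), new long[] {2, 300});
        AGG.put(key(12, 500), new long[] {1, 100});
        AGG.put(key(STAR, 200), new long[] {6, 700}); // status=200 across all hours
        AGG.put(key(12, STAR), new long[] {3, 400});  // hour=12 across all statuses
    }

    static long avgRequestSize(int hour, int status) {
        long[] agg = AGG.get(key(hour, status));
        return agg[1] / agg[0]; // sum / count
    }

    static long count(int hour, int status) {
        return AGG.get(key(hour, status))[0];
    }

    public static void main(String[] args) {
        System.out.println(avgRequestSize(11, 200));  // example (i):   100
        System.out.println(count(STAR, 200));         // example (ii):  6
        System.out.println(avgRequestSize(12, STAR)); // example (iii): 133
    }
}
```

Queries (ii) and (iii) never touch the per-hour or per-status rows; the star-node row answers them in a single lookup, which is the pruning the examples describe.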
Star Tree index is created during Lucene ‘flush’ operation.
TODO: Create an issue with Lucene on the new formats.
https://github.com/bharath-techie/OpenSearch/commit/1d44e1a7b2d0fe1b98f648eefc5e70ff2de959c2 - OpenSearch code changes
https://github.com/bharath-techie/lucene/commit/c993b9a89ab43e553537c24b40704d392a0a968e - Lucene code changes
Here is a summary of how Star Tree queries fared against aggregation queries on present-day indices (mainly against the Points index).

| Query (aggregation) | Median throughput | 50th percentile service time | Comments |
|---|---|---|---|
| 200s in range count | 8x better performance | 5x faster | 3M matching documents |
| 400s in range count | 40% improvement | 40% faster | 100 matching documents |
| Total 200s count | 1000x better performance | 1000x faster | Best case scenario [1 billion docs] |
| Total 400s count | 1.8x to 2x improvement | 2x improvement | 100k matching docs |
| Hourly histogram | 15x improvement | 15x improvement | |
Given that aggregation now happens during indexing, there are tradeoffs w.r.t. storage, indexing throughput, cumulative indexing times, etc., which are captured below.
| Metric | Plain indexing (baseline) | Star Tree indexing (minutely) | Percentage |
|---|---|---|---|
| Store size | 18.47 GB | 18.55 GB | 0.45% |
| Cumulative indexing time of primary shards | 129.13 min | 131.58 min | -1.82% |
| Median throughput (index append) | 348765.32 docs/s | 342881.75 docs/s | -1.71% |
| 50% latency (index append) | 98.18 ms | 98.24 ms | +0.06% |
| 90% latency (index append) | 131.40 ms | 131.17 ms | -0.17% |
| 99% latency (index append) | 513.44 ms | 569.45 ms | -9% |
| Cumulative merge time of primary shards | 34.18 min | 35.59 min | -3.96% |
| Cumulative refresh time of primary shards | 4.00 min | 4.61 min | -13% |
| Cumulative flush time of primary shards | 6.71 min | 7.31 min | -8.25% |
The following captures store size with respect to the cardinality of the timestamp (as the timestamp has one of the highest cardinalities and is present in all scenarios).
Hence, very high cardinality fields must be avoided as part of the Star Tree index.
Only a 1000x performance improvement? :) These are awesome results @bharath-techie! One minor note, I'd encourage you to link any prototype code you have on your user fork in case folks want to dive into the implementation.
The star tree index will be created only on append-only indices; updates or deletes will not be supported.
I'm sure there are lots of deep technical issues to dive into here, but I just want to highlight that to my knowledge we don't actually have a true "append-only index" abstraction in OpenSearch today. Data streams are partially that: only "create" operations are allowed when referencing the data stream entity. However, the underlying indexes that make up a data stream are still exposed to users and allow updates and deletes. I'm not sure if that is more of a side effect of the implementation, or if users truly need the option to update/delete documents on a limited basis even for a log analytics-style workload.
We should be able to apply the optimization on a per-segment basis, and fall back to the classic behavior on segments with deletes, right @bharath-techie? Hopefully not all segments would have deletes. Hopefully unmerged deletes would be a transient issue (especially now that it's safe to expunge deletes).
Failing that, I kicked off a thread on lucene-dev to discuss the idea of making deletes cheaply iterable. In that case, we could use the star tree to get the aggregations ignoring deletions, then step through the deleted docs to decrement them from the relevant buckets.
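That decrement idea could look roughly like the following toy sketch (hypothetical, not tied to any Lucene API): bucket counts come from the star tree ignoring deletions, then each deleted doc's bucket is decremented.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of "aggregate ignoring deletes, then subtract the deleted docs".
// Here buckets are keyed by status code and hold counts; for metrics like
// sums, the same idea would subtract the deleted doc's value instead.
public class DeleteAdjustedAgg {
    public static Map<Integer, Long> adjust(Map<Integer, Long> starTreeCounts,
                                            int[] deletedDocStatuses) {
        Map<Integer, Long> adjusted = new HashMap<>(starTreeCounts);
        for (int status : deletedDocStatuses) {
            // decrement the bucket the deleted doc belonged to
            adjusted.computeIfPresent(status, (k, v) -> v - 1);
        }
        return adjusted;
    }

    public static void main(String[] args) {
        Map<Integer, Long> counts = new HashMap<>();
        counts.put(200, 10L);
        counts.put(404, 3L);
        // two deleted docs had status 200, one had status 404
        Map<Integer, Long> result = adjust(counts, new int[] {200, 200, 404});
        System.out.println(result.get(200) + " " + result.get(404)); // 8 2
    }
}
```

This only stays cheap if the set of deleted docs is cheaply iterable, which is exactly what the lucene-dev thread proposes.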
Hi @andrross @msfroh, thanks for the comments. Once the star tree has pre-computed the aggregations, we go through the star tree index to fetch results for the relevant queries, so document deletions after that point are not accounted for. Hence results will be inaccurate; that's the current limitation.
We can consider both approaches mentioned by @msfroh to address this gap
Updated the POC code links in prototype section.
Out of curiosity, @bharath-techie, how does this compare to using a multidimensional points field?
Your first dimension could be status code and your second dimension could be the timestamp. Traversing the BKD tree, I think you could also get some counts quickly. (Not as quickly as in the 1-D case, since you'd need to visit more "partial" leaves, but you could probably fix that by quantizing/rounding the timestamp dimension as you did here.)
> I'm sure there are lots of deep technical issues to dive into here, but I just want to highlight that to my knowledge we don't actually have a true "append-only index" abstraction in OpenSearch today. Data streams are partially that: only "create" operations are allowed when referencing the data stream entity. However, the underlying indexes that make up a data stream are still exposed to users and allow updates and deletes.
That's timely! @mgodwan and I were recently discussing about context based/aware indices where the restriction of no updates can be enforced. More details can be found at https://github.com/opensearch-project/OpenSearch/issues/12683.
> Out of curiosity, @bharath-techie, how does this compare to using a multidimensional points field?
Based on tests against a 1D points field (status aggregation, for example), BKD traversal is quite fast; the high latency always comes from the number of result-set documents that OpenSearch needs to aggregate over.
So I suspect the same will apply to 2D points as well.
Segment merge for the star tree index is lightweight, since it mostly has to do summation/subtraction. For deleted documents, can we create another 'deleted' star tree index? We can do frequent merges on that index every couple of minutes since it is lightweight. @backslasht @mgodwan @bharath-techie what do you think?
I am strictly speaking from the view of the star tree here.
> Segment merge for the star tree index is lightweight, since it mostly has to do summation/subtraction. For deleted documents, can we create another deleted star tree index?
The cost depends on the cardinality of the dimensions and the number of deleted documents in the segment. In worst-case scenarios, 'deleted star tree index' creation might take a lot of time. Also, I'm a bit unclear on where we'd create the new star tree index, since we can't modify existing segments.
Deletes are also discussed here; one of the approaches has similar cons when the number of deleted docs is too high.
> Based on tests against a 1D points field (status aggregation, for example), BKD traversal is quite fast; the high latency always comes from the number of result-set documents that OpenSearch needs to aggregate over.
For 1D points, we can aggregate via `Weight.count()`, which is what we've done to improve the performance of date histograms (see https://github.com/opensearch-project/OpenSearch/issues/9310). It doesn't need to collect individual documents (except over the two "edge" leaves). So it mostly reduces to a BKD traversal (plus up to 2 leaf traversals, visiting at most 1024 docs).
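The 1D shortcut described above can be sketched with a toy leaf structure: leaves fully inside the range contribute a stored count in O(1), and only the edge leaves are scanned document by document. Leaf bounds, sizes, and values here are invented for illustration.

```java
// Toy model of the Weight.count() shortcut: per-leaf counts for fully
// covered leaves, doc-by-doc scan only for the (at most two) edge leaves.
public class RangeCountToy {
    public static class Leaf {
        final long min, max;
        final long[] values;
        public Leaf(long min, long max, long[] values) {
            this.min = min; this.max = max; this.values = values;
        }
    }

    public static long count(Leaf[] leaves, long from, long to) {
        long total = 0;
        for (Leaf leaf : leaves) {
            if (leaf.max < from || leaf.min > to) continue;   // fully outside: skip
            if (leaf.min >= from && leaf.max <= to) {
                total += leaf.values.length;                  // fully inside: O(1)
            } else {                                          // edge leaf: scan docs
                for (long v : leaf.values) {
                    if (v >= from && v <= to) total++;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Leaf[] leaves = {
            new Leaf(0, 9, new long[] {1, 5, 9}),
            new Leaf(10, 19, new long[] {10, 15}),
            new Leaf(20, 29, new long[] {20, 25, 29}),
        };
        // [5, 24]: 2 from the first edge leaf, 2 from the middle leaf's
        // stored count, 1 from the last edge leaf
        System.out.println(count(leaves, 5, 24)); // 5
    }
}
```

In 2D, as the next comment explains, the number of edge leaves is no longer bounded by two, which is what breaks the O(1) guarantee.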
> So I suspect the same will apply to 2D points as well.
For 2D points, `Weight.count()` can't guarantee O(1) computation; there's no guarantee that the number of "edge leaves" is small (since you're no longer talking about leaves that intersect the endpoints of your range, but rather the number of rectangles that intersect the perimeter of your rectangle). Still, we might get good performance by summing up counts from the middle of the rectangle and then visiting the leaves that overlap the perimeter.
The difference with the star tree seems to be that the star tree is precomputed to define the boundaries of individual rectangles (e.g. quantizing by hour). I'm wondering if we could inject some quantization (i.e. potential additional levels) into the BKD structure to achieve the same thing.
Thanks @msfroh for the detailed context. I'll do a round of benchmarks with 2D points (status and rounded-off timestamp) and capture how it fares.
Hi @msfroh, I ran some benchmarks with 2D points; here are some observations.
Issues with the count shortcut:
So, I removed this restriction and tested `Weight.count()` for a 2D points field which combines status and timestamp.
```
java.lang.UnsupportedOperationException: This IntersectVisitor does not perform any actions on a docID=115000 node being visited
	at __randomizedtesting.SeedInfo.seed([F11BEF308C1F9A52:51840C6DDA7A6507]:0)
	at org.apache.lucene.search.PointRangeQuery$1$5.visit(PointRangeQuery.java:425)
	at org.apache.lucene.util.bkd.BKDReader$BKDPointTree.visitDocValuesWithCardinality(BKDReader.java:826)
	at org.apache.lucene.util.bkd.BKDReader$BKDPointTree.visitDocValues(BKDReader.java:610)
	at org.apache.lucene.util.bkd.BKDReader$BKDPointTree.visitLeavesOneByOne(BKDReader.java:595)
	at org.apache.lucene.util.bkd.BKDReader$BKDPointTree.visitDocValues(BKDReader.java:589)
```
Ahh... I was thinking of modifying the BKD tree a little in order to guarantee that the count could be computed without needing to traverse the documents in individual leaf nodes. It would be a little like your precomputed star tree (including needing to decide ahead of time which values you want), but I think it could use an existing index data structure.
Basically, I think it should be possible to inject quantized (daily, hourly) levels into the BKD tree so that a node exactly spans one day/hour. Then counting wouldn't require visiting any leaf nodes. Similarly, the "other" dimensions could split children on value boundaries (i.e. create subtrees where all children have the same value).
Hey @msfroh , Thanks for the inputs.
Let me know if I’m misunderstanding this ^
But what I notice is that splitting in the BKD tree is done on the dimension with the largest span (max - min). So when we introduce multiple dimensions into BKD, and the user queries for a subset of the dimensions with a smaller span, the number of nodes that get visited becomes quite high. With the new recursion logic, there will be a high number of nodes, depending on the cardinality of the dimensions. For example, if we consider a 2D point of hour and status and the user queries for '200' across all timestamps, then the majority of the nodes get visited.
In the star tree, 'star' nodes help in completely skipping dimensions that are not present in the query, enabling efficient pruning. There are also other advantages, such as support for term/multi-term queries.
Let me know your thoughts.
Ahh... thanks @bharath-techie! That makes sense.
In terms of deciding what values to precompute + store, I can see how it would be hard to predict before building out an index (or at least it would be extremely limiting if you need to know ahead of time).
As I commented over on the linked Lucene issue, would it make sense to approach this like a caching problem? In particular I'm wondering if we could identify dimensions from a set of requests, compute (and persist) a star tree for existing segments, and subsequently compute star trees on segment flush/merge. As a starting point, we (OpenSearch) could introduce an API that could take an explicit set of dimensions. What do you think?
Thanks again for the comments @msfroh. Yeah, I think it's a really good idea to use the star tree as a cache / side-car index for the original index based on an explicit set of dimensions.
We are currently working on a POC (RFC to follow) to define a 'composite field' in OpenSearch which can take a set of fields as dimensions along with metric/field combinations. The underlying backing index for the field could be a star tree to start with, and later any other such index depending on the use case.
W.r.t. supporting the star tree for existing segments: since segments are immutable, how do we add any files post segment creation? Is there any existing concept we can refer to?
> W.r.t. supporting the star tree for existing segments: since segments are immutable, how do we add any files post segment creation? Is there any existing concept we can refer to?
My thinking was that we could compute and write a side-car star tree, where the segment's ID would be a "key" to identify the tree. So, e.g., if we have segment `_5g`, we could compute a star tree for it and name it `_5g`'s star tree.
Of course, if we need to support deletes, it wouldn't necessarily be tied to a segment, but rather a `SegmentCommitInfo`, kind of like live docs (which are identified by a segment generation and a commit generation).
Thanks that makes sense. Let me explore this a bit more and come up with pros and cons.
Evaluated the idea of a side-car index (outside the associated Lucene segment), but maintaining two separate readers/writers etc. would be additional overhead. Sorting/aggregation while building the star tree will also be quite resource-intensive if performed on larger segments, so it would be hard to maintain real-time sync with the original index.
That said, creating a star tree index for existing segments is a good use case; we can re-evaluate and think of better approaches in later cycles.
Meanwhile, I have updated the Lucene GitHub issue, expanding the proposal for a new format which can create multi-field indices based on existing fields (doc values) as part of the indexing flow.
To have the initial implementation as an experimental feature, and to use existing Lucene interfaces without getting blocked on new changes, I propose the following implementation in OpenSearch.
We can extend `PerFieldMappingPostingFormatCodec`, or introduce a similar codec, with a custom implementation of `PerFieldDocValuesFormat`:
```java
public class PerFieldMappingPostingAndDVFormatCodec extends Lucene99Codec {
    private final Logger logger;
    private final MapperService mapperService;
    private final DocValuesFormat starTreeFormat = new StarTreeDocValuesFormat();

    @Override
    public DocValuesFormat getDocValuesFormatForField(String field) {
        // Placeholder check: is this field part of a star tree config
        // (one of the dimensions / metrics of the star tree fields)?
        if (isStarTreeField(field)) {
            return starTreeFormat;
        }
        return super.getDocValuesFormatForField(field);
    }
}
```
```java
public class StarTreeDocValuesFormat extends DocValuesFormat {

    public StarTreeDocValuesFormat() {
        super("startree");
    }

    @Override
    public DocValuesConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
        return new StarTreeDocValuesWriter(state.segmentInfo.getCodec().docValuesFormat().fieldsConsumer(state), state);
    }

    @Override
    public DocValuesProducer fieldsProducer(SegmentReadState state) throws IOException {
        return new StarTreeDocValuesReader(state.segmentInfo.getCodec().docValuesFormat().fieldsProducer(state), state);
    }
}
```
This includes changes for building the star tree as part of 'flush' and 'merge' through a custom doc values consumer.
Flush: the idea is to check, as part of each doc values field's flush, whether we can build the star tree using the fields indexed so far. If so, we build it.
We can optimize this further to build the star tree only after all doc values fields are indexed (based on `state.fieldInfos`).
```java
public class StarTreeDocValuesWriter extends DocValuesConsumer {

    private final DocValuesConsumer delegate;
    private final List<StarTreeField> starFields;
    private boolean isMerge = false;

    Map<String, SortedNumericDocValues> numericDocValuesReaders = new HashMap<>();
    Map<String, SortedSetDocValues> sortedSetDocValuesReaders = new HashMap<>();

    public StarTreeDocValuesWriter(DocValuesConsumer delegate, SegmentWriteState segmentWriteState) {
        this.delegate = delegate;
        // Fetch star tree fields from the segment info attributes map
        this.starFields = getStarTreeFields(segmentWriteState.segmentInfo.getAttributes());
    }

    /**
     * Flush related changes
     */
    @Override
    public void addSortedNumericField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
        delegate.addSortedNumericField(field, valuesProducer);
        numericDocValuesReaders.put(field.name, valuesProducer.getSortedNumeric(field));
        for (StarTreeField starField : starFields) {
            if (!isMerge && canBuildStarTree(starField)) {
                buildStarTree(starField);
            }
        }
    }

    @Override
    public void addSortedSetField(FieldInfo field, DocValuesProducer valuesProducer) throws IOException {
        delegate.addSortedSetField(field, valuesProducer);
        sortedSetDocValuesReaders.put(field.name, valuesProducer.getSortedSet(field));
        for (StarTreeField starField : starFields) {
            if (!isMerge && canBuildStarTree(starField)) {
                buildStarTree(starField);
            }
        }
    }

    private boolean canBuildStarTree(StarTreeField starField) {
        // If all the dimension fields and metric fields of this star tree field
        // are present in 'sortedSet/numericDocValuesReaders', return 'true',
        // as we will be able to build the star tree using these indexed values
    }

    private void buildStarTree(StarTreeField field) {
        // Build star tree
        // Mark this star tree field as 'indexed'
    }

    /**
     * Merge related changes
     */
    @Override
    public void merge(MergeState mergeState) throws IOException {
        // Mark this flag as true, as merge also calls the "add...Field" methods
        // and we have a different way to build the star tree during merge
        isMerge = true;
        super.merge(mergeState);
        isMerge = false;
        createStarTreeMerge(mergeState);
    }
}
```
We can have an interface which provides the 'read star tree' capability to the custom doc values readers:
```java
/**
 * DocValues readers containing custom logic for star tree can implement this interface
 */
public interface StarTreeReader {
    StarTreeAggregatedValues getStarTreeValues(String field) throws IOException;
}

/**
 * Custom doc values reader for star tree, also implementing StarTreeReader
 */
public class StarTreeDocValuesReader extends DocValuesProducer implements StarTreeReader {

    public StarTreeDocValuesReader() {
        // Read/initialize all doc values fields
        // Read/initialize all star tree fields
    }

    // ... all DocValuesProducer methods ...

    /**
     * The problem of how to access the star tree fields above is solved
     * via this method
     */
    @Override
    public StarTreeAggregatedValues getStarTreeValues(String field) throws IOException {
        // Look up and return the star tree structures initialized for this field
    }
}
```
We get the doc values reader from the segment, and during star tree query/aggregation we check whether it is of type `StarTreeReader`; if so, we can read the associated data structures:
```java
public class StarTreeQuery extends Query {

    // Sketch: in practice this logic lives in the Scorer created by the
    // query's Weight; it is shown inline here for brevity
    @Override
    public Scorer scorer(LeafReaderContext context) throws IOException {
        SegmentReader reader = Lucene.segmentReader(context.reader());
        // Get the 'StarTreeReader' instance so that we can get StarTreeValues
        if (!(reader.getDocValuesReader() instanceof StarTreeReader)) return emptyScorer();
        StarTreeReader starTreeDocValuesReader = (StarTreeReader) reader.getDocValuesReader();
        StarTreeAggregatedValues starTreeValues = starTreeDocValuesReader.getStarTreeValues(field);
        // Custom logic using the starTreeValues above
    }
}

public class StarTreeAggregator extends BucketsAggregator {

    @Override
    protected LeafBucketCollector getLeafCollector(LeafReaderContext ctx, LeafBucketCollector sub) throws IOException {
        SegmentReader reader = Lucene.segmentReader(ctx.reader());
        if (!(reader.getDocValuesReader() instanceof StarTreeReader)) return null;
        StarTreeReader starTreeDocValuesReader = (StarTreeReader) reader.getDocValuesReader();
        StarTreeAggregatedValues starTreeValues = starTreeDocValuesReader.getStarTreeValues(field);
        // Collect using the star tree's doc values
    }
}
```
Thanks @bharath-techie for sharing this approach. I think this provides a clear way to achieve this within OpenSearch. In order to support aggregations and queries using the existing interfaces, we will need changes in both query rewriting and the aggregations framework, I guess. Can we come up with a high-level idea of the changes required?
Tagging @shwetathareja @msfroh @backslasht @jainankitk as well to gather thoughts on this.
@bharath-techie -- I need to dig into your POC code, but I really like this approach to let us fast-track the feature in OpenSearch while we work on adding it in Lucene.
@bharath-techie - Thank you for writing this detailed proposal. I really like the scale of potential improvements with this index, but I'm also wondering if the use case is generic enough to warrant a separate index. I might be wrong, since I'm yet to go through the paper, but some considerations:
That being said, this is really good stuff, given it is not default (customers need to explicitly enable it) and there would be some stats/multi-dimensional aggregation workloads.
Thanks for the detailed feedback @jainankitk. Let me try to address some of them.
> Isn't it possible to account for deleted documents within the star tree index during segment merge?

We've discussed some possible approaches for this here. It has a cost attached to it; we need to see which is better overall.
> Also, are we planning to have a setting for disabling any updates/deletes completely if the customer has the star tree index enabled?

There is an RFC which discusses this.
> I'm not sure if minute/hour/day/year suffice

This is just a case for the POC; we can extend it to all the buckets search currently supports. For instance, the hour bucket in the POC gives the same buckets as an hourly histogram; that was the intention, and it can be extended to all the other granularities.
> We already have 4 dimensions for date, which is a common field in most documents; the moment I add a few other fields, the size of the tree blows up and the gains start reducing
Good point. We've not locked in the implementation for date - we could also go with just supporting, say, 'minute' and then deriving higher time aggregations at query time if that's performant, and we can also have multiple star trees for different time histograms.
The size/performance of the tree depends more on the cardinality of the fields than on the number of fields (more fields with low cardinality still worked well during our benchmarks).
But your point is valid; we will try to come up with approximate estimates of the resultant star tree size and performance based on cardinality and number of fields. (This is feedback I got during multiple discussions as well.)
> For example - hourly aggregation is already 100x faster. So, would love to see the benchmark with the latest version 2.14 if possible.
I did another round of benchmarks on 2.14 - date histograms are indeed blazing fast 🔥 But when we couple other aggregations with them - say status terms or sum aggs - it's slow again, as the performance is proportional to the number of documents collected/traversed via doc values. This was discussed to some extent earlier in this RFC (as part of the `Weight.count()` optimization).
Benchmarks were done on an AWS EC2 r5.4xlarge instance on 2.14.
| Query (aggregation) | Median throughput | 90th percentile latency |
|---|---|---|
| Hourly histogram with status terms | 0.21 ops/s | 4.75 seconds |
| Monthly histogram with status terms | 0.28 ops/s | 3.65 seconds |
| Yearly histogram with status terms | 0.28 ops/s | 3.65 seconds |
Config: minute, hour, month, year, clientIp (1.1 million cardinality), status
| Query (aggregation) | Median throughput | 90th percentile latency |
|---|---|---|
| Hourly histogram with status terms | 3.18 ops/s | 329 ms |
| Monthly histogram with status terms | 3.46 ops/s | 306 ms |
| Yearly histogram with status terms | 3.51 ops/s | 299 ms |
Our OpenSearch/Lucene queries are already heavily optimized, but for most complex aggregations, efficiency will be capped by the number of docs to traverse/collect, which the star tree solves.
The RFC so far doesn't capture results for terms/multi-terms aggregations (we need to address some gaps like top-n aggs, handling high-cardinality terms, etc.), the cases we found to be the most performant.
@bharath-techie - Thank you for patiently addressing my comments and sharing some numbers with multi-field aggregation.
> Good point. We've not locked in the implementation for date - we could also go with just supporting, say, 'minute' and then deriving higher time aggregations at query time if that's performant

We can't go about storing just 'minute', right? Because then there needs to be a separate bucket for every minute of every hour, day, year. I might be wrong, since my understanding of the star tree is limited.
> and also we can have multiple star trees for different time histograms.

Yeah, this is an option. But again, we need to be careful about how much we are asking customers to configure. Ideally, we would like workload-tailored indices for optimal performance, but we chose BKD trees and term dictionaries because they work for most use cases and the customer need not worry about them.
> Our OpenSearch/Lucene queries are already heavily optimized, but for most complex aggregations, efficiency will be capped by the number of docs to traverse/collect, which the star tree solves.
+1. Iterating through each of the documents is indeed very slow! :(
> This is just a case for the POC; we can extend it to all the buckets search currently supports. For instance, the hour bucket in the POC gives the same buckets as an hourly histogram; that was the intention, which can be extended to all the other granularities.

Maybe we can try a POC covering some or most of the date histogram aggregations.
> We can't go about storing just 'minute', right? Because then there needs to be a separate bucket for every minute of every hour, day, year.

Not sure I fully understand, but to elaborate on the idea: at query time, we can get the docs of all the nodes of the 'minute' dimension, and while collecting the docs as part of the aggregation, we bucketize them as per the requested histogram. The tradeoff is higher query latency, but the user saves storage space (as we need to go through all 'minute' dimension nodes and hence collect/traverse a higher number of docs).
> we need to be careful about how much we are asking customers to configure

Yeah, +1. We will integrate with templates initially, and also other such mechanisms to suggest the best configuration to users. Multiple star trees is something we will expose to customers incrementally, with the base functionality to support them already built.
> we can try a POC covering some or most of the date histogram aggregations.

Do you see any challenges? If the timestamp rounding while building the star tree index uses the same logic as the query, we should be able to do it, right?
> Not sure I fully understand, but to elaborate on the idea: at query time, we can get the docs of all the nodes of the 'minute' dimension, and while collecting the docs as part of the aggregation, we bucketize them as per the requested histogram.

I am assuming you want to have, say, only status and minute dimensions from the above example. In that case, the buckets for the minute dimension cannot be only 0-59 or 1-60, because you need to further qualify each of those minutes into hours/days. Else, how can I answer a query like a minutely aggregation on a range query from 2024-01-01 to 2024-01-02 without iterating over all the documents?
> In that case, the buckets for the minute dimension cannot be only 0-59 or 1-60.

Ah, I get your question. I'm not mapping the minute to 0-59; rather, I'm rounding off the epoch millis/seconds to the epoch minute. We'll then be able to round off the epoch minute to the epoch hour or higher time dimensions during aggregation and bucketize as per the query.
Code reference in star tree.
Rounding is directly based on this in date histogram; in fact, we can try reusing the same class if possible.
Apart from this, I think there is a way to give a fixed interval as well in the histogram, which we'll need to evaluate.
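Assuming the rounding works as described (flooring epoch millis to the minute, then deriving coarser buckets from the minute values at query time), a minimal sketch; the class and method names are made up:

```java
import java.time.Instant;

// Floor epoch millis to minute granularity at index time; coarser
// granularities (hour, day, ...) can be derived from the minute-rounded
// value at query time with the same floor-division trick.
public class EpochRounding {
    public static long roundToMinute(long epochMillis) {
        return (epochMillis / 60_000L) * 60_000L;       // floor to the minute
    }

    public static long minuteToHour(long minuteRoundedMillis) {
        return (minuteRoundedMillis / 3_600_000L) * 3_600_000L; // floor to the hour
    }

    public static void main(String[] args) {
        long ts = Instant.parse("2024-01-01T10:42:37.123Z").toEpochMilli();
        long minute = roundToMinute(ts);
        long hour = minuteToHour(minute);
        System.out.println(Instant.ofEpochMilli(minute)); // 2024-01-01T10:42:00Z
        System.out.println(Instant.ofEpochMilli(hour));   // 2024-01-01T10:00:00Z
    }
}
```

Because the minute value is a full epoch timestamp (not a 0-59 bucket), it unambiguously identifies the hour, day, and year it belongs to, which addresses the concern raised above. (Calendar-aware intervals such as months would need proper date arithmetic rather than fixed divisors.)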
> Ah, I get your question. I'm not mapping the minute to 0-59; rather, I'm rounding off the epoch millis/seconds to the epoch minute.

But doesn't that make the cardinality "too high"? Just thinking about a few months of log data, the cardinality becomes ~100k for 2 months.
> Rounding is directly based on this in date histogram; in fact, we can try reusing the same class if possible.

Rounding is a slightly expensive operation, though still much faster than individual document iteration. Maybe we can have multi-range traversal logic on the star tree similar to https://github.com/apache/lucene/issues/13335, assuming we can traverse the minutely rounded epochs in sorted order.
> But doesn't that make the cardinality "too high"? Just thinking about a few months of log data, the cardinality becomes ~100k for 2 months

A lot of users with time-series workloads create a per-day index, something like `logs-24-11-2024`. Since each index will have its own star tree for the set of dimensions/metrics, it will come down to the query side to handle these aggregations. Also, minutes is one option, but we can explore different ideas to handle queries at different time granularities.
I think, from where @bharath-techie is coming from, we can still control the cardinality somewhat if, in the dimension split order, we keep the timestamp at a higher level than the other dimensions.
Something like this: timestamp -> status -> client_ip. If we do it the other way round (timestamp last), we would see a lot more branching.
Yes, +1: choosing the right dimensionSplitOrder controls the storage size.
But most importantly: in a particular segment, what will the timestamp's cardinality be? For log/time-series use cases, as long as we have efficient merge policies (example: context-aware segments), we will most likely have only a few hours of data in a segment. So even if an index spans several months, in each segment the cardinality of the timestamp field will be quite low.
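A quick back-of-envelope on that point (the figures are illustrative, not benchmarks):

```java
// Distinct minute-granularity timestamp values: a segment holding a few
// hours of data vs. a long-lived index holding two months of data.
public class TimestampCardinality {
    public static long minuteBucketsForHours(long hours) {
        return hours * 60;
    }

    public static long minuteBucketsForDays(long days) {
        return days * 24 * 60;
    }

    public static void main(String[] args) {
        System.out.println(minuteBucketsForHours(3)); // per-segment: 180
        System.out.println(minuteBucketsForDays(60)); // 2-month index: 86400
    }
}
```

So the ~100k cardinality concern applies at the whole-index level, while per segment the minute dimension stays in the hundreds under the assumed merge behavior.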
> Maybe we can have multi-range traversal logic on the star tree, similar to https://github.com/apache/lucene/issues/13335
Right now the star tree acts as the lead iterator for the query; so far in the design, we have star tree + doc values. Once the star tree gives a set of documents for the dimensions that are in the tree, we filter further using doc values if there are any other dimensions in the filter (ones not part of the star tree as per the maxLeafDocs threshold).
Some time back, we evaluated the idea of using the point tree as well for range queries alongside the star tree, but the point tree works well as a lead iterator rather, right?
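The two-phase flow described above (star tree as lead iterator, doc values as the second-phase check) can be sketched roughly as follows. This is a simplified in-memory illustration only; real doc values are columnar on-disk structures, and the names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the two-phase match: the star tree (phase 1) has
// already narrowed the query down to a candidate set of doc IDs; phase 2
// checks a filter dimension that is NOT part of the tree against its
// per-document values (a plain array stands in for doc values here).
public class TwoPhaseFilter {
    static List<Integer> secondPhase(int[] starTreeCandidates,
                                     long[] valuesByDocId,
                                     long requiredValue) {
        List<Integer> matches = new ArrayList<>();
        for (int docId : starTreeCandidates) {
            if (valuesByDocId[docId] == requiredValue) {
                matches.add(docId);
            }
        }
        return matches;
    }
}
```

For example, if the star tree yields candidates {0, 2, 4} and the remaining filter requires status == 200, only the candidates whose doc value equals 200 survive phase 2.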
> Something like this: timestamp -> status -> client_ip. If we do it the other way round (timestamp last), we would see a lot more branching.
>
> Yes, +1. Choosing the right dimensionSplitOrder controls the storage size.
I am not completely sure what the right ordering is, especially between timestamp/status. For example, if we have a uniform distribution of status across timestamps, won't we have an equal number of buckets irrespective of the order of timestamp/status? Also, maybe status first is a better idea because you can quickly answer count queries on just the status code? Otherwise you need to go through all the timestamp values?
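A quick back-of-envelope check of the question above (purely illustrative numbers; this ignores star nodes and the maxLeafDocs threshold):

```java
// With a uniform cross product of dimension values, the number of distinct
// (dim1, dim2) leaf combinations is the same regardless of split order;
// what the order changes is the fan-out at the first level of the tree.
public class SplitOrder {
    static long leafCombinations(long cardA, long cardB) {
        return cardA * cardB; // order-independent under uniform distribution
    }

    static long firstLevelFanout(long firstDimCardinality) {
        return firstDimCardinality; // order-dependent
    }
}
```

So for, say, 1440 minute buckets and 5 status codes, both orders produce 7200 leaf combinations, but timestamp-first fans out to 1440 children at the first level versus 5 for status-first.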
> Some time back, we evaluated the idea of using the point tree as well for range queries alongside the star tree, but the point tree works well as a lead iterator rather, right?
Did we evaluate the case of the query and aggregation being on the same field, or even on different ones?
> But most importantly, in a particular segment, what will the timestamp's cardinality be? For log/time-series use cases, as long as we have efficient merge policies (https://github.com/opensearch-project/OpenSearch/issues/13183), we will most likely have only a few hours of data in a segment.
Sorry, my bad; I overlooked that the generation is at the segment level while doing my analysis.
> Also, maybe status first is a better idea because you can quickly answer count queries on just the status code? Otherwise you need to go through all the timestamp values?
The paper highly recommends ordering the dimensions from highest cardinality to lowest cardinality for better pruning during queries.
That said, we ran benchmarks with both orderings during the POC. With status first, we noticed significantly higher storage size (> 70% if I remember correctly, since the high-cardinality timestamp is repeated under each status).
And from a query perspective, status first indeed performed better for status-specific aggregations (without any range filter), but for range queries and date histograms we found timestamp first to be either faster or within 5-10% of status-first performance.
| Query | Status-first throughput | Timestamp-first throughput |
|---|---|---|
| 200s in range | 91.74 ops/s | 134.45 ops/s |
| 400s in range | 154.12 ops/s | 139.19 ops/s |
| 400s-agg | 186 ops/s | 112.82 ops/s |
| 200s-agg | 177 ops/s | 85.45 ops/s |
This is something we will continue to tune based on benchmarks/load testing before we roll out to production.
Note: the OSB benchmarks produce out-of-order data, so the timestamp has very high cardinality in all segments. When the timestamp has lower cardinality, as in production time-series workloads, or when coupled with context-aware segments for example, we can ideally put the user dimensions at the top and the timestamp columns below them.
> Did we evaluate the case of the query and aggregation being on the same field, or even on different ones?
We usually evaluate with the query and aggs on different fields, but I get where you're coming from, based on the histogram improvements.
We just bounced ideas around and didn't get around to any implementation, because we found star tree traversal to be decently fast, especially for a query on multiple dimensions; the latency again came mainly from doc values collection.
> This is something we will continue to tune based on benchmarks/load testing before we roll out to production.
Agreed. Needs more experimentation.
> Note: the OSB benchmarks produce out-of-order data, so the timestamp has very high cardinality in all segments. When the timestamp has lower cardinality, as in production time-series workloads, or when coupled with context-aware segments for example, we can ideally put the user dimensions at the top and the timestamp columns below them.
Good point, I keep forgetting about that. I am assuming that is because the sorted input data is partitioned across multiple threads?
> Good point, I keep forgetting about that. I am assuming that is because the sorted input data is partitioned across multiple threads?
Yes, that's right. But it looks like folks run multiple instances of the benchmark with `numclients:1` to bypass this limitation. So far, though, all benchmarks have been done without such modifications.
We are starting the implementation of this RFC, and the meta issue tracks the list of issues and milestones planned for the near future.
Please feel free to contribute / review / comment on this.
Is your feature request related to a problem? Please describe
Aggregations are the most used query type in observability use cases; the aggregation is typically on metrics, request logs, etc., spread across different fields. Heavy aggregation queries don't have an upper bound in terms of time taken and resource consumption. In general, the performance of an aggregation query scales with the number of documents.
Aggregations are faster when run on rolled-up indices and transforms, as these reduce granularity and provide materialized views. But they are generally built only once ingestion is complete; similar constructs don't exist for live/active data.
Existing solutions
Aggregation framework
The existing aggregation framework in OpenSearch is quite powerful, with bucket, pipeline, and metric aggregations. It supports a wide range of aggregations since it operates on the unaltered live documents of the indices, and because it operates on the live, original data, deletes and updates are also supported.
While it works great, it has a few cons:
Index rollups / Index transforms
Index rollup jobs let users reduce data granularity by rolling up old data into condensed indices; transform jobs let users create a different, summarized view of their data centered around certain fields. These solutions are generally used to save storage space, and queries against them are also faster since the number of documents is reduced.
Cons:
Describe the solution you'd like
Star tree index
I'm exploring the use case of pre-computing aggregations using a Star Tree index while indexing the data, based on the fields (dimensions and metrics) configured during index creation. This is inspired by http://hanj.cs.illinois.edu/pdf/vldb03_starcube.pdf and Apache Pinot's Star Tree index. A Star Tree helps enforce an upper bound on aggregation queries, ensuring predictable latency and resource usage; it is also storage-space efficient and configurable.
A Star Tree index is a multi-field index, in contrast to existing index structures in Lucene and OpenSearch, which are built on a single field.
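As a rough illustration of the pre-computation idea: documents that share the same dimension values are collapsed into one aggregated document carrying the metric accumulators. A minimal sketch (hypothetical field names; the real index stores sorted, aggregated documents as doc values rather than an in-memory map):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative pre-aggregation step: collapse raw documents that share the
// same dimension values (here: hour, status) into a single aggregated
// document holding {sum(requestSize), count}.
public class PreAggregate {
    // each input doc: {hour, status, requestSize}
    static Map<String, long[]> aggregate(long[][] docs) {
        Map<String, long[]> agg = new LinkedHashMap<>();
        for (long[] d : docs) {
            String key = d[0] + "|" + d[1]; // composite dimension key
            long[] acc = agg.computeIfAbsent(key, k -> new long[2]);
            acc[0] += d[2]; // running sum of requestSize
            acc[1] += 1;    // running count
        }
        return agg;
    }
}
```

A query like "average request size for hour 11, status 200" then reads one aggregated document (sum/count) instead of iterating every matching raw document, which is what bounds the query cost.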
Improvements with Star Tree index
While it provides the above improvements, it does have limitations:
Given the above details, it makes a strong case for why a Star Tree index is valuable to OpenSearch, and I'm working on prototyping it. I'll follow up on this issue with prototype details along with benchmark results.
Related component
Search:Aggregations
Describe alternatives you've considered
Mentioned in the Existing Solutions.
Additional context
No response