opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/

Optimizing Data Storage and Retrieval for Time Series data. #9568

Open nkumar04 opened 1 year ago

nkumar04 commented 1 year ago

Is your feature request related to a problem? Please describe. In OpenSearch, documents are stored using three primary formats: indexed, stored, and docValues. In the case of time series data, a document usually consists of dimensions, a time point, and quantitative measurements that are used to monitor various aspects of a system, process, or phenomenon. In such cases, numeric, keyword, and similar datatypes are written both as stored fields and as docValues, each serving a specific purpose for search performance and retrieval. DocValues are a columnar storage format used by Lucene to store indexed data in a way that facilitates efficient aggregation, sorting, etc., while stored fields hold the actual values of fields as they were inserted into the index. For example, let's look at a document consisting of performance-related data points for an EC2 instance.

{
    "hostName": "xyz",
    "hostIp": "x.x.x.x",
    "zone" : "us-east-1a"
    "@timestamp": 1693192062,
    "cpu": 80
    "jvm": 85
}

Here, hostName, hostIp, and zone use keyword as the field type, while @timestamp, cpu, and jvm use numeric field types. Values for dimensions and measurements are stored in docValues as well as in stored fields as part of _source. The most common search queries for such a data set are aggregations like min/max/avg/sum. As we can see, the data is stored twice here, and we could potentially avoid that duplication.
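
For illustration, a typical query over such a data set might look like the sketch below (the index name ec2-metrics and the exact aggregation structure are assumptions made purely for illustration; @timestamp is assumed to be mapped as a date):

POST /ec2-metrics/_search
{
    "size": 0,
    "query": {
        "range": { "@timestamp": { "gte": "now-1h" } }
    },
    "aggs": {
        "per_host": {
            "terms": { "field": "hostName" },
            "aggs": {
                "avg_cpu": { "avg": { "field": "cpu" } },
                "max_jvm": { "max": { "field": "jvm" } }
            }
        }
    }
}

Aggregations like these are served from docValues; the stored _source only comes into play when the documents themselves are fetched.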

Describe the solution you'd like Currently the _source field stores the original document as a stored field. We can possibly skip storing the _source field in such cases and retrieve the field values from docValues instead. This would help reduce the storage cost significantly. Based on the nature of the query, we can skip or fetch some or all of the fields from docValues to serve the search queries.
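
As a rough sketch of what this could look like for the example document above (the index name is hypothetical, and disabling _source via the existing mapping option is shown only to indicate where the saving would come from; the proposal is to make this behavior transparent for time series indexes rather than to require the setting):

PUT /ec2-metrics
{
    "mappings": {
        "_source": { "enabled": false },
        "properties": {
            "hostName":   { "type": "keyword" },
            "hostIp":     { "type": "keyword" },
            "zone":       { "type": "keyword" },
            "@timestamp": { "type": "date", "format": "epoch_second" },
            "cpu":        { "type": "integer" },
            "jvm":        { "type": "integer" }
        }
    }
}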

nkumar04 commented 1 year ago

@shwetathareja

mgodwan commented 1 year ago

While this may provide gains in storage/throughput, it introduces a few trade-offs:

  1. Execution of painless scripts: Painless scripts may rely on the _source field (see the script sketch after this list).
  2. Re-indexing relies on _source as well. There can be cases where customers choose to ignore fields for which mappings are not declared, yet those fields are still kept in _source; with the proposed optimization, such fields would not be stored anywhere.

These use cases may still apply to time-series data and should be evaluated before adopting the proposed optimization.
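
As a concrete illustration of the first trade-off, compare the two script fields in the sketch below (the index and field names reuse the hypothetical example from the issue description): doc['cpu'].value reads from doc values and would keep working without _source, whereas params['_source']['cpu'] would break if _source is no longer stored.

GET /ec2-metrics/_search
{
    "script_fields": {
        "cpu_from_doc_values": {
            "script": { "lang": "painless", "source": "doc['cpu'].value" }
        },
        "cpu_from_source": {
            "script": { "lang": "painless", "source": "params['_source']['cpu']" }
        }
    }
}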

shwetathareja commented 1 year ago

This should also reduce the merging overhead if _source is not stored explicitly.

+1 to @mgodwan: As we start looking into the solution, we need to carefully analyze the restrictions it would bring in terms of use cases, and whether those restrictions are acceptable for time series workloads specifically. Instead of skipping the _source completely, should we consider filtering the _source to exclude the specific fields for which doc_values is enabled?

msfroh commented 11 months ago

Can we try benchmarking something like this using the existing code?

It's already possible to exclude some/all fields from source and it's possible to explicitly request that fields be loaded from doc values.

That is, users can theoretically tune their index to do exactly what's being proposed here -- it sounds like we just want to make it easier (e.g. by defining a time series index type that excludes doc values fields from source and automatically retrieves those doc values fields at query time).
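
For instance, this is already expressible on the query side today; a minimal sketch against the hypothetical ec2-metrics index from the issue description (the corresponding _source excludes would live in the index mapping, similar to the http_logs diff further down this thread):

GET /ec2-metrics/_search
{
    "_source": false,
    "query": { "term": { "hostName": "xyz" } },
    "docvalue_fields": ["@timestamp", "cpu", "jvm"]
}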

I think the proposed change could help a lot, and we could measure that improvement now by making some changes to the http_logs workload.

msfroh commented 11 months ago

Related, if we can make retrieving from doc values "feel" the same as retrieving from source, we could transparently retrieve doc values instead of source if only fields with doc values are requested (to avoid decompressing the stored field block altogether).

shwetathareja commented 11 months ago

Related, if we can make retrieving from doc values "feel" the same as retrieving from source, we could transparently retrieve doc values instead of source if only fields with doc values are requested (to avoid decompressing the stored field block altogether).

Right @msfroh, yes, agreed, this optimization can help in all cases (_source enabled/disabled).

nkumar04 commented 11 months ago

@msfroh, you are right that we can exclude/include fields from source and request fields to be explicitly loaded from docValues. I tried benchmarking the indexing performance to test exclusion of fields that are also stored as docValues (mapping below).

Findings:

The overhead in this case is mostly due to the rewriting of _source at indexing time. (link)

When _source is completely omitted, there is a ~25% gain in storage, a slight gain in indexing throughput (~3-4%), and a ~3% gain in P50/P90 latency.

I am yet to benchmark the query performance (explicitly requesting that fields be loaded from doc values).

That is, users can theoretically tune their index to do exactly what's being proposed here -- it sounds like we just want to make it easier (e.g. by defining a time series index type that excludes doc values fields from source and automatically retrieves those doc values fields at query time).

Yes, the idea is to make these optimisations under the hood and make them the default for time series indexes. We can keep the _source field disabled by default if all fields can be fetched from docValues and the source can be generated at query time. In case some of the fields do not have docValues, we can keep those as stored fields, and at query time fetch partially from docValues and partially from stored fields if _source is requested (depending on the nature of the query).

git diff index.json
diff --git a/http_logs/index.json b/http_logs/index.json
index dfe1cd0..ea98d3a 100644
--- a/http_logs/index.json
+++ b/http_logs/index.json
@@ -7,7 +7,12 @@
   "mappings": {
     "dynamic": "strict",
     "_source": {
-      "enabled": {{ source_enabled | default(true) | tojson }}
+      "excludes": [
+        "@timestamp",
+        "clientip",
+        "status",
+        "size"
+      ]
     },
     "properties": {
       "@timestamp": {

msfroh commented 11 months ago

I'm going to open a spin-off issue to target just the querying side of this, because I think doing doc value retrieval to avoid decompressing stored source might be a quick win (independent of the indexing changes).