jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0

Storage backends for adaptive sampling #3305

Open yurishkuro opened 3 years ago

yurishkuro commented 3 years ago

Since v1.27 adaptive sampling is supported in the backend, but it only works with Cassandra as the backing store. We need to implement it for other types of stores, e.g. Elasticsearch and Badger.

srikanthccv commented 2 years ago

I wanted to try out this feature but realised it's not supported for other backends. I can take a stab at this if nobody is already working on it.

albertteoh commented 2 years ago

That would be appreciated, @lonewolf3739.

james-ryans commented 1 year ago

Hi, is anyone working on this? I would like to work on Elasticsearch storage support.

james-ryans commented 1 year ago

I have some questions before I start implementing the feature.

  1. What is the purpose of the bucket column in the operation_throughput and sampling_probabilities tables in the Cassandra storage backend? Is it solely for performance, or are there other considerations I'm missing?
  2. Do I need to use the index-per-day pattern? Do I need to support rollover and index-cleaner for adaptive sampling?

Here is my idea for storing the documents; feedback is welcome!

jaeger-throughputs

Would it be better to encode the service, operation, count, and probabilities fields into a single string, since we only query by the timestamp field? (A retrieval sketch follows the example below.)

// mapping
{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "long"
      },
      "service": {
        "type": "keyword",
        "index": false
      },
      "operation": {
        "type": "keyword",
        "index": false
      },
      "count": {
        "type": "long",
        "index": false
      },
      "probabilities": {
        "type": "keyword",
        "index": false
      }
    }
  }
}
// example
{
  "timestamp": 1485467191639875,
  "service": "svc",
  "operation": "op",
  "count": 40,
  "probabilities": ["0.1", "0.5"]
}
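
Here is a minimal sketch in Go of why indexing only timestamp is enough: retrieval is a pure range scan over the aggregation window, and every other field is read back from _source. The endpoint, index name, and document shape are assumptions taken from the proposal above, not anything Jaeger ships today.

// Hypothetical sketch: query the proposed jaeger-throughputs index for an
// aggregation window. Only "timestamp" needs an index because retrieval is
// a pure range scan; all other fields come back from _source.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// throughputDoc mirrors the proposed document shape (timestamps in microseconds).
type throughputDoc struct {
	Timestamp     int64    `json:"timestamp"`
	Service       string   `json:"service"`
	Operation     string   `json:"operation"`
	Count         int64    `json:"count"`
	Probabilities []string `json:"probabilities"`
}

// queryThroughput issues a range query over [start, end) against a local
// Elasticsearch; the URL and index name are assumptions for this sketch.
func queryThroughput(start, end time.Time) (*http.Response, error) {
	body := map[string]any{
		"query": map[string]any{
			"range": map[string]any{
				"timestamp": map[string]any{"gte": start.UnixMicro(), "lt": end.UnixMicro()},
			},
		},
	}
	buf, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	return http.Post("http://localhost:9200/jaeger-throughputs/_search",
		"application/json", bytes.NewReader(buf))
}

func main() {
	resp, err := queryThroughput(time.Now().Add(-time.Minute), time.Now())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}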

jaeger-probabilities-and-qps

// mapping
{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "long"
      },
      "hostname": {
        "type": "keyword",
        "index": false
      },
      "probabilities": {
        "type": "object",
        "dynamic": false,
        "properties": {
          "operations": {
            "type": "object",
            "dynamic": false,
            "properties": {
              "operation": {
                "type": "keyword",
                "index": false
              },
              "probability": {
                "type": "keyword",
                "index": false
              },
              "qps": {
                "type": "long",
                "index": false
              }
            }
          },
          "service": {
            "type": "keyword",
            "index": false
          }
        }
      }
    }
  }
}
// example
{
  "timestamp": 1485467191639875,
  "hostname": "localhost",
  "probabilities": [
    {
      "service": "svc",
      "operations": [
        {
          "operation": "op1",
          "probability": "0.1",
          "qps": 40
        },
        {
          "operation": "op2",
          "probability": "0.2",
          "qps": 50
        }
      ]
    },
    {
      "service": "another_svc",
      "operations": [
        {
          "operation": "op3",
          "probability": "0.4",
          "qps": 20
        },
        {
          "operation": "op4",
          "probability": "0.5",
          "qps": 30
        }
      ]
    }
  ]
}
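
On the read side, the collector mostly needs the newest snapshot, which is why timestamp is the only indexed field here as well. A minimal sketch (assuming the hypothetical jaeger-probabilities-and-qps index above on a local Elasticsearch) of fetching the latest document:

// Hypothetical example: fetch the single most recent probabilities-and-qps
// document by sorting on the indexed timestamp field.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// size:1 plus a descending sort on timestamp returns only the newest doc.
	query := `{"size": 1, "sort": [{"timestamp": {"order": "desc"}}]}`
	resp, err := http.Post(
		"http://localhost:9200/jaeger-probabilities-and-qps/_search",
		"application/json",
		strings.NewReader(query),
	)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}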

Since Elasticsearch 5+ does not support the _ttl mapping, my idea to overcome that limitation is to store an expire_timestamp and check it on retrieval to decide whether the lease has expired. This approach also works well if we need to support the index-per-day pattern, which can be scaled easily with es-rollover and es-index-cleaner. One of the biggest advantages of this solution is that it supports millisecond (or microsecond) granularity. A sketch of the expiry check follows the example below.

jaeger-leases

// mapping
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "owner": {
        "type": "keyword"
      },
      "expire_timestamp": {
        "type": "long"
      }
    }
  }
}
// example
{
  "name": "sampling_store_leader",
  "owner": "localhost",
  "expire_timestamp": 1681998717000000
}
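
For illustration, here is a minimal sketch of the expiry check; it is an assumption about how the logic could look, not Jaeger's leader-election code. A lease is only honored while expire_timestamp lies in the future, and acquiring it means writing a new document with a pushed-out expiry.

// Hypothetical sketch: replace _ttl by comparing expire_timestamp against
// the current time whenever the lease document is read back.
package main

import (
	"fmt"
	"time"
)

// lease mirrors the proposed jaeger-leases document.
type lease struct {
	Name            string `json:"name"`
	Owner           string `json:"owner"`
	ExpireTimestamp int64  `json:"expire_timestamp"` // microseconds since epoch
}

// expired reports whether the lease is past its expiry at time now.
func (l lease) expired(now time.Time) bool {
	return now.UnixMicro() >= l.ExpireTimestamp
}

func main() {
	l := lease{
		Name:            "sampling_store_leader",
		Owner:           "localhost",
		ExpireTimestamp: time.Now().Add(30 * time.Second).UnixMicro(),
	}
	fmt.Println("expired now?", l.expired(now()))
	fmt.Println("expired in 1m?", l.expired(now().Add(time.Minute)))
}

// now is a tiny helper to keep the example readable.
func now() time.Time { return time.Now() }
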
yurishkuro commented 1 year ago

What is the purpose of the bucket column in the operation_throughput and sampling_probabilities tables in the Cassandra storage backend? Is it solely for performance, or are there other considerations I'm missing?

bucket in Cassandra is used to avoid hot spots in the hash ring (bucket is a random number 1..n): without this field the primary key would be just the timestamp, and all collectors write sampling data at the same time.
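
To make the idea concrete, here is a small hypothetical sketch (the real schema and statement live in Jaeger's Cassandra plugin) of a writer prefixing the partition key with a random bucket so that simultaneous collector writes spread across n partitions instead of all landing on the current-timestamp partition:

// Hypothetical sketch of the bucketing trick: pick a random bucket 1..n and
// make it part of the partition key, spreading concurrent writes across
// n partitions instead of one hot "current timestamp" partition.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const numBuckets = 10 // "n"; the real value is defined by the Cassandra schema

func main() {
	bucket := rand.Intn(numBuckets) + 1
	ts := time.Now().UnixMicro()
	// Illustrative CQL only; see the Cassandra plugin for the real statement.
	cql := "INSERT INTO operation_throughput (bucket, ts, throughput) VALUES (?, ?, ?)"
	fmt.Printf("would execute %q with bucket=%d ts=%d\n", cql, bucket, ts)
}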

Do I need to use index-per-day pattern? Do I need to support rollover and index-cleaner for adaptive sampling?

I think it should be treated like any other index. The main difference between sampling data and trace/span data is that, while both are always growing, sampling data is only valuable for the last N writes. The LAST write is the most important, as it provides the initial seed of the probabilities, while the N last writes are used to compute the next iteration of sampling probabilities (e.g. using exponential decay of the older data). In theory, the whole adaptive sampling storage could be modeled with these N slots (in a round-robin fashion), but in practice we found it useful to keep the history for a few days in order to investigate how sampling rates change over time. Hence my suggestion to use the same TTL / rotation / rollover as the main span indices (this also makes the implementation simpler and maintenance streamlined).
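
As a sketch of the decay idea (the constants and weighting scheme here are illustrative assumptions, not Jaeger's actual calculation), the last N throughput buckets can be combined so that the newest write dominates while older ones still contribute:

// Hypothetical sketch: weighted QPS over the last N buckets, newest first,
// with exponentially decaying weights for older buckets.
package main

import (
	"fmt"
	"math"
)

// decayedQPS computes a weighted QPS from per-interval counts (newest first).
// intervalSeconds is the length of each aggregation bucket; lambda controls
// how quickly older buckets lose influence.
func decayedQPS(countsNewestFirst []int64, intervalSeconds, lambda float64) float64 {
	var weightedSum, weightTotal float64
	for i, c := range countsNewestFirst {
		w := math.Exp(-lambda * float64(i)) // weight 1.0 for the newest bucket
		weightedSum += w * float64(c) / intervalSeconds
		weightTotal += w
	}
	if weightTotal == 0 {
		return 0
	}
	return weightedSum / weightTotal
}

func main() {
	// Newest bucket saw 120 spans, older buckets progressively fewer.
	counts := []int64{120, 80, 60, 40}
	fmt.Printf("decayed QPS: %.2f\n", decayedQPS(counts, 60, 0.5))
}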

slayer321 commented 1 year ago

Hey @yurishkuro, I'd like to work on implementing Badger storage support. Currently I'm going through the memory-only and Cassandra implementations and will share more on the Badger implementation in some time.

yurishkuro commented 1 year ago

@slayer321 I would strongly recommend starting with adding new tests in the storage e2e integration test, which today does not cover sampling storage. Then you will have a clear blueprint of what needs to be implemented in another backend.
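For a flavor of what such coverage could look like, here is a rough sketch. The interface below is a simplified hypothetical stand-in for Jaeger's sampling store interface, so the real method names and signatures should be taken from the storage integration test suite rather than from this example.

// Hypothetical sketch of a backend-agnostic round-trip test: write
// throughput, then read it back over a window that contains the write.
package integration

import (
	"testing"
	"time"
)

// samplingStore is a trimmed-down, invented stand-in for the real interface.
type samplingStore interface {
	InsertThroughput(service, operation string, count int64, ts time.Time) error
	GetThroughput(start, end time.Time) (map[string]int64, error)
}

// runSamplingStoreRoundTrip would be called once per backend under test.
func runSamplingStoreRoundTrip(t *testing.T, store samplingStore) {
	now := time.Now()
	if err := store.InsertThroughput("svc", "op", 40, now); err != nil {
		t.Fatalf("insert failed: %v", err)
	}
	got, err := store.GetThroughput(now.Add(-time.Minute), now.Add(time.Minute))
	if err != nil {
		t.Fatalf("read failed: %v", err)
	}
	if got["svc/op"] != 40 {
		t.Fatalf("expected count 40 for svc/op, got %d", got["svc/op"])
	}
}

Each new backend (Elasticsearch, Badger, ...) would then plug its own store into the shared helper, making the required behavior explicit before any implementation work starts.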

Pushkarm029 commented 8 months ago

I would like to implement Adaptive Sampling support for Elasticsearch.

akagami-harsh commented 8 months ago

hey @Pushkarm029, are you working on it?

Pushkarm029 commented 8 months ago

@akagami-harsh, yeah, I am halfway. I will complete it within 2-3 days.

Pushkarm029 commented 7 months ago

Should we update the documents to reflect the current state?

Adaptive sampling requires a storage backend to store the observed traffic data and computed probabilities. At the moment memory (for all-in-one deployment) and cassandra are supported as sampling storage backends. We are seeking help in implementing support for other backends (tracking issue).

https://www.jaegertracing.io/docs/1.54/sampling/#adaptive-sampling

yurishkuro commented 7 months ago

yes

gmandrade21 commented 6 months ago

@yurishkuro is anybody currently working on this feature for the OpenSearch backend?

yurishkuro commented 6 months ago

OpenSearch is already supported via Elasticsearch code (they are the same)

rsafonseca commented 5 months ago

Is it really supported?

When I try to start jaeger-collector (tested with 1.55.0 and 1.56.0) with SAMPLING_STORAGE_TYPE=elasticsearch I get the following:

{"level":"fatal","ts":1712826901.3422914,"caller":"collector/main.go:92","msg":"Failed to create sampling store factory","error":"storage factory of type elasticsearch does not support sampling store","stacktrace":"main.main.func1\n\tgithub.com/jaegertracing/jaeger/cmd/collector/main.go:92\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tgithub.com/jaegertracing/jaeger/cmd/collector/main.go:157\nruntime.main\n\truntime/proc.go:271"}

In addition, according to the docs, "By default adaptive sampling will attempt to use the backend specified by SPAN_STORAGE_TYPE to store data." But if I set SPAN_STORAGE_TYPE=elasticsearch and don't set SAMPLING_STORAGE_TYPE, I get this when starting the collector:

{"level":"fatal","ts":1712825412.326171,"caller":"collector/main.go:97","msg":"Failed to init sampling strategy store factory","error":"sampling store factory is nil. Please configure a backend that supports adaptive sampling","stacktrace":"main.main.func1\n\tgithub.com/jaegertracing/jaeger/cmd/collector/main.go:97\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tgithub.com/jaegertracing/jaeger/cmd/collector/main.go:157\nruntime.main\n\truntime/proc.go:271"}

yurishkuro commented 5 months ago

@Pushkarm029 can you please take a look at this ^ report?

Pushkarm029 commented 5 months ago

@Pushkarm029 can you please take a look at this ^ report?

👀 Looking into it.