yurishkuro opened 3 years ago
I wanted to try out this feature but realised not supported for different backends. I can take a stab at this if nobody is already working on it.
That would be appreciated, @lonewolf3739.
Hi, is anyone working on this? I would like to work on Elasticsearch storage support.
I have some questions before I start implementing the feature.
What is the purpose of the `bucket` column in the `operation_throughput` and `sampling_probabilities` tables in the Cassandra storage backend? Is it solely for performance, or are there other considerations I'm missing?

Here is my idea for storing the documents; feedback is welcome!
`jaeger-throughputs`

Would it be better to encode the `service`, `operation`, `count`, and `probabilities` fields into a single string, since we only ever query by the `timestamp` field?
```json
// mapping
{
  "mappings": {
    "properties": {
      "timestamp": { "type": "long" },
      "service": { "type": "keyword", "index": false },
      "operation": { "type": "keyword", "index": false },
      "count": { "type": "long", "index": false },
      "probabilities": { "type": "keyword", "index": false }
    }
  }
}
// example
{
  "timestamp": 1485467191639875,
  "service": "svc",
  "operation": "op",
  "count": 40,
  "probabilities": ["0.1", "0.5"]
}
```
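Since only `timestamp` is indexed in the mapping above, retrieval reduces to a single range query. A hypothetical example of what that query could look like (the bound values are illustrative, not taken from the actual implementation):

```json
{
  "query": {
    "range": {
      "timestamp": {
        "gte": 1485467191639875,
        "lte": 1485467195639875
      }
    }
  }
}
```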
`jaeger-probabilities-and-qps`

```json
// mapping
{
  "mappings": {
    "properties": {
      "timestamp": { "type": "long" },
      "hostname": { "type": "keyword", "index": false },
      "probabilities": {
        "type": "object",
        "dynamic": false,
        "properties": {
          "operations": {
            "type": "object",
            "dynamic": false,
            "properties": {
              "operation": { "type": "keyword", "index": false },
              "probability": { "type": "keyword", "index": false },
              "qps": { "type": "long", "index": false }
            }
          },
          "service": { "type": "keyword", "index": false }
        }
      }
    }
  }
}
// example
{
  "timestamp": 1485467191639875,
  "hostname": "localhost",
  "probabilities": [
    {
      "service": "svc",
      "operations": [
        { "operation": "op1", "probability": "0.1", "qps": 40 },
        { "operation": "op2", "probability": "0.2", "qps": 50 }
      ]
    },
    {
      "service": "another_svc",
      "operations": [
        { "operation": "op3", "probability": "0.4", "qps": 20 },
        { "operation": "op4", "probability": "0.5", "qps": 30 }
      ]
    }
  ]
}
```
Since Elasticsearch 5+ no longer supports the `_ttl` mapping, my idea to work around the limitation is to store an `expire_timestamp` field and check whether the lease has expired when we retrieve it. This approach also works well if we need to support an index-per-day pattern, which can be easily scaled with es-rollover and es-index-cleaner. One of the biggest advantages of this solution is that it supports millisecond (or microsecond) granularity.
`jaeger-leases`

```json
// mapping
{
  "mappings": {
    "properties": {
      "name": { "type": "keyword" },
      "owner": { "type": "keyword" },
      "expire_timestamp": { "type": "long" }
    }
  }
}
// example
{
  "name": "sampling_store_leader",
  "owner": "localhost",
  "expire_timestamp": 1681998717000000
}
```
> What is the purpose of the `bucket` column in the `operation_throughput` and `sampling_probabilities` tables in the Cassandra storage backend? Is it solely for performance, or are there other considerations I'm missing?
`bucket` in Cassandra is used to avoid hot spots in the hash ring (the bucket is a random number 1..n), because without this field the primary key is just the timestamp, and all collectors write sampling data at the same time.
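To illustrate the idea, here is a hedged sketch of how a collector could spread concurrent writes across buckets; the bucket count and the `partitionKey` helper are illustrative assumptions, not the actual Cassandra schema code:

```go
package main

import (
	"fmt"
	"math/rand"
)

const numBuckets = 10 // illustrative; the real schema chooses its own range

// partitionKey pairs a random bucket with the write timestamp, so that
// simultaneous writes from many collectors land on different Cassandra
// partitions instead of all hashing to the same (timestamp-only) key.
func partitionKey(timestampMicros int64) (bucket int, ts int64) {
	return rand.Intn(numBuckets), timestampMicros
}

func main() {
	b, ts := partitionKey(1485467191639875)
	fmt.Println(b >= 0 && b < numBuckets, ts)
}
```

Reads then have to fan out over all n buckets for a given time range, which is the price paid for the even write distribution.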
> Do I need to use an index-per-day pattern? Do I need to support rollover and index-cleaner for adaptive sampling?
I think it should be treated like any other index. The main difference between sampling data and trace/span data is that while both keep growing, the sampling data is only valuable for the last N writes. The last write is the most important as it provides the initial seed of the probabilities, while the N last writes are used to compute the next iteration of sampling probabilities (e.g. using exponential decay of the older data). In theory, the whole adaptive sampling storage could be modeled with these N slots (used in round-robin fashion), but in practice we found it useful to keep the history for a few days in order to investigate how sampling rates change over time. Hence my suggestion to use the same TTL / rotation / rollover as the main span indices (it also makes the implementation simpler and maintenance streamlined).
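The exponential-decay idea above can be sketched as a weighted average where each older slot counts less; the `decayedQPS` helper and the decay factor are illustrative assumptions, not Jaeger's actual constants:

```go
package main

import "fmt"

// decayedQPS combines the last N throughput observations, newest first,
// weighting observation i by decay^i so that recent traffic dominates
// while older slots still contribute to the next probability iteration.
func decayedQPS(observations []float64, decay float64) float64 {
	var weighted, norm float64
	w := 1.0
	for _, qps := range observations {
		weighted += w * qps
		norm += w
		w *= decay
	}
	if norm == 0 {
		return 0
	}
	return weighted / norm
}

func main() {
	// Newest observation first: 40 qps now, 20 qps in the previous slot.
	fmt.Println(decayedQPS([]float64{40, 20}, 0.5))
}
```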
Hey @yurishkuro, I'd like to work on implementing Badger storage support. I'm currently going through the memory-only and Cassandra implementations and will share more on the Badger implementation in some time.
@slayer321 I would strongly recommend starting with adding new tests in the storage e2e integration test, which today does not cover sampling storage. Then you will have a clear blueprint of what needs to be implemented in another backend.
I would like to implement Adaptive Sampling support for Elasticsearch.
hey @Pushkarm029, are you working on it?
@akagami-harsh, yeah, I am halfway. I will complete it within 2-3 days.
Should we update the documentation to reflect the current state?

> Adaptive sampling requires a storage backend to store the observed traffic data and computed probabilities. At the moment memory (for all-in-one deployment) and cassandra are supported as sampling storage backends. We are seeking help in implementing support for other backends ( tracking issue  ).

https://www.jaegertracing.io/docs/1.54/sampling/#adaptive-sampling

Yes.
@yurishkuro is somebody currently working on the OpenSearch backend for this feature?
OpenSearch is already supported via Elasticsearch code (they are the same)
Is it really supported?
When I try to start jaeger-collector (tested with 1.55.0 and 1.56.0) with `SAMPLING_STORAGE_TYPE=elasticsearch`, I get the following:

```
{"level":"fatal","ts":1712826901.3422914,"caller":"collector/main.go:92","msg":"Failed to create sampling store factory","error":"storage factory of type elasticsearch does not support sampling store","stacktrace":"main.main.func1\n\tgithub.com/jaegertracing/jaeger/cmd/collector/main.go:92\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tgithub.com/jaegertracing/jaeger/cmd/collector/main.go:157\nruntime.main\n\truntime/proc.go:271"}
```
In addition, according to the docs, "By default adaptive sampling will attempt to use the backend specified by SPAN_STORAGE_TYPE to store data." But if I set `SPAN_STORAGE_TYPE=elasticsearch` and don't set `SAMPLING_STORAGE_TYPE`, I get this when starting the collector:

```
{"level":"fatal","ts":1712825412.326171,"caller":"collector/main.go:97","msg":"Failed to init sampling strategy store factory","error":"sampling store factory is nil. Please configure a backend that supports adaptive sampling","stacktrace":"main.main.func1\n\tgithub.com/jaegertracing/jaeger/cmd/collector/main.go:97\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tgithub.com/jaegertracing/jaeger/cmd/collector/main.go:157\nruntime.main\n\truntime/proc.go:271"}
```
@Pushkarm029 can you please take a look at this ^ report?
👀looking into it.
Since v1.27 adaptive sampling is supported in the backend, but it only works with Cassandra as the backing store. We need to implement it for other types of stores, e.g.