Is anyone working on this? My team or I could maybe give this a shot.
As far as I know, no one is working on this.
@bzon i can join you as tester
Sure!
Happy to see someone working on this :) Might be able to join as tester
@bzon would be good if you post an architecture here, specifically how you would lay out the data, ingestion, etc. To my knowledge, clickhouse requires batched writes, and it may even be up to you to decide which node to send the writes to, so there are many questions. It may require some benchmarking to find the optimal design.
@yurishkuro at the moment, we have zero knowledge with the internals of jaeger. I think that benchmarking should be the first step to see if this integration is feasible. And with that said, the first requirement should be creating the right table schema.
I think the architecture of the https://github.com/flant/loghouse project could be a source of inspiration…
This webinar could be interesting: A Practical Introduction to Handling Log Data in ClickHouse
I took a stab at it (very early WIP):
This is the schema I used:
CREATE TABLE jaeger_index (
timestamp DateTime64(6),
traceID FixedString(16),
service LowCardinality(String),
operation LowCardinality(String),
durationUs UInt64,
tags Nested(
key LowCardinality(String),
valueString LowCardinality(String),
valueBool UInt8,
valueInt Int64,
valueFloat Float64
),
INDEX tags_strings (tags.key, tags.valueString) TYPE set(0) GRANULARITY 64,
INDEX tags_ints (tags.key, tags.valueInt) TYPE set(0) GRANULARITY 64
) ENGINE MergeTree() PARTITION BY toDate(timestamp) ORDER BY (timestamp, service, operation);
Spans themselves are stored in a separate table, keyed by traceID:
CREATE TABLE jaeger_spans (
timestamp DateTime64(6),
traceID FixedString(16),
model String
) ENGINE MergeTree() PARTITION BY toDate(timestamp) ORDER BY traceID;
You probably need Clickhouse 20.x for DateTime64; I used 20.1.11.73.
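To make the write path concrete, here is a minimal sketch of what a single-span insert into both tables could look like under this schema; the values are lifted from the sample output below, the serialized model is a placeholder, and in practice writes would be batched:
-- Hypothetical single-span write; ClickHouse prefers large batched INSERTs.
INSERT INTO jaeger_index
    (timestamp, traceID, service, operation, durationUs,
     `tags.key`, `tags.valueString`, `tags.valueBool`, `tags.valueInt`, `tags.valueFloat`)
VALUES
    ('2020-05-10 20:43:23.248314', '212e5c616f4b9c2f', 'jaeger-query', 'getTraces', 131605,
     ['jaeger.version', 'num_trace_ids'], ['Go-2.22.1', ''], [0, 0], [0, 13], [0, 0]);

-- The full span body goes into the spans table, keyed by traceID.
INSERT INTO jaeger_spans (timestamp, traceID, model)
VALUES ('2020-05-10 20:43:23.248314', '212e5c616f4b9c2f', '<serialized span: JSON or Protobuf>');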
The index table looks like this:
SELECT *
FROM jaeger_index
ARRAY JOIN tags
ORDER BY timestamp DESC
LIMIT 20
FORMAT PrettyCompactMonoBlock
┌──────────────────timestamp─┬─traceID──────────┬─service──────┬─operation────┬─durationUs─┬─tags.key─────────────┬─tags.valueString─┬─tags.valueBool─┬─tags.valueInt─┬─tags.valueFloat─┐
│ 2020-05-10 20:43:23.248314 │ 212e5c616f4b9c2f │ jaeger-query │ getTraces │ 131605 │ num_trace_ids │ │ 0 │ 13 │ 0 │
│ 2020-05-10 20:43:23.248314 │ 212e5c616f4b9c2f │ jaeger-query │ getTraces │ 131605 │ weird │ │ 1 │ 0 │ 0 │
│ 2020-05-10 20:43:23.248314 │ 212e5c616f4b9c2f │ jaeger-query │ getTraces │ 131605 │ π │ │ 0 │ 0 │ 3.14 │
│ 2020-05-10 20:43:23.248314 │ 212e5c616f4b9c2f │ jaeger-query │ getTraces │ 131605 │ internal.span.format │ proto │ 0 │ 0 │ 0 │
│ 2020-05-10 20:43:23.248314 │ 212e5c616f4b9c2f │ jaeger-query │ getTraces │ 131605 │ jaeger.version │ Go-2.22.1 │ 0 │ 0 │ 0 │
│ 2020-05-10 20:43:23.248314 │ 212e5c616f4b9c2f │ jaeger-query │ getTraces │ 131605 │ hostname │ C02TV431HV2Q │ 0 │ 0 │ 0 │
│ 2020-05-10 20:43:23.248314 │ 212e5c616f4b9c2f │ jaeger-query │ getTraces │ 131605 │ ip │ 192.168.1.43 │ 0 │ 0 │ 0 │
│ 2020-05-10 20:43:23.248314 │ 212e5c616f4b9c2f │ jaeger-query │ getTraces │ 131605 │ client-uuid │ 7fc8f98ddbcd358c │ 0 │ 0 │ 0 │
│ 2020-05-10 20:43:23.065577 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraceIDs │ 182728 │ internal.span.format │ proto │ 0 │ 0 │ 0 │
│ 2020-05-10 20:43:23.065577 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraceIDs │ 182728 │ jaeger.version │ Go-2.22.1 │ 0 │ 0 │ 0 │
│ 2020-05-10 20:43:23.065577 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraceIDs │ 182728 │ hostname │ C02TV431HV2Q │ 0 │ 0 │ 0 │
│ 2020-05-10 20:43:23.065577 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraceIDs │ 182728 │ ip │ 192.168.1.43 │ 0 │ 0 │ 0 │
│ 2020-05-10 20:43:23.065577 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraceIDs │ 182728 │ client-uuid │ 7fc8f98ddbcd358c │ 0 │ 0 │ 0 │
│ 2020-05-10 20:43:23.065574 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraces │ 314349 │ internal.span.format │ proto │ 0 │ 0 │ 0 │
│ 2020-05-10 20:43:23.065574 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraces │ 314349 │ jaeger.version │ Go-2.22.1 │ 0 │ 0 │ 0 │
│ 2020-05-10 20:43:23.065574 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraces │ 314349 │ hostname │ C02TV431HV2Q │ 0 │ 0 │ 0 │
│ 2020-05-10 20:43:23.065574 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraces │ 314349 │ ip │ 192.168.1.43 │ 0 │ 0 │ 0 │
│ 2020-05-10 20:43:23.065574 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraces │ 314349 │ client-uuid │ 7fc8f98ddbcd358c │ 0 │ 0 │ 0 │
│ 2020-05-10 20:43:23.065535 │ 212e5c616f4b9c2f │ jaeger-query │ /api/traces │ 315554 │ sampler.type │ const │ 0 │ 0 │ 0 │
│ 2020-05-10 20:43:23.065535 │ 212e5c616f4b9c2f │ jaeger-query │ /api/traces │ 315554 │ sampler.param │ │ 1 │ 0 │ 0 │
└────────────────────────────┴──────────────────┴──────────────┴──────────────┴────────────┴──────────────────────┴──────────────────┴────────────────┴───────────────┴─────────────────┘
Tags are stored in their original types, so with enough SQL-fu you can find all spans with response size between X and Y bytes, for example.
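For example, a sketch of such a range query over an integer-typed tag (the tag name http.response_size is made up for illustration):
SELECT timestamp, traceID, service, operation, durationUs
FROM jaeger_index
ARRAY JOIN tags
WHERE timestamp >= now() - INTERVAL 1 HOUR
  AND tags.key = 'http.response_size'
  AND tags.valueInt BETWEEN 1024 AND 1048576
ORDER BY timestamp DESC
LIMIT 20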
The layout of the query is different from Elasticsearch, since now you have all tags for the trace laid out in a single view. This means that if you search across all operations of some service, you will get a cross-span result, where one tag can match one span and another tag can match another span. Consider the following trace:
+ upstream (tags: {"host": "foo.bar"})
++ upstream_ttfb (tags: {"status": 200})
++ upstream_download (tags: {"error": true})
You can search for host=foo.bar status=200 across all operations and this trace will be found, even though no single span has both tags. This seems like a really nice upside.
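A rough sketch of how such a cross-span search could be expressed against this schema (not necessarily how the WIP plugin builds the query):
SELECT traceID
FROM jaeger_index
ARRAY JOIN tags
WHERE (tags.key = 'host' AND tags.valueString = 'foo.bar')
   OR (tags.key = 'status' AND tags.valueInt = 200)
GROUP BY traceID
HAVING countIf(tags.key = 'host') > 0
   AND countIf(tags.key = 'status') > 0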
There's support for both JSON and Protobuf storage. The former allows out-of-band queries, since Clickhouse supports JSON functions. The latter is much more compact.
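For instance, if model holds JSON, an out-of-band aggregation could look like this (assuming the serialized span uses the Jaeger JSON field operationName):
SELECT
    JSONExtractString(model, 'operationName') AS operation,
    count() AS spans
FROM jaeger_spans
WHERE toDate(timestamp) = today()
GROUP BY operation
ORDER BY spans DESC
LIMIT 10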
I pushed 100K spans from tracegen through this with a local Clickhouse in a Docker container with stock settings, and here's what storage looks like:
SELECT
table,
sum(marks) AS marks,
sum(rows) AS rows,
sum(bytes_on_disk) AS bytes_on_disk,
sum(data_compressed_bytes) AS data_compressed_bytes,
sum(data_uncompressed_bytes) AS data_uncompressed_bytes,
toDecimal64(data_uncompressed_bytes / data_compressed_bytes, 2) AS compression_ratio,
toDecimal64(data_compressed_bytes / rows, 2) AS compressed_bytes_per_row
FROM system.parts
WHERE table LIKE 'jaeger_%'
GROUP BY table
ORDER BY table ASC
┌─table────────┬─marks─┬───rows─┬─bytes_on_disk─┬─data_compressed_bytes─┬─data_uncompressed_bytes─┬─compression_ratio─┬─compressed_bytes_per_row─┐
│ jaeger_index │ 16 │ 106667 │ 2121539 │ 2110986 │ 22678493 │ 10.74 │ 19.79 │
│ jaeger_spans │ 20 │ 106667 │ 5634663 │ 5632817 │ 37112272 │ 6.58 │ 52.80 │
└──────────────┴───────┴────────┴───────────────┴───────────────────────┴─────────────────────────┴───────────────────┴──────────────────────────┘
SELECT
table,
column,
type,
sum(column_data_compressed_bytes) AS compressed,
sum(column_data_uncompressed_bytes) AS uncompressed,
toDecimal64(uncompressed / compressed, 2) AS compression_ratio,
sum(rows) AS rows,
toDecimal64(compressed / rows, 2) AS bytes_per_row
FROM system.parts_columns
WHERE (table LIKE 'jaeger_%') AND active
GROUP BY
table,
column,
type
ORDER BY
table ASC,
column ASC
┌─table────────┬─column───────────┬─type──────────────────────────┬─compressed─┬─uncompressed─┬─compression_ratio─┬───rows─┬─bytes_per_row─┐
│ jaeger_index │ durationUs │ UInt64 │ 248303 │ 853336 │ 3.43 │ 106667 │ 2.32 │
│ jaeger_index │ operation │ LowCardinality(String) │ 5893 │ 107267 │ 18.20 │ 106667 │ 0.05 │
│ jaeger_index │ service │ LowCardinality(String) │ 977 │ 107086 │ 109.60 │ 106667 │ 0.00 │
│ jaeger_index │ tags.key │ Array(LowCardinality(String)) │ 29727 │ 1811980 │ 60.95 │ 106667 │ 0.27 │
│ jaeger_index │ tags.valueBool │ Array(UInt8) │ 29063 │ 1810904 │ 62.30 │ 106667 │ 0.27 │
│ jaeger_index │ tags.valueFloat │ Array(Float64) │ 44762 │ 8513880 │ 190.20 │ 106667 │ 0.41 │
│ jaeger_index │ tags.valueInt │ Array(Int64) │ 284393 │ 8513880 │ 29.93 │ 106667 │ 2.66 │
│ jaeger_index │ tags.valueString │ Array(LowCardinality(String)) │ 31695 │ 1814416 │ 57.24 │ 106667 │ 0.29 │
│ jaeger_index │ timestamp │ DateTime64(6) │ 431835 │ 853336 │ 1.97 │ 106667 │ 4.04 │
│ jaeger_index │ traceID │ FixedString(16) │ 1063375 │ 1706672 │ 1.60 │ 106667 │ 9.96 │
│ jaeger_spans │ model │ String │ 4264180 │ 34552264 │ 8.10 │ 106667 │ 39.97 │
│ jaeger_spans │ timestamp │ DateTime64(6) │ 463444 │ 853336 │ 1.84 │ 106667 │ 4.34 │
│ jaeger_spans │ traceID │ FixedString(16) │ 905193 │ 1706672 │ 1.88 │ 106667 │ 8.48 │
└──────────────┴──────────────────┴───────────────────────────────┴────────────┴──────────────┴───────────────────┴────────┴───────────────┘
We have around 74B daily docs in our production Elasticsearch storage. My plan is to switch that to fields-as-tags, remove indexing of non-queried fields (logs, nested tags, references), then switch to a sorted index and then see how Clickhouse compares to that for the same spans.
@bobrik very interesting, thanks for sharing. I am curious what the performance for retrieving by trace ID would be like.
Q: why do you use LowCardinality(String) for tags? Some tags can be very high cardinality, e.g. URLs.
You can search for host=foo.bar status=200 across all operations and this trace will be found, even though no single span has both tags. This seems like a really nice upside.
I'm confused why this would be the case. Doesn't CH evaluate the query in full against each row (i.e. each span)? Or is this because how your plugin interacts with CH?
@yurishkuro even at 100k distinct values per block, LowCardinality(String) with dictionary-based encoding will do better than plain String.
@yurishkuro retrieving by trace ID is pretty fast, since you're doing a primary key lookup.
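In other words, fetching a trace reduces to a point query on the sorting key of the spans table, something like:
SELECT model
FROM jaeger_spans
WHERE traceID = '3adb641936b21d98'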
https://github.com/ClickHouse/ClickHouse/issues/4074#issuecomment-455189310 says this about LowCardinality:
Rule of thumb: it should make benefits if the number of distinct values is less than a few millions.
That said, the schema is in no way final.
I'm confused why this would be the case. Doesn't CH evaluate the query in full against each row (i.e. each span)? Or is this because how your plugin interacts with CH?
A row is not a span, it's a span-key-value combination. That's the key.
Take a look at the output of this query, which is an equivalent of what I do:
SELECT *
FROM jaeger_index
ARRAY JOIN tags
ORDER BY timestamp DESC
LIMIT 20
FORMAT PrettyCompactMonoBlock
┌──────────────────timestamp─┬─traceID──────────┬─service──────┬─operation─────┬─durationUs─┬─tags.key─────────────┬─tags.valueString─┬─tags.valueBool─┬─tags.valueInt─┬─tags.valueFloat─┐
│ 2020-05-11 04:53:54.299067 │ 3adb641936b21d98 │ jaeger-query │ getTraces │ 3414327 │ num_trace_ids │ │ 0 │ 20 │ 0 │
│ 2020-05-11 04:53:54.299067 │ 3adb641936b21d98 │ jaeger-query │ getTraces │ 3414327 │ weird │ │ 1 │ 0 │ 0 │
│ 2020-05-11 04:53:54.299067 │ 3adb641936b21d98 │ jaeger-query │ getTraces │ 3414327 │ π │ │ 0 │ 0 │ 3.14 │
│ 2020-05-11 04:53:54.299067 │ 3adb641936b21d98 │ jaeger-query │ getTraces │ 3414327 │ internal.span.format │ proto │ 0 │ 0 │ 0 │
│ 2020-05-11 04:53:54.299067 │ 3adb641936b21d98 │ jaeger-query │ getTraces │ 3414327 │ jaeger.version │ Go-2.22.1 │ 0 │ 0 │ 0 │
│ 2020-05-11 04:53:54.299067 │ 3adb641936b21d98 │ jaeger-query │ getTraces │ 3414327 │ hostname │ C02TV431HV2Q │ 0 │ 0 │ 0 │
│ 2020-05-11 04:53:54.299067 │ 3adb641936b21d98 │ jaeger-query │ getTraces │ 3414327 │ ip │ 192.168.1.43 │ 0 │ 0 │ 0 │
│ 2020-05-11 04:53:54.299067 │ 3adb641936b21d98 │ jaeger-query │ getTraces │ 3414327 │ client-uuid │ 3f9574079594605c │ 0 │ 0 │ 0 │
│ 2020-05-11 04:53:45.723921 │ 700a1bff0bdf3141 │ jaeger-query │ GetOperations │ 2268055 │ internal.span.format │ proto │ 0 │ 0 │ 0 │
│ 2020-05-11 04:53:45.723921 │ 700a1bff0bdf3141 │ jaeger-query │ GetOperations │ 2268055 │ jaeger.version │ Go-2.22.1 │ 0 │ 0 │ 0 │
└────────────────────────────┴──────────────────┴──────────────┴───────────────┴────────────┴──────────────────────┴──────────────────┴────────────────┴───────────────┴─────────────────┘
Here we joined traceID with the nested tags, repeating every tag key-value pair with the corresponding trace.
Operation getTraces does not happen multiple times in trace 3adb641936b21d98, but our query makes a separate row for it, so that cross-span tags naturally work out of the box.
Compare this to how it looks at the span level (this is what Elasticsearch works with):
SELECT *
FROM jaeger_index
WHERE (traceID = '3adb641936b21d98') AND (service = 'jaeger-query') AND (operation = 'getTraces')
┌──────────────────timestamp─┬─traceID──────────┬─service──────┬─operation─┬─durationUs─┬─tags.key────────────────────────────────────────────────────────────────────────────────────────────┬─tags.valueString────────────────────────────────────────────────────────────────┬─tags.valueBool────┬─tags.valueInt──────┬─tags.valueFloat──────┐
│ 2020-05-11 04:53:54.299067 │ 3adb641936b21d98 │ jaeger-query │ getTraces │ 3414327 │ ['num_trace_ids','weird','π','internal.span.format','jaeger.version','hostname','ip','client-uuid'] │ ['','','','proto','Go-2.22.1','C02TV431HV2Q','192.168.1.43','3f9574079594605c'] │ [0,1,0,0,0,0,0,0] │ [20,0,0,0,0,0,0,0] │ [0,0,3.14,0,0,0,0,0] │
└────────────────────────────┴──────────────────┴──────────────┴───────────┴────────────┴───────────────────────────────────────────────
Clickhouse doesn't allow direct lookups against nested fields like tags.ip=192.168.1.43; instead you have to do an ARRAY JOIN, which results in this new property.
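Concretely, the ARRAY JOIN equivalent of tags.ip=192.168.1.43 would be roughly:
SELECT timestamp, traceID, service, operation
FROM jaeger_index
ARRAY JOIN tags
WHERE tags.key = 'ip' AND tags.valueString = '192.168.1.43'
ORDER BY timestamp DESC
LIMIT 20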
I think I might be more confused about tag queries than I initially realized, so please take my explanation with a big grain of salt.
Turns out that having sparse tags with nested fields is not great for searches when you have a lot of spans, so I went back to an array of key=value strings with a bloom filter index on top.
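A sketch of what that revised layout could look like; the exact column names, index name, and bloom filter parameters here are my guesses, not necessarily what the branch uses:
CREATE TABLE jaeger_index_v2 (
    timestamp  DateTime64(6),
    traceID    FixedString(16),
    service    LowCardinality(String),
    operation  LowCardinality(String),
    durationUs UInt64,
    tags       Array(String),
    INDEX idx_tags tags TYPE bloom_filter(0.01) GRANULARITY 64
) ENGINE = MergeTree() PARTITION BY toDate(timestamp) ORDER BY (timestamp, service, operation);

-- has() over 'key=value' strings is what the bloom filter can skip granules for.
SELECT timestamp, traceID
FROM jaeger_index_v2
WHERE has(tags, 'ip=192.168.1.43')
ORDER BY timestamp DESC
LIMIT 20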
I've tried combining span storage and indexing into one table, but that proved to be very detrimental to performance of lookup by trace id. It compressed somewhat better, though.
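For context, a combined table would have looked roughly like this (my sketch, not from the branch); with the sorting key starting at timestamp, a lookup by trace ID can no longer use the primary key and degenerates into a scan:
CREATE TABLE jaeger_combined (
    timestamp  DateTime64(6),
    traceID    FixedString(16),
    service    LowCardinality(String),
    operation  LowCardinality(String),
    durationUs UInt64,
    tags       Array(String),
    model      String
) ENGINE = MergeTree() PARTITION BY toDate(timestamp) ORDER BY (timestamp, service, operation);
-- SELECT model FROM jaeger_combined WHERE traceID = '...' now scans the partition,
-- because traceID is not a prefix of the sorting key.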
If anyone wants to give it a spin, please be my guest:
It feels a lot snappier than Elasticsearch, and it's at least 2.5x more compact (4.4x if you compare to Elasticsearch behaviour out of the box).
@bobrik What is stopping a pull request? Are you planning to do more optimization?
There were some issues with ClickHouse that prevented me from getting reasonable latency for the amount of data we have.
This is the main one:
See also:
There are also quite a few local changes that I haven't pushed yet.
I'll see if I can carve out some more time for this to push the latest changes, but no promises on ETA.
Another question. Maybe it makes sense to divide the settings of the jaeger_index_v2 and jaeger_spans_v2 tables into read and write?
This will create tables on the cluster as follows:
insert -> jaeger_*_v2_buffer -> jaeger_*_v2_distributed -> jaeger_*_v2
select -> jaeger_*_v2_distributed -> jaeger_*_v2
This would take the load off Jaeger when it collects batches (no need to build large batches under high load).
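A sketch of that layering on top of the existing local tables; the cluster name, database name, and Buffer parameters below are illustrative, not taken from the plugin:
-- Distributed table fans reads and writes out to jaeger_spans_v2 on every shard.
CREATE TABLE jaeger_spans_v2_distributed ON CLUSTER my_cluster
AS jaeger_spans_v2
ENGINE = Distributed(my_cluster, jaeger, jaeger_spans_v2, rand());

-- Buffer table absorbs small inserts and flushes them downstream in batches.
CREATE TABLE jaeger_spans_v2_buffer ON CLUSTER my_cluster
AS jaeger_spans_v2
ENGINE = Buffer(jaeger, jaeger_spans_v2_distributed, 16, 10, 100, 10000, 1000000, 10000000, 100000000);
Jaeger would then write to jaeger_spans_v2_buffer and read from jaeger_spans_v2_distributed, with the same pattern repeated for jaeger_index_v2.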
It seems that all the necessary changes have been made in the latest versions of ClickHouse. @bobrik Are you still working on the plugin?
Yes and no. The code is pretty solid and it's been running on a single node Clickhouse for the last month, but I'm waiting on an internal Cluster provisioning to finish to test it more broadly with real humans making queries. My plan is to add more tests and make a PR upstream after that.
The latest code is here:
@bobrik Impressive. I can test this on my ClickHouse cluster. A few questions:
How is the performance compared to previous backend solutions you tried?
First, here's how the stock Elasticsearch backend compares to our modified one (2x replication in both cases):
$ curl -s 'https://foo.baz/_cat/indices/jaeger-span-2020-05-20?s=index&v&h=index,pri,docs.count,store.size,segments.memory,segments.fixed_bitset_memory,fielddata.memory_size,memory.total'
index pri docs.count store.size segments.memory segments.fixed_bitset_memory fielddata.memory_size memory.total
jaeger-span-2020-05-20 38 81604373583 8.9tb 17.1gb 38gb 0b 17.3gb
$ curl 'https://foo.bar/_cat/indices/jaeger-span-2020-05-20?s=index&v&h=index,pri,docs.count,store.size,segments.memory,segments.fixed_bitset_memory,fielddata.memory_size,memory.total'
index pri docs.count store.size segments.memory segments.fixed_bitset_memory fielddata.memory_size memory.total
jaeger-span-2020-05-20 38 8406585575 4.8tb 192.1mb 0b 0b 192.1mb
We don't use nested docs and sort the index, which:
- Removes an issue when shards run out of docs (2147483519 is the limit)
- Disk space usage is 2x more efficient, 312 bytes per span
- Index memory usage is down from almost 17.3GiB to just 0.2GiB
- Bitset memory usage is down from 38GiB to just 0B
As you can see, this was back in May. Now we can compare improved Elasticsearch to Clickhouse:
$ curl 'https://foo.bar/_cat/indices/jaeger-span-2020-08-28?s=index&v&h=index,pri,docs.count,store.size,segments.memory,segments.fixed_bitset_memory,fielddata.memory_size,memory.total'
index pri docs.count store.size segments.memory segments.fixed_bitset_memory fielddata.memory_size memory.total
jaeger-span-2020-08-28 38 13723317165 8.1tb 331.2mb 0b 0b 916.1mb
SELECT
table,
partition,
sum(marks) AS marks,
sum(rows) AS rows,
formatReadableSize(sum(data_compressed_bytes)) AS compressed_bytes,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_bytes,
toDecimal64(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS compression_ratio,
formatReadableSize(sum(data_compressed_bytes) / rows) AS bytes_per_row,
formatReadableSize(sum(primary_key_bytes_in_memory)) AS pk_in_memory
FROM system.parts
WHERE (table IN ('jaeger_index_v2', 'jaeger_spans_v2', 'jaeger_archive_spans_v2', '.inner.jaeger_operations_v2'))
AND active
AND partition = '2020-08-28'
GROUP BY table, partition
ORDER BY table ASC, partition ASC
┌─table───────────┬─partition──┬────marks─┬────────rows─┬─compressed_bytes─┬─uncompressed_bytes─┬─compression_ratio─┬─bytes_per_row─┬─pk_in_memory─┐
│ jaeger_index_v2 │ 2020-08-28 │ 13401691 │ 13723301235 │ 294.31 GiB │ 2.75 TiB │ 9.56 │ 23.03 B │ 115.04 MiB │
│ jaeger_spans_v2 │ 2020-08-28 │ 13401693 │ 13723301235 │ 757.75 GiB │ 4.55 TiB │ 6.14 │ 59.29 B │ 358.88 MiB │
└─────────────────┴────────────┴──────────┴─────────────┴──────────────────┴────────────────────┴───────────────────┴───────────────┴──────────────┘
- Disk usage is down from 4.0TiB to 1.0TiB, 4x (bytes per span is down from 324 to 82)
- Memory usage is roughly the same
- Search performance is roughly the same, but I'm comparing single Clickhouse node to 48 Elasticsearch nodes
- Trace lookup is instantaneous, since it's a primary key lookup
How to configure the jaeger collector to use clickhouse?
Set the SPAN_STORAGE_TYPE=clickhouse env variable and then point --clickhouse.datasource to your Clickhouse database URL, for example: tcp://localhost:9000?database=jaeger.
Will jaeger collector create the tables?
No, there are multitudes of ways to organize tables and you may want to tweak table settings yourself; that's why I only provide an example starting schema. This is also part of the reason I want to test on our production cluster rather than on a single node before I make a PR.
Please consult the docs: https://github.com/bobrik/jaeger/tree/ivan/clickhouse/plugin/storage/clickhouse#schema
Keep in mind that you need at least Clickhouse v20.7.1.4189 to have reasonable performance.
It's very promising! Have you tried using chproxy for caching requests? Or tricksterproxy, which is optimized for requests by time range…
The first and second proxy are not suitable because both use the HTTP protocol. I'm now thinking about starting a collector in front of each ClickHouse node, and defining a list of all collectors on each agent at deployment time. Otherwise, we would have to do TCP balancing.
The first and second proxy are not suitable because both use the HTTP protocol.
Ohh, your backend uses the native TCP protocol of ClickHouse! Yes, you use https://github.com/ClickHouse/clickhouse-go, which uses the native protocol!
@bobrik hi, I don't understand what the model String CODEC(ZSTD(3)) field in the jaeger_spans_v2 table means: https://github.com/bobrik/jaeger/tree/ivan/clickhouse/plugin/storage/clickhouse#schema
I have another question: how are the span tag values stored?
We have been testing the ClickHouse plugin for almost 2 months. So far, the load on each cluster is not very large, ~1 billion spans per day; for a configuration like the one in the example, it is not noticeable at all. Maybe someone else will be interested in testing Jaeger with ClickHouse: https://github.com/levonet/docker-jaeger
I really want the Clickhouse plugin to finally be released in the Jaeger.
It is unlikely that we'll add any other storage mechanism to the core repository in the near future. What will probably happen is that we'll change the architecture of how this works for Jaeger v2, making it easier to plug your own storage plugin.
@jpkrohling How long to wait for Jaeger v2? And will a separate organization or repository be created for the community?
Jaeger v2 is at least a few months away, in the best case scenario.
And will a separate organization or repository be created for the community?
Not sure I understand. Do you mean for the externally contributed storage mechanisms? That's a great question, although a bit early to ask :-)
I would be -1 on creating such a repo under Jaeger org, because it creates an expectation that project maintainers are responsible for it. There's nothing wrong with having a plugin repo elsewhere, under the org of the real authors/maintainers, as is the case with InfluxDB and Logz.io plugins. If such repo is well-maintained, we can link to it from the official docs.
Good to know about the v2 plans with regard to plugins. I'll revisit a more official place for the Clickhouse plugin when v2 is ready, but for now, if anyone wants to use the current implementation, feel free to apply patches from my branch:
We're using it internally at Cloudflare.
@bobrik to clarify, you implemented CH storage provider as a built-in component in a fork, not as gRPC plugin that can be attached at runtime, correct?
@yurishkuro correct.
@bobrik Did you get to run this on a real CH cluster? I would imagine that with sharding on a CH cluster, search and ingestion performance would be improved.
Yes, we run it on a 3 node cluster with 3x replication these days.
So how does this compare to the equivalent Elasticsearch setup, in terms of the infrastructure needed for the backends, ClickHouse vs Elasticsearch? For example, you previously mentioned that search performance is roughly the same, but that you were comparing a single Clickhouse node to 48 Elasticsearch nodes. I would like to clarify: as per your statement, does 1 CH server perform like 48 ES nodes in terms of the total infrastructure needed to meet the storage, search, and ingestion aspects of performance?
In addition, do you mind sharing the total user base of ES vs CH?
Any additional info you can supply is greatly appreciated!
In terms of cluster size: ES was 48 nodes, ClickHouse is 3 nodes. Same amount of spans ingested in each: ~250K spans/s.
Same queries and the same user base for both, since both serve the same UI endpoint (ES served it before, ClickHouse does it now).
Tracing wasn't the only thing in the ES cluster, but it was by far the biggest and occupied more than half of resources.
@bobrik Are you still working on this issue? I found some issues with your forked version, and I have some commits to fix them. Should I submit a PR to you?
@bobrik Very nice, much more efficient, but there isn't really a scale-out mechanism for ClickHouse, which is a big concern for us. You also wouldn't have Kibana to query it, which is pretty useful for dashboarding and analytics. Here is a dashboard we use from Kibana (this is demo data).
Maybe there is interest in building a storage plugin within Jaeger v2 to make this more maintainable?
@jkowall why do you think ClickHouse doesn't have scale-out? Near-linear scalability is available in ClickHouse itself by design.
Just add a shard with hosts to <remote_servers> in /etc/clickhouse-server/config.d/remote_servers.xml and run CREATE TABLE ... Engine=Distributed(cluster_name, db.table, sharding_key_expression) ON CLUSTER ... ;
After that you can insert into any host and the data will be spread via the sharding key, or you can use chproxy / haproxy to ingest data directly into the MergeTree table.
You can also read from all servers, with smarter aggregation, through the Distributed table.
@Slach yes I saw that, but if you have hundreds of nodes it can be an issue I would assume. We run thousands of nodes of ES. Bigger issue would be supporting ClickHouse in Kibana, but that's for another project :)
@mcspring I'm not actively working on it at the moment, since it doesn't require any maintenance from me. Feel free to submit a PR and I'll get to it eventually.
@jkowall try to use https://github.com/Altinity/clickhouse-operator/
It is unlikely that we'll add any other storage mechanism to the core repository in the near future. What will probably happen is that we'll change the architecture of how this works for Jaeger v2, making it easier to plug your own storage plugin.
@jpkrohling is this still the case? https://github.com/jaegertracing/jaeger/issues/638 is still open. Are you saying this should really be closed because a completely different approach is being pursued?
making it easier to plug your own storage plugin.
I tried finding an issue for this, but could not find it. Is there an issue for this? Thanks!
@otisg don't know about the issue, but the approach is that we're rebuilding the Jaeger backend on top of the OpenTelemetry collector, which has a mechanism for combining various packages (even from different repos) at build time without explicit code dependencies in the core. That means storage implementations can directly implement the storage interface and don't have to go through the much less efficient grpc-plugin interface.
I don't think we have an issue tracking this specific feature, but we know it will be there for v2, which is being tracked here: https://github.com/jaegertracing/jaeger/milestone/15
Based on personal experience, I would like to recommend adding ClickHouse as one of the core storage options for Jaeger (like Elasticsearch and Cassandra); if not, we should probably port Ivan's work to a gRPC plugin.
There are a few benefits I can think of:
Uber's migration of their logging pipeline from ES to ClickHouse [0] is a very good example of ClickHouse performance.
I also like the idea of adding the plugin as one of the core storage options for Jaeger. This would allow organically adding ClickHouse to the existing jaeger-operator and helm-charts.
We have been using Ivan's plugin for half a year. Using ClickHouse as storage is almost invisible in terms of infrastructure resources. Docker images with an updated version of Jaeger and a startup example can be found at https://github.com/levonet/docker-jaeger if anyone is interested.
I have migrated https://github.com/bobrik/jaeger/tree/ivan/clickhouse/plugin/storage/clickhouse to storage plugin. The source code is available here https://github.com/pavolloffay/jaeger-clickhouse.
For now I am not planning on maintaining it, I have built it for our internal purposes. However if there is interest and somebody would like to help to maintain it I am happy to chat about it.
The repository contains instructions on how to run it, and the code was tested locally (see the readme). When I run the code on more data I get Too many simultaneous queries. Maximum: 100, see https://github.com/pavolloffay/jaeger-clickhouse/issues/2. It might just be a DB configuration issue.
ClickHouse, an open-source column-oriented DBMS designed initially for real-time analytics and mostly write-once/read-many big-data use cases, can be used as a very efficient log and trace storage.
Meta issue: #638 Additional storage backends