jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0

ClickHouse as a storage backend #1438

Closed sboisson closed 3 years ago

sboisson commented 5 years ago

ClickHouse, an open-source column-oriented DBMS initially designed for real-time analytics and mostly write-once/read-many big-data use cases, can be used as very efficient log and trace storage.

Meta issue: #638 Additional storage backends

bzon commented 5 years ago

Is anyone working on this? I or my team can maybe give a shot at this.

Slach commented 5 years ago

As far as I know, no one is working on this.

Slach commented 5 years ago

@bzon I can join you as a tester

bzon commented 5 years ago

@bzon I can join you as a tester

Sure!

sboisson commented 5 years ago

Happy to see someone working on this :) Might be able to join as tester

yurishkuro commented 5 years ago

@bzon it would be good if you posted an architecture here, specifically how you would lay out the data, handle ingestion, etc. To my knowledge, clickhouse requires batched writes, and it may even be up to you to decide which node to send the writes to, so there are many questions. It may require some benchmarking to find the optimal design.

bzon commented 5 years ago

@yurishkuro at the moment, we have zero knowledge of the internals of jaeger. I think benchmarking should be the first step to see if this integration is feasible. With that said, the first requirement should be creating the right table schema.

sboisson commented 5 years ago

I think the architecture of the project https://github.com/flant/loghouse could be a source of inspiration…

sboisson commented 4 years ago

This webinar could be interesting: A Practical Introduction to Handling Log Data in ClickHouse

bobrik commented 4 years ago

I took a stab at it (very early WIP):

This is the schema I used:

-- Search index: one row per span; tags stored as a Nested column of typed values
CREATE TABLE jaeger_index (
  timestamp DateTime64(6),
  traceID FixedString(16),
  service LowCardinality(String),
  operation LowCardinality(String),
  durationUs UInt64,
  tags Nested(
    key LowCardinality(String),
    valueString LowCardinality(String),
    valueBool UInt8,
    valueInt Int64,
    valueFloat Float64
  ),
  INDEX tags_strings (tags.key, tags.valueString) TYPE set(0) GRANULARITY 64,
  INDEX tags_ints (tags.key, tags.valueInt) TYPE set(0) GRANULARITY 64
) ENGINE MergeTree() PARTITION BY toDate(timestamp) ORDER BY (timestamp, service, operation);
-- Span storage keyed by trace ID; model holds the serialized span (JSON or Protobuf)
CREATE TABLE jaeger_spans (
  timestamp DateTime64(6),
  traceID FixedString(16),
  model String
) ENGINE MergeTree() PARTITION BY toDate(timestamp) ORDER BY traceID;

You probably need Clickhouse 20.x for DateTime64, I used 20.1.11.73.
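
Ingestion is expected to be batched (as noted above, ClickHouse wants large writes), so a plugin would buffer spans and flush them in bulk. A minimal sketch of what such a flush could look like against this schema, with made-up values:

INSERT INTO jaeger_spans (timestamp, traceID, model) VALUES
  ('2020-05-10 20:43:23.248314', '212e5c616f4b9c2f', '<serialized span>'),
  ('2020-05-10 20:43:23.065577', '212e5c616f4b9c2f', '<serialized span>');

-- Nested columns are inserted as parallel arrays, one element per tag.
INSERT INTO jaeger_index (timestamp, traceID, service, operation, durationUs,
                          `tags.key`, `tags.valueString`, `tags.valueBool`, `tags.valueInt`, `tags.valueFloat`) VALUES
  ('2020-05-10 20:43:23.248314', '212e5c616f4b9c2f', 'jaeger-query', 'getTraces', 131605,
   ['jaeger.version', 'num_trace_ids'], ['Go-2.22.1', ''], [0, 0], [0, 13], [0, 0]);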

Index table looks like this:

SELECT *
FROM jaeger_index
ARRAY JOIN tags
ORDER BY timestamp DESC
LIMIT 20
FORMAT PrettyCompactMonoBlock
┌──────────────────timestamp─┬─traceID──────────┬─service──────┬─operation────┬─durationUs─┬─tags.key─────────────┬─tags.valueString─┬─tags.valueBool─┬─tags.valueInt─┬─tags.valueFloat─┐
│ 2020-05-10 20:43:23.248314 │ 212e5c616f4b9c2f │ jaeger-query │ getTraces    │     131605 │ num_trace_ids        │                  │              0 │            13 │               0 │
│ 2020-05-10 20:43:23.248314 │ 212e5c616f4b9c2f │ jaeger-query │ getTraces    │     131605 │ weird                │                  │              1 │             0 │               0 │
│ 2020-05-10 20:43:23.248314 │ 212e5c616f4b9c2f │ jaeger-query │ getTraces    │     131605 │ π                    │                  │              0 │             0 │            3.14 │
│ 2020-05-10 20:43:23.248314 │ 212e5c616f4b9c2f │ jaeger-query │ getTraces    │     131605 │ internal.span.format │ proto            │              0 │             0 │               0 │
│ 2020-05-10 20:43:23.248314 │ 212e5c616f4b9c2f │ jaeger-query │ getTraces    │     131605 │ jaeger.version       │ Go-2.22.1        │              0 │             0 │               0 │
│ 2020-05-10 20:43:23.248314 │ 212e5c616f4b9c2f │ jaeger-query │ getTraces    │     131605 │ hostname             │ C02TV431HV2Q     │              0 │             0 │               0 │
│ 2020-05-10 20:43:23.248314 │ 212e5c616f4b9c2f │ jaeger-query │ getTraces    │     131605 │ ip                   │ 192.168.1.43     │              0 │             0 │               0 │
│ 2020-05-10 20:43:23.248314 │ 212e5c616f4b9c2f │ jaeger-query │ getTraces    │     131605 │ client-uuid          │ 7fc8f98ddbcd358c │              0 │             0 │               0 │
│ 2020-05-10 20:43:23.065577 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraceIDs │     182728 │ internal.span.format │ proto            │              0 │             0 │               0 │
│ 2020-05-10 20:43:23.065577 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraceIDs │     182728 │ jaeger.version       │ Go-2.22.1        │              0 │             0 │               0 │
│ 2020-05-10 20:43:23.065577 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraceIDs │     182728 │ hostname             │ C02TV431HV2Q     │              0 │             0 │               0 │
│ 2020-05-10 20:43:23.065577 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraceIDs │     182728 │ ip                   │ 192.168.1.43     │              0 │             0 │               0 │
│ 2020-05-10 20:43:23.065577 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraceIDs │     182728 │ client-uuid          │ 7fc8f98ddbcd358c │              0 │             0 │               0 │
│ 2020-05-10 20:43:23.065574 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraces   │     314349 │ internal.span.format │ proto            │              0 │             0 │               0 │
│ 2020-05-10 20:43:23.065574 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraces   │     314349 │ jaeger.version       │ Go-2.22.1        │              0 │             0 │               0 │
│ 2020-05-10 20:43:23.065574 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraces   │     314349 │ hostname             │ C02TV431HV2Q     │              0 │             0 │               0 │
│ 2020-05-10 20:43:23.065574 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraces   │     314349 │ ip                   │ 192.168.1.43     │              0 │             0 │               0 │
│ 2020-05-10 20:43:23.065574 │ 212e5c616f4b9c2f │ jaeger-query │ FindTraces   │     314349 │ client-uuid          │ 7fc8f98ddbcd358c │              0 │             0 │               0 │
│ 2020-05-10 20:43:23.065535 │ 212e5c616f4b9c2f │ jaeger-query │ /api/traces  │     315554 │ sampler.type         │ const            │              0 │             0 │               0 │
│ 2020-05-10 20:43:23.065535 │ 212e5c616f4b9c2f │ jaeger-query │ /api/traces  │     315554 │ sampler.param        │                  │              1 │             0 │               0 │
└────────────────────────────┴──────────────────┴──────────────┴──────────────┴────────────┴──────────────────────┴──────────────────┴────────────────┴───────────────┴─────────────────┘

Tags are stored in their original types, so with enough SQL-fu you can find all spans with response size between X and Y bytes, for example.
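
For example, a sketch of such a query against the schema above (the tag name http.response_size is purely illustrative):

-- Find traces whose spans carry an integer tag within a given range.
SELECT DISTINCT traceID
FROM jaeger_index
ARRAY JOIN tags
WHERE tags.key = 'http.response_size'
  AND tags.valueInt BETWEEN 1024 AND 65536
  AND timestamp >= now() - INTERVAL 1 HOUR;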

The layout of the query is different from Elasticsearch, since now you have all tags for the trace laid out in a single view. This means that if you search for all operations of some service, you get a cross-span result where one tag can match one span and another tag can match another span. Consider the following trace:

+ upstream (tags: {"host": "foo.bar"})
++ upstream_ttfb (tags: {"status": 200})
++ upstream_download (tags: {"error": true})

You can search for host=foo.bar status=200 across all operations and this trace will be found, even though no single span has both tags. This seems like a really nice upside.
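
A sketch of how that search could be expressed against the schema above (the tag names simply mirror the example trace; each condition may be satisfied by a different span of the same trace):

-- Trace-level match: count how many of the requested tag conditions
-- are satisfied anywhere in the trace, not necessarily in the same span.
SELECT traceID
FROM jaeger_index
ARRAY JOIN tags
WHERE (tags.key = 'host' AND tags.valueString = 'foo.bar')
   OR (tags.key = 'status' AND tags.valueInt = 200)
GROUP BY traceID
HAVING uniqExact(tags.key) = 2;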

There's support for both JSON and Protobuf storage. The former allows out-of-band queries, since Clickhouse supports JSON functions. The latter is much more compact.
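
For example, with JSON storage something like the following becomes possible (the operationName path is an assumption about how the span model is serialized):

-- Out-of-band query over JSON-encoded spans, without going through Jaeger.
SELECT
    JSONExtractString(model, 'operationName') AS operation,
    count() AS spans
FROM jaeger_spans
WHERE toDate(timestamp) = today()
GROUP BY operation
ORDER BY spans DESC
LIMIT 10;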

I pushed 100K spans from tracegen through this with a local Clickhouse in a Docker container with stock settings, and here's what storage looks like:

SELECT
    table,
    sum(marks) AS marks,
    sum(rows) AS rows,
    sum(bytes_on_disk) AS bytes_on_disk,
    sum(data_compressed_bytes) AS data_compressed_bytes,
    sum(data_uncompressed_bytes) AS data_uncompressed_bytes,
    toDecimal64(data_uncompressed_bytes / data_compressed_bytes, 2) AS compression_ratio,
    toDecimal64(data_compressed_bytes / rows, 2) AS compressed_bytes_per_row
FROM system.parts
WHERE table LIKE 'jaeger_%'
GROUP BY table
ORDER BY table ASC

┌─table────────┬─marks─┬───rows─┬─bytes_on_disk─┬─data_compressed_bytes─┬─data_uncompressed_bytes─┬─compression_ratio─┬─compressed_bytes_per_row─┐
│ jaeger_index │    16 │ 106667 │       2121539 │               2110986 │                22678493 │             10.74 │                    19.79 │
│ jaeger_spans │    20 │ 106667 │       5634663 │               5632817 │                37112272 │              6.58 │                    52.80 │
└──────────────┴───────┴────────┴───────────────┴───────────────────────┴─────────────────────────┴───────────────────┴──────────────────────────┘
SELECT
    table,
    column,
    type,
    sum(column_data_compressed_bytes) AS compressed,
    sum(column_data_uncompressed_bytes) AS uncompressed,
    toDecimal64(uncompressed / compressed, 2) AS compression_ratio,
    sum(rows) AS rows,
    toDecimal64(compressed / rows, 2) AS bytes_per_row
FROM system.parts_columns
WHERE (table LIKE 'jaeger_%') AND active
GROUP BY
    table,
    column,
    type
ORDER BY
    table ASC,
    column ASC
┌─table────────┬─column───────────┬─type──────────────────────────┬─compressed─┬─uncompressed─┬─compression_ratio─┬───rows─┬─bytes_per_row─┐
│ jaeger_index │ durationUs       │ UInt64                        │     248303 │       853336 │              3.43 │ 106667 │          2.32 │
│ jaeger_index │ operation        │ LowCardinality(String)        │       5893 │       107267 │             18.20 │ 106667 │          0.05 │
│ jaeger_index │ service          │ LowCardinality(String)        │        977 │       107086 │            109.60 │ 106667 │          0.00 │
│ jaeger_index │ tags.key         │ Array(LowCardinality(String)) │      29727 │      1811980 │             60.95 │ 106667 │          0.27 │
│ jaeger_index │ tags.valueBool   │ Array(UInt8)                  │      29063 │      1810904 │             62.30 │ 106667 │          0.27 │
│ jaeger_index │ tags.valueFloat  │ Array(Float64)                │      44762 │      8513880 │            190.20 │ 106667 │          0.41 │
│ jaeger_index │ tags.valueInt    │ Array(Int64)                  │     284393 │      8513880 │             29.93 │ 106667 │          2.66 │
│ jaeger_index │ tags.valueString │ Array(LowCardinality(String)) │      31695 │      1814416 │             57.24 │ 106667 │          0.29 │
│ jaeger_index │ timestamp        │ DateTime64(6)                 │     431835 │       853336 │              1.97 │ 106667 │          4.04 │
│ jaeger_index │ traceID          │ FixedString(16)               │    1063375 │      1706672 │              1.60 │ 106667 │          9.96 │
│ jaeger_spans │ model            │ String                        │    4264180 │     34552264 │              8.10 │ 106667 │         39.97 │
│ jaeger_spans │ timestamp        │ DateTime64(6)                 │     463444 │       853336 │              1.84 │ 106667 │          4.34 │
│ jaeger_spans │ traceID          │ FixedString(16)               │     905193 │      1706672 │              1.88 │ 106667 │          8.48 │
└──────────────┴──────────────────┴───────────────────────────────┴────────────┴──────────────┴───────────────────┴────────┴───────────────┘

We have around 74B daily docs in our production Elasticsearch storage. My plan is to switch that to fields-as-tags, remove indexing of non-queried fields (logs, nested tags, references), then switch to a sorted index, and finally see how Clickhouse compares for the same spans.

yurishkuro commented 4 years ago

@bobrik very interesting, thanks for sharing. I am curious what the performance for retrieving by trace ID would be like.

Q: why do you use LowCardinality(String) for tags? Some tags can be very high cardinality, e.g. URLs.

You can search for host=foo.bar status=200 across all operations and this trace will be found, even though no single span has both tags. This seems like a really nice upside.

I'm confused why this would be the case. Doesn't CH evaluate the query in full against each row (i.e. each span)? Or is this because how your plugin interacts with CH?

Slach commented 4 years ago

@yurishkuro even at 100k distinct values per block, LowCardinality(String) with dictionary-based encoding will perform better than plain String.

bobrik commented 4 years ago

@yurishkuro retrieving by trace ID is pretty fast, since you're doing a primary key lookup.
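
E.g. against the jaeger_spans table from the schema above:

-- traceID is the ORDER BY key of jaeger_spans, so this is an indexed lookup.
SELECT model
FROM jaeger_spans
WHERE traceID = '3adb641936b21d98';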

https://github.com/ClickHouse/ClickHouse/issues/4074#issuecomment-455189310 says this about LowCardinality:

Rule of thumb: it should make benefits if the number of distinct values is less that few millions.

That said, the schema is in no way final.

I'm confused why this would be the case. Doesn't CH evaluate the query in full against each row (i.e. each span)? Or is this because how your plugin interacts with CH?

A row is not a span, it's a span/key/value combination. That's the key.

Take a look at the output of this query, which is an equivalent of what I do:

SELECT *
FROM jaeger_index
ARRAY JOIN tags
ORDER BY timestamp DESC
LIMIT 20
FORMAT PrettyCompactMonoBlock
┌──────────────────timestamp─┬─traceID──────────┬─service──────┬─operation─────┬─durationUs─┬─tags.key─────────────┬─tags.valueString─┬─tags.valueBool─┬─tags.valueInt─┬─tags.valueFloat─┐
│ 2020-05-11 04:53:54.299067 │ 3adb641936b21d98 │ jaeger-query │ getTraces     │    3414327 │ num_trace_ids        │                  │              0 │            20 │               0 │
│ 2020-05-11 04:53:54.299067 │ 3adb641936b21d98 │ jaeger-query │ getTraces     │    3414327 │ weird                │                  │              1 │             0 │               0 │
│ 2020-05-11 04:53:54.299067 │ 3adb641936b21d98 │ jaeger-query │ getTraces     │    3414327 │ π                    │                  │              0 │             0 │            3.14 │
│ 2020-05-11 04:53:54.299067 │ 3adb641936b21d98 │ jaeger-query │ getTraces     │    3414327 │ internal.span.format │ proto            │              0 │             0 │               0 │
│ 2020-05-11 04:53:54.299067 │ 3adb641936b21d98 │ jaeger-query │ getTraces     │    3414327 │ jaeger.version       │ Go-2.22.1        │              0 │             0 │               0 │
│ 2020-05-11 04:53:54.299067 │ 3adb641936b21d98 │ jaeger-query │ getTraces     │    3414327 │ hostname             │ C02TV431HV2Q     │              0 │             0 │               0 │
│ 2020-05-11 04:53:54.299067 │ 3adb641936b21d98 │ jaeger-query │ getTraces     │    3414327 │ ip                   │ 192.168.1.43     │              0 │             0 │               0 │
│ 2020-05-11 04:53:54.299067 │ 3adb641936b21d98 │ jaeger-query │ getTraces     │    3414327 │ client-uuid          │ 3f9574079594605c │              0 │             0 │               0 │
│ 2020-05-11 04:53:45.723921 │ 700a1bff0bdf3141 │ jaeger-query │ GetOperations │    2268055 │ internal.span.format │ proto            │              0 │             0 │               0 │
│ 2020-05-11 04:53:45.723921 │ 700a1bff0bdf3141 │ jaeger-query │ GetOperations │    2268055 │ jaeger.version       │ Go-2.22.1        │              0 │             0 │               0 │
└────────────────────────────┴──────────────────┴──────────────┴───────────────┴────────────┴──────────────────────┴──────────────────┴────────────────┴───────────────┴─────────────────┘

Here we join traceID with the nested tags, repeating every tag key/value with the corresponding trace.

The getTraces operation does not occur multiple times in trace 3adb641936b21d98, but our query produces a separate row for each of its tags, so cross-span tags naturally work out of the box.

Compare this to how it looks at the span level (this is what Elasticsearch works with):

SELECT *
FROM jaeger_index
WHERE (traceID = '3adb641936b21d98') AND (service = 'jaeger-query') AND (operation = 'getTraces')
┌──────────────────timestamp─┬─traceID──────────┬─service──────┬─operation─┬─durationUs─┬─tags.key────────────────────────────────────────────────────────────────────────────────────────────┬─tags.valueString────────────────────────────────────────────────────────────────┬─tags.valueBool────┬─tags.valueInt──────┬─tags.valueFloat──────┐
│ 2020-05-11 04:53:54.299067 │ 3adb641936b21d98 │ jaeger-query │ getTraces │    3414327 │ ['num_trace_ids','weird','π','internal.span.format','jaeger.version','hostname','ip','client-uuid'] │ ['','','','proto','Go-2.22.1','C02TV431HV2Q','192.168.1.43','3f9574079594605c'] │ [0,1,0,0,0,0,0,0] │ [20,0,0,0,0,0,0,0] │ [0,0,3.14,0,0,0,0,0] │
└────────────────────────────┴──────────────────┴──────────────┴───────────┴────────────┴───────────────────────────────────────────────

Clickhouse doesn't allow direct lookups against nested fields like tags.ip=192.168.1.43, instead you have to do an ARRAY JOIN, which results in this new property.
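
For example, an exact tag lookup has to go through the joined rows, roughly like this:

-- No direct filtering on the Nested column; match key and value on the joined rows.
SELECT DISTINCT traceID
FROM jaeger_index
ARRAY JOIN tags
WHERE tags.key = 'ip' AND tags.valueString = '192.168.1.43';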

bobrik commented 4 years ago

I think I might be more confused about tag queries than I initially realized, please take my explanation with a big grain of salt.

bobrik commented 4 years ago

Turns out that having sparse tags with nested fields is not great for searches when you have a lot of spans, so I went back to an array of key=value strings with a bloom filter index on top.
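
A schema along those lines might look roughly like this (a sketch only, not the exact schema from the branch; the index parameters are guesses):

-- Flattened tags: 'key=value' strings in a plain array,
-- with a bloom filter skip index for fast membership checks.
CREATE TABLE jaeger_index_v2 (
  timestamp  DateTime64(6),
  traceID    FixedString(16),
  service    LowCardinality(String),
  operation  LowCardinality(String),
  durationUs UInt64,
  tags       Array(String),
  INDEX idx_tags tags TYPE bloom_filter(0.01) GRANULARITY 64
) ENGINE MergeTree() PARTITION BY toDate(timestamp) ORDER BY (timestamp, service, operation);

Lookups then become has(tags, 'host=foo.bar'), which the bloom filter index can use to skip granules.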

I've tried combining span storage and indexing into one table, but that proved to be very detrimental to the performance of lookups by trace ID. It compressed somewhat better, though.

If anyone wants to give it a spin, please be my guest:

It feels a lot snappier than Elasticsearch, and it's at least 2.5x more compact (4.4x if you compare to Elasticsearch behaviour out of the box).

levonet commented 4 years ago

@bobrik What is stopping a pull request? Are you planning to do more optimization?

bobrik commented 4 years ago

There were some issues with ClickHouse that prevented me from getting reasonable latency for the amount of data we have.

This is the main one:

See also:

There are also quite a few local changes that I haven't pushed yet.

I'll see if I can carve out some more time for this to push the latest changes, but no promises on ETA.

levonet commented 4 years ago

Another question.

Maybe it makes sense to split the jaeger_index_v2 and jaeger_spans_v2 tables into separate read and write tables? This would create tables on the cluster as follows:

This would take the load off Jaeger when it collects batches (no need to build large batches under high load).
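
One way to read this suggestion is to put a Buffer table in front of the MergeTree table on the write path, so the collector can send small inserts and ClickHouse does the batching; a rough sketch (table names and flush thresholds are illustrative):

-- Writers insert into jaeger_spans_v2_write; ClickHouse buffers rows in memory
-- and flushes them to the underlying jaeger_spans_v2 table in larger batches.
CREATE TABLE jaeger_spans_v2_write AS jaeger_spans_v2
ENGINE = Buffer(currentDatabase(), jaeger_spans_v2, 16, 10, 100, 10000, 1000000, 10000000, 100000000);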

levonet commented 4 years ago

It seems that all the necessary changes have been made in the latest versions of ClickHouse. @bobrik Are you still working on the plugin?

bobrik commented 4 years ago

Yes and no. The code is pretty solid and it's been running on a single-node Clickhouse for the last month, but I'm waiting on internal cluster provisioning to finish so I can test it more broadly with real humans making queries. My plan is to add more tests and make a PR upstream after that.

The latest code here:

bzon commented 4 years ago

@bobrik Impressive. I can test this on my clickhouse cluster. A few questions:

bobrik commented 4 years ago

How is the performance compared to previous backend solutions you tried?

First, here's how stock Elasticsearch backend compares to our modified one (2x replication in both cases):

$ curl -s 'https://foo.baz/_cat/indices/jaeger-span-2020-05-20?s=index&v&h=index,pri,docs.count,store.size,segments.memory,segments.fixed_bitset_memory,fielddata.memory_size,memory.total'
index                  pri  docs.count store.size segments.memory segments.fixed_bitset_memory fielddata.memory_size memory.total
jaeger-span-2020-05-20  38 81604373583      8.9tb          17.1gb                         38gb                    0b       17.3gb

$ curl 'https://foo.bar/_cat/indices/jaeger-span-2020-05-20?s=index&v&h=index,pri,docs.count,store.size,segments.memory,segments.fixed_bitset_memory,fielddata.memory_size,memory.total'
index                  pri docs.count store.size segments.memory segments.fixed_bitset_memory fielddata.memory_size memory.total
jaeger-span-2020-05-20  38  8406585575      4.8tb         192.1mb                           0b                    0b      192.1mb

We don't use nested docs and we sort the index, which:

  • Removes an issue when shards run out of docs (2147483519 is the limit)
  • Disk space usage is 2x more efficient, 312 bytes per span
  • Index memory usage is down from almost 17.3GiB to just 0.2GiB
  • Bitset memory usage is down from 38GiB to just 0B

As you can see, this was back in May. Now we can compare the improved Elasticsearch to Clickhouse:

$ curl 'https://foo.bar/_cat/indices/jaeger-span-2020-08-28?s=index&v&h=index,pri,docs.count,store.size,segments.memory,segments.fixed_bitset_memory,fielddata.memory_size,memory.total'
index                  pri  docs.count store.size segments.memory segments.fixed_bitset_memory fielddata.memory_size memory.total
jaeger-span-2020-08-28  38 13723317165      8.1tb         331.2mb                           0b                    0b      916.1mb
SELECT
       table,
       partition,
       sum(marks) AS marks,
       sum(rows) AS rows,
       formatReadableSize(sum(data_compressed_bytes)) AS compressed_bytes,
       formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_bytes,
       toDecimal64(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS compression_ratio,
       formatReadableSize(sum(data_compressed_bytes) / rows) AS bytes_per_row,
       formatReadableSize(sum(primary_key_bytes_in_memory)) AS pk_in_memory
  FROM system.parts
 WHERE (table IN ('jaeger_index_v2', 'jaeger_spans_v2', 'jaeger_archive_spans_v2', '.inner.jaeger_operations_v2'))
   AND active
   AND partition = '2020-08-28'
 GROUP BY table, partition
 ORDER BY table ASC, partition ASC
┌─table───────────┬─partition──┬────marks─┬────────rows─┬─compressed_bytes─┬─uncompressed_bytes─┬─compression_ratio─┬─bytes_per_row─┬─pk_in_memory─┐
│ jaeger_index_v2 │ 2020-08-28 │ 13401691 │ 13723301235 │ 294.31 GiB       │ 2.75 TiB           │              9.56 │ 23.03 B       │ 115.04 MiB   │
│ jaeger_spans_v2 │ 2020-08-28 │ 13401693 │ 13723301235 │ 757.75 GiB       │ 4.55 TiB           │              6.14 │ 59.29 B       │ 358.88 MiB   │
└─────────────────┴────────────┴──────────┴─────────────┴──────────────────┴────────────────────┴───────────────────┴───────────────┴──────────────┘

  • Disk usage is down from 4.0TiB to 1.0TiB, 4x (bytes per span is down from 324 to 82)
  • Memory usage is roughly the same
  • Search performance is roughly the same, but I'm comparing a single Clickhouse node to 48 Elasticsearch nodes
  • Trace lookup is instantaneous, since it's a primary key lookup

How to configure the jaeger collector to use clickhouse?

Set the SPAN_STORAGE_TYPE=clickhouse env variable and then set --clickhouse.datasource to point to your Clickhouse database URL, for example: tcp://localhost:9000?database=jaeger.

Will jaeger collector create the tables?

No, there is a multitude of ways to organize tables and you may want to tweak table settings yourself, which is why I only provide an example starting schema. This is also part of the reason I want to test on our production cluster rather than on a single node before I make a PR.

Please consult the docs: https://github.com/bobrik/jaeger/tree/ivan/clickhouse/plugin/storage/clickhouse#schema

Keep in mind that you need at least Clickhouse v20.7.1.4189 to have reasonable performance.

mcarbonneaux commented 4 years ago

It's very promising! Have you tried using chproxy for caching requests? Or tricksterproxy, which is optimized for requests by time range?

levonet commented 4 years ago

Neither proxy is suitable because both use the HTTP protocol. I'm now thinking about starting a collector in front of each Clickhouse node, and defining on each agent a list of all collectors at deployment time. Otherwise, we would have to do TCP balancing.

mcarbonneaux commented 4 years ago

Neither proxy is suitable because both use the HTTP protocol.

Ahh, your backend uses the native TCP protocol of Clickhouse!

Yes, you use https://github.com/ClickHouse/clickhouse-go, which uses the native protocol!

yew1eb commented 4 years ago

@bobrik hi, I don't understand what the field model String CODEC(ZSTD(3)) means in the jaeger_spans_v2 table. https://github.com/bobrik/jaeger/tree/ivan/clickhouse/plugin/storage/clickhouse#schema I have another question: how are the span tag values stored?

levonet commented 3 years ago

We have been testing the Clickhouse plugin for almost 2 months. So far, the load on each cluster is not very large, ~1 billion spans per day. For a configuration like the one in the example, it is not noticeable at all. Maybe someone else will be interested in testing Jaeger with Clickhouse: https://github.com/levonet/docker-jaeger

I really want the Clickhouse plugin to finally be released in Jaeger.

jpkrohling commented 3 years ago

I really want the Clickhouse plugin to finally be released in Jaeger.

It is unlikely that we'll add any other storage mechanism to the core repository in the near future. What will probably happen is that we'll change the architecture of how this works for Jaeger v2, making it easier to plug your own storage plugin.

levonet commented 3 years ago

@jpkrohling How long until Jaeger v2? And will a separate organization or repository be created for the community?

jpkrohling commented 3 years ago

Jaeger v2 is at least a few months away, in the best case scenario.

And will a separate organization or repository be created for the community?

Not sure I understand. Do you mean for the externally contributed storage mechanisms? That's a great question, although a bit early to ask :-)

yurishkuro commented 3 years ago

I would be -1 on creating such a repo under Jaeger org, because it creates an expectation that project maintainers are responsible for it. There's nothing wrong with having a plugin repo elsewhere, under the org of the real authors/maintainers, as is the case with InfluxDB and Logz.io plugins. If such repo is well-maintained, we can link to it from the official docs.

bobrik commented 3 years ago

Good to know about the v2 plans with regards to plugins. I'll revisit a more official place for the Clickhouse plugin when v2 is ready, but for now, if anyone wants to use the current implementation, feel free to apply the patches from my branch:

We're using it internally at Cloudflare.

yurishkuro commented 3 years ago

@bobrik to clarify, you implemented CH storage provider as a built-in component in a fork, not as gRPC plugin that can be attached at runtime, correct?

bobrik commented 3 years ago

@yurishkuro correct.

meshpaul commented 3 years ago

@bobrik Did you get to run this on a real CH cluster? I would imagine that with sharding across a CH cluster, search and ingestion performance would improve further.

bobrik commented 3 years ago

Yes, we run it on a 3 node cluster with 3x replication these days.

meshpaul commented 3 years ago

Yes, we run it on a 3 node cluster with 3x replication these days.

So how does this compare to the Elasticsearch equivalent, in terms of the infrastructure needed for the backends, ClickHouse vs Elasticsearch?

For example, you previously mentioned "Search performance is roughly the same, but I'm comparing single Clickhouse node to 48 Elasticsearch nodes". To clarify: per your statement, does 1 CH server perform like 48 ES nodes in terms of the total infrastructure needed to meet the storage, search, and ingestion aspects of performance?

In addition, do you mind sharing the total user base of ES vs CH?

Any additional info you can supply is greatly appreciated!

bobrik commented 3 years ago

In terms of cluster size: ES was 48 nodes, ClickHouse is 3 nodes. Same amount of spans ingested in each: ~250K spans/s.

Same queries and the same user base for both, since both serve the same UI endpoint (ES served it before, ClickHouse does it now).

Tracing wasn't the only thing in the ES cluster, but it was by far the biggest and occupied more than half of resources.

mcspring commented 3 years ago

@bobrik Are you still working on this issue? I found some issues with your forked version, and I have some commits that fix them. Should I submit a PR to you?

jkowall commented 3 years ago

@bobrik Very nice, much more efficient, but there isn't really a scale-out mechanism for ClickHouse, which is a big concern for us. You also wouldn't have Kibana to query it, which is pretty useful for dashboarding and analytics. Here is a dashboard we use in Kibana (this is demo data).

Maybe there is interest in building a storage plugin within Jaeger v2 to make this more maintainable?

Slach commented 3 years ago

@jkowall why do you think ClickHouse doesn't have scale-out? Near-linear scalability is available in ClickHouse itself by design.

Just add a shard with hosts to <remote_servers> in /etc/clickhouse-server/config.d/remote_servers.xml and run CREATE TABLE ... Engine=Distributed(cluster_name, db.table, sharding_key_expression) ON CLUSTER ... ;

After that you can insert into any host and the data will be spread via the sharding key, or you can use chproxy / haproxy to ingest data directly into the MergeTree tables.

You can also read from all servers, with smarter aggregation, through the Distributed table.
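
A minimal sketch of that recipe, assuming a cluster named jaeger is defined in remote_servers.xml and a local jaeger_spans_v2 MergeTree table exists on every shard:

-- Reads and writes go through the Distributed table,
-- which spreads rows across shards by the sharding key.
CREATE TABLE jaeger_spans_v2_dist ON CLUSTER jaeger AS jaeger_spans_v2
ENGINE = Distributed(jaeger, currentDatabase(), jaeger_spans_v2, cityHash64(traceID));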

jkowall commented 3 years ago

@Slach yes, I saw that, but if you have hundreds of nodes it can be an issue, I would assume. We run thousands of ES nodes. The bigger issue would be supporting ClickHouse in Kibana, but that's for another project :)

bobrik commented 3 years ago

@mcspring I'm not actively working on it at the moment, since it doesn't require any maintenance from me. Feel free to submit a PR and I'll get to it eventually.

Slach commented 3 years ago

@jkowall try to use https://github.com/Altinity/clickhouse-operator/

otisg commented 3 years ago

It is unlikely that we'll add any other storage mechanism to the core repository in the near future. What will probably happen is that we'll change the architecture of how this works for Jaeger v2, making it easier to plug your own storage plugin.

@jpkrohling is this still the case? https://github.com/jaegertracing/jaeger/issues/638 is still open. Are you saying this should really be closed because a completely different approach is being pursued?

making it easier to plug your own storage plugin.

I tried finding an issue for this, but could not find it. Is there an issue for this? Thanks!

yurishkuro commented 3 years ago

@otisg I don't know about the issue, but the approach is that we're rebuilding the Jaeger backend on top of the OpenTelemetry collector, which has a mechanism for combining various packages (even from different repos) at build time without explicit code dependencies in the core. That means storage implementations can directly implement the storage interface and don't have to go through the much less efficient grpc-plugin interface.

jpkrohling commented 3 years ago

I don't think we have an issue tracking this specific feature, but we know it will be there for v2, which is being tracked here: https://github.com/jaegertracing/jaeger/milestone/15

chhetripradeep commented 3 years ago

Based on personal experience, I would like to recommend adding clickhouse as one of the core storage options for jaeger (like elasticsearch and cassandra); if not, we should probably port Ivan's work to a gRPC plugin.

There are a few benefits I can think of:

Uber's migration of their logging pipeline from ES to clickhouse[0] is a very good example of clickhouse performance.

[0] - https://eng.uber.com/logging/

levonet commented 3 years ago

I also like the idea of adding the plugin as one of the core storage options for jaeger. This would allow organically adding clickhouse to the existing jaeger-operator and helm-charts.

We have been using Ivan's plugin for half a year. Using clickhouse as storage is almost invisible in terms of infrastructure resources. Docker images with an updated version of jaeger and a startup example can be found at https://github.com/levonet/docker-jaeger if anyone is interested.

pavolloffay commented 3 years ago

I have migrated https://github.com/bobrik/jaeger/tree/ivan/clickhouse/plugin/storage/clickhouse to a storage plugin. The source code is available here: https://github.com/pavolloffay/jaeger-clickhouse.

For now I am not planning on maintaining it, I have built it for our internal purposes. However, if there is interest and somebody would like to help maintain it, I am happy to chat about it.

The repository contains instructions on how to run it, and the code was tested locally (see the readme). When I run it on more data I get "Too many simultaneous queries. Maximum: 100", see https://github.com/pavolloffay/jaeger-clickhouse/issues/2. It might just be a DB configuration issue.