Open shenqidebaozi opened 3 months ago
@shenqidebaozi do you have any performance comparison of the traces search procedure?
@akvlad So far, my evaluation of the different products has not included a comparison of trace search performance. But I think storage cost matters too.
@shenqidebaozi In your opinion, how many GB of HDD would be an equal replacement for 1 CPU core, in terms of price?
That is a good question, and I don't know how to measure it. But in my case these projects all run on the same configuration, so I hope qryn can take up less space. I briefly reviewed the code and found that payload does not seem to be actually used.
If there's anything really unused it can be avoided but I'm not sure that's the case. Compression and codec choices might also play a vital role and should be carefully reviewed.
This could be because Uptrace uses zstd compression by default with ClickHouse. Does qryn allow specifying compression?
qryn seems to use zstd on only 3-4 fields, which would explain the difference in size. An option to use zstd wherever possible would reduce disk usage substantially.
It would be useful to have an ENV for specifying the compression algorithm and level in qryn. For example, the default zstd level in ClickHouse is 1, compared to 3 for the zstd CLI.
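A minimal sketch of what such a setting could look like. The variable names `TRACES_CODEC` and `TRACES_CODEC_LEVEL` are hypothetical, not existing qryn options. The function validates the values and renders a ClickHouse `CODEC(...)` clause, falling back to LZ4 (ClickHouse's default codec) when nothing is set.

```javascript
// Sketch only: TRACES_CODEC and TRACES_CODEC_LEVEL are hypothetical names,
// not existing qryn settings.
// Codec name -> allowed [min, max] level, or null when the codec takes no level.
const CODECS = new Map([
  ['NONE', null],
  ['LZ4', null],
  ['LZ4HC', [1, 12]],
  ['ZSTD', [1, 22]],
]);

function codecClause(env) {
  const name = String(env.TRACES_CODEC || 'LZ4').toUpperCase();
  if (!CODECS.has(name)) throw new Error(`unsupported codec: ${name}`);
  const range = CODECS.get(name);
  const rawLevel = env.TRACES_CODEC_LEVEL;
  // No level configured (or not applicable): let ClickHouse use its default level.
  if (range === null || rawLevel === undefined || rawLevel === '') {
    return `CODEC(${name})`;
  }
  const level = Number(rawLevel);
  if (!Number.isInteger(level) || level < range[0] || level > range[1]) {
    throw new Error(`level for ${name} must be an integer in ${range[0]}..${range[1]}`);
  }
  return `CODEC(${name}(${level}))`;
}

console.log(codecClause({ TRACES_CODEC: 'zstd', TRACES_CODEC_LEVEL: '3' }));
// prints CODEC(ZSTD(3))
```

In a real service the argument would presumably be `process.env`, and the resulting clause would be appended to each column definition when the schema is created.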
This code search shows that Uptrace lets you configure the compression type and level, which then get appended to the ClickHouse schema.
https://github.com/search?q=repo%3Auptrace%2Fuptrace%20ch_schema&type=code
select
sum(rows) as row_count,
formatReadableSize(sum(data_uncompressed_bytes)) as uncompressed_size,
formatReadableSize(sum(data_compressed_bytes)) as compressed_size,
round(sum(data_compressed_bytes) / sum(data_uncompressed_bytes) * 100, 0) as compressed_pct
from system.parts;
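The query above gives one total for the whole server. To see which columns dominate disk usage, a per-column breakdown from `system.columns` may be more useful. The helper below is a rough sketch that only builds the query text; the function name and the database/table names in the example are placeholders, not qryn identifiers.

```javascript
// Builds a per-column size query against ClickHouse's system.columns table.
// Sketch only: columnSizeQuery is a hypothetical helper, not part of qryn.
function columnSizeQuery(database, table) {
  // Escape backslashes and single quotes for a ClickHouse string literal.
  const esc = (s) => String(s).replace(/\\/g, '\\\\').replace(/'/g, "\\'");
  return [
    'SELECT',
    '  name,',
    '  compression_codec,',
    '  formatReadableSize(data_uncompressed_bytes) AS uncompressed_size,',
    '  formatReadableSize(data_compressed_bytes) AS compressed_size,',
    '  round(data_compressed_bytes / data_uncompressed_bytes * 100, 0) AS compressed_pct',
    'FROM system.columns',
    `WHERE database = '${esc(database)}' AND table = '${esc(table)}'`,
    'ORDER BY data_compressed_bytes DESC',
  ].join('\n');
}

// Placeholder names; substitute the database and table that actually store the traces.
console.log(columnSizeQuery('my_db', 'my_traces_table'));
```

Running the generated query before and after a codec change should show whether a column such as payload is really where the space goes.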
What is the specific purpose of the payload field? It stores the original trace information, but I have not found where this field is used. In addition, this field adds an extra JSON serialization in both qryn and the qryn otel-collector, which incurs additional performance overhead.
https://github.com/metrico/qryn/blob/b4cda9e656f6720fe5ee7eaccb9dcd2c12e6ac82/lib/db/otlp.js#L18
For the second question, should we define payload as a ClickHouse Nested column, so that marshaling JSON into a string is no longer necessary? It might also compress better.
The only downside of using Nested is that it makes the field stricter than a string.
Also worth mentioning: fields like service_name, parent_id, payload_type, payload, and tags are all highly redundant and would benefit from compression.
@gaby we absolutely want compression choices to be as open as possible for experimenting. We could work on a set of ALTER statements we can use to experiment with.
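As a starting point, such a set of ALTER statements could be generated from a column-to-codec map. This is only a sketch: `alterCodecStatements` is a hypothetical helper, the table name is a placeholder, and the codec levels are arbitrary values to experiment with rather than recommendations. As far as I understand, ClickHouse applies a changed codec to newly written parts, and existing parts keep their old codec until they are rewritten by a merge.

```javascript
// Generates ALTER TABLE ... MODIFY COLUMN ... CODEC(...) statements for experiments.
// Sketch only: the helper name and the table name used below are hypothetical.
const IDENT = /^[A-Za-z_][A-Za-z0-9_]*$/;

function alterCodecStatements(table, codecs) {
  // Accept "table" or "db.table"; reject anything that is not a plain identifier.
  if (!table.split('.').every((part) => IDENT.test(part)) || table.split('.').length > 2) {
    throw new Error(`invalid table name: ${table}`);
  }
  return Object.entries(codecs).map(([column, codec]) => {
    if (!IDENT.test(column)) throw new Error(`invalid column name: ${column}`);
    // The codec expression itself is passed through as-is; the caller must supply a valid one.
    return `ALTER TABLE ${table} MODIFY COLUMN ${column} CODEC(${codec})`;
  });
}

const statements = alterCodecStatements('my_db.my_traces_table', {
  payload: 'ZSTD(3)',
  service_name: 'ZSTD(1)',
  tags: 'ZSTD(1)',
});
console.log(statements.join(';\n') + ';');
```

Keeping the map as data makes it easy to try several codec combinations against the same data set and compare sizes after each run.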
@lmangani That would be a good starting point, or updating the CREATE TABLE and testing with a big data set to see the difference in size/performance. Compression will add latency and reduce throughput, which is why it should be configurable.
@gaby It would also save one round of JSON marshal/unmarshal, which helps with bulk writes and queries.
According to ChatGPT, the same CREATE TABLE statement with ZSTD would look like:
CREATE TABLE IF NOT EXISTS {{DB}}.traces_input {{{OnCluster}}} (
oid String DEFAULT '0' CODEC(ZSTD),
trace_id String CODEC(ZSTD),
span_id String CODEC(ZSTD),
parent_id String CODEC(ZSTD),
name String CODEC(ZSTD),
timestamp_ns Int64 CODEC(DoubleDelta, ZSTD),
duration_ns Int64 CODEC(ZSTD),
service_name String CODEC(ZSTD),
payload_type Int8 CODEC(ZSTD),
payload String CODEC(ZSTD),
tags Array(Tuple(String, String)) CODEC(ZSTD)
) Engine=Null
When asked to add levels based on field type, it produces the following:
CREATE TABLE IF NOT EXISTS {{DB}}.traces_input {{{OnCluster}}} (
oid String DEFAULT '0' CODEC(ZSTD(1)),
trace_id String CODEC(ZSTD(3)), -- Likely to benefit from more compression
span_id String CODEC(ZSTD(3)), -- Likely to benefit from more compression
parent_id String CODEC(ZSTD(1)),
name String CODEC(ZSTD(1)),
timestamp_ns Int64 CODEC(DoubleDelta, ZSTD(3)), -- Larger data size, benefits from more compression
duration_ns Int64 CODEC(ZSTD(3)), -- Larger data size, benefits from more compression
service_name String CODEC(ZSTD(1)),
payload_type Int8 CODEC(ZSTD(1)),
payload String CODEC(ZSTD(3)), -- Assuming payloads can be large/structured, they might benefit more
tags Array(Tuple(String, String)) CODEC(ZSTD(1))
) Engine=Null
@gaby Let us know how this plays out. If it produces a visible effect, we can most definitely implement options to trigger it.
For the same 10 million traces, qryn requires 18 GB of storage while Uptrace requires only 4 GB. This seems to be because the payload is stored as a JSON string, which cannot be optimized. https://github.com/metrico/qryn/blob/b4cda9e656f6720fe5ee7eaccb9dcd2c12e6ac82/lib/db/maintain/scripts.js#L158-L170
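To put those totals on a per-trace basis, here is a quick back-of-the-envelope calculation. It assumes decimal gigabytes (1 GB = 10^9 bytes), which the figures above do not specify.

```javascript
// Per-trace storage from the totals reported above.
// Assumption: 1 GB = 1e9 bytes; the original figures do not say which unit was meant.
const TRACES = 10_000_000;
const bytesPerTrace = (gigabytes) => (gigabytes * 1e9) / TRACES;

const qrynBytes = bytesPerTrace(18);    // 1800 bytes per trace
const uptraceBytes = bytesPerTrace(4);  // 400 bytes per trace
const ratio = qrynBytes / uptraceBytes; // 4.5

console.log({ qrynBytes, uptraceBytes, ratio });
```

So qryn stores roughly 1.8 KB per trace against about 0.4 KB for Uptrace, a factor of 4.5.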