Closed: @mr-karan closed this issue 2 years ago
@mr-karan in a pipeline, actions happen in the sequence they appear, so the JSON filtering happens after the JSON parsing, which happens after the Clickhouse SELECT. The following filter would produce results closer to your expectations at higher speeds:
{job="app",source="haproxy"} |~ "ABCD1234" | json | user="ABCD1234"
@lmangani The problem with using `|~ "ABCD1234"` as the first stage is that Clickhouse has to scan through all the rows and do a string match on `ABCD1234`. It's quite slow; with around 117mn rows, it took ~27s:
1 rows in set. Elapsed: 26.167 sec. Processed 117.02 million rows, 70.85 GB (4.47 million rows/s., 2.71 GB/s.)
What would be a good way to speed this up? I understand the limitation that filters are applied after the parsing stage, but if it were somehow possible to push the `WHERE user='ABCD1234'` down to Clickhouse, I believe that would give a significant performance boost. Just thinking out loud.
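A pushed-down filter might look something like the sketch below in Clickhouse SQL. The table and column names (`samples`, `timestamp_ms`, `string`) and the time range are illustrative assumptions, not cLoki's actual schema or generated query:

```sql
-- Sketch: evaluate the JSON field filter inside Clickhouse instead of
-- after parsing in the application. Clickhouse still scans the rows in
-- the time range, but non-matching rows never leave the database.
SELECT timestamp_ms, string
FROM samples
WHERE JSONExtractString(string, 'user') = 'ABCD1234'
  AND timestamp_ms BETWEEN 1609459200000 AND 1609462800000
ORDER BY timestamp_ms
LIMIT 1000;
```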
It's hard to tell if this performance is good or bad without knowing anything about your cluster/node specs and what timespan this query did include. Could you provide a more detailed picture of this usecase?
I re-read your comment. The behavior you are expecting only applies to labels, not logs, for which there is no format expectation whatsoever (they could be strings or whatnot). So to boost performance, you should add `userId` as a label; that allows Clickhouse to efficiently limit the returned rows pre-parsing, making this work as fast as Clickhouse can.
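With `userId` stored as a label, the same lookup becomes a plain stream-selector query (assuming the label is named `userId` at ingest time):

```
{job="app",source="haproxy",userId="ABCD1234"}
```

Label matchers are resolved against the label index before any log line is parsed, which is what avoids the full-text scan entirely.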
> So to boost performance, you should add userId as a label for this to work as fast as clickhouse can by allowing it to efficiently limit the returned rows pre-parsing.
Ah, would high cardinality not deter the performance? Loki advises not to put labels which have dynamic values. So, if I have a million distinct user IDs, would the number of fingerprint hashes be a LOT for Clickhouse to query? Do you foresee any problems with this, or do you think the index scan is slow in Loki but the same won't be an issue with Clickhouse?
> Could you provide a more detailed picture of this usecase?
Sure. Currently I am ingesting a file of 250GB (raw, uncompressed) and testing out cLoki. This data needs to be searchable for a couple of weeks; there's some custom logic to prune perhaps 70% of the logs after a month, and the remaining 30% should stay searchable for a long duration.
I am running vector, cLoki and clickhouse-server all on a single EC2 instance (8 vCPU/64GB). clickhouse-server actually isn't taking up many resources, but TBH I'm running an out-of-the-box default config and haven't looked at tweaking it so far.
The 117mn log lines were searched in a timespan of 1h (a busy time for the app).
Thanks for sharing the details!
> Ah, would high cardinality not deter the performance? Loki advises not to put labels which have dynamic values.
Our labels are just stored JSON objects with time boundaries, and your semi-idle Clickhouse should have no issues dealing with this synthetic cardinality. Actually, this is exactly one of the limits we're pushing with cLoki, and your usecase might help demonstrate how plausible this approach is and what optimizations might benefit its design.
Sounds interesting. I'll re-configure the labels and get back. Thanks for your advice!
@mr-karan did the suggested approach sort any better results?
@lmangani So, I did try adding a user_id label. The performance didn't become worse, but it became only slightly better.
To give some context: I've been evaluating Loki for storing and analysing TBs of logs. I checked out this project because Clickhouse is a really fast columnar DB, purpose-built for this. But I don't really understand why Loki's approach of an "index" plus a separate "chunk" store is needed in Clickhouse, or how it makes things faster. I looked at the code and saw there's a LEFT JOIN between the fingerprint table and the log data; from my analysis, it looks like this is actually slowing things down. Instead of utilising the DB itself as the index, there's an additional table lookup and then a JOIN on it, and I'm not really sure that's an optimal way of doing things. Hence, I've switched to just vanilla Clickhouse and am experimenting more with it.
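For reference, the two query shapes being compared look roughly like the sketch below. All table and column names (`time_series`, `samples`, `logs`, `fingerprint`, `labels`, `string`) are illustrative assumptions, not taken from cLoki's actual code:

```sql
-- Fingerprint-indexed shape: match label sets in a small metadata table,
-- then join the resulting fingerprints against the log rows.
SELECT s.timestamp_ms, s.string
FROM samples AS s
LEFT JOIN time_series AS ts ON ts.fingerprint = s.fingerprint
WHERE ts.labels LIKE '%"source":"haproxy"%';

-- "Vanilla Clickhouse" shape: one wide table with the fields as columns,
-- filtered directly so the table's primary key / ORDER BY can prune parts.
SELECT timestamp_ms, message
FROM logs
WHERE source = 'haproxy'
  AND user = 'ABCD1234';
```

Which shape wins depends on cardinality and on how well the wide table's sorting key matches the filters, which is exactly the kind of question benchmarks would settle.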
OTOH, this is an excellent approach/project for someone who needs a Loki-like UI with Clickhouse storage. Since it didn't fit the exact usecase I've described above, I've not had the chance to benchmark it properly and don't have concrete numbers to share, so I'll close this issue.
Thanks a lot for your time; I really appreciate your help.
I think you might have some misunderstandings and assumptions about our actual design. The fingerprinting concept is from Prometheus, and it provides an exceptional lookup layer; it has nothing to do with a "chunk" store, as we have no chunks. Unfortunately, without any benchmarks this observation doesn't help improve anything, does it? I think it would be fair to share back those "slow" examples if you really had any.
> I've not had the chance to benchmark it properly and don't have concrete numbers to share,
Just curious, what are the above statements based on exactly then?
Thanks for your time, @mr-karan. One more piece of advice:
{job="app",source="haproxy"} | json | user="ABCD1234"
You are right: this one has issues, because the "parameterless" `| json` parser is processed on the JS side.
But another request should work:
{job="app",source="haproxy"} | json user="user" | user="ABCD1234"
The parameterless `| json` issue will be debugged in the next version. Maybe you can open a new issue, or reopen this one, so I can ping you when the `| json` with no parameters case is fixed.
Firstly, I want to add a note that I hope you're not getting annoyed with the issues I've opened :) I really am enjoying using cLoki and am just presenting some more minor feature additions etc. in the hope that it'll make it even better.
I had a question about how label queries are implemented. I read the docs and found that `| json | my_field="<value>"` is supported. However, I am seeing a weird edge case in my logs.
My logs are in JSON, and the behavior when I search for the fields is different. I did some debugging using the Clickhouse query logs and found that for both of the queries below, cLoki sends the same query to Clickhouse:
{job="app",source="haproxy"} | json | user="ABCD1234"
{job="app",source="haproxy"} | json
The query which is sent is:
I was expecting a mention of ABCD1234 in the Clickhouse query itself, as Clickhouse is the source of truth for querying. But I am guessing what is happening is that the extra label filters are processed/searched at the cLoki app level and not in Clickhouse. That leads to the behavior where, if the first 1000 lines of the dataset don't contain user=ABCD1234, the app-level search will simply return an empty list. I could be wrong in my assumptions here, but to me this looks like the case.
Should we not do a JSONExtractString() on the string field and pass the value of user to Clickhouse itself?
Thanks, again!
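The suggested pushdown might look something like this (a sketch only; `samples`, `timestamp_ms` and the `string` column are assumptions about the schema, and the real query cLoki generates will differ):

```sql
-- Sketch: apply the label filter inside Clickhouse so the LIMIT is taken
-- over matching rows, not over the first N rows of the whole stream.
SELECT timestamp_ms, string
FROM samples
WHERE JSONExtractString(string, 'user') = 'ABCD1234'
ORDER BY timestamp_ms DESC
LIMIT 1000;
```

With the filter applied app-side after a `LIMIT 1000`, any match outside those first 1000 rows is invisible, which is exactly the empty-result behavior described above.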