Closed: Tyrion85 closed this issue 1 year ago.
Thanks for the report @Tyrion85
Are you using a file-backed metastore (on S3?) or a PostgreSQL metastore?
Sorry, should've mentioned that part, my bad. It's a file-backed metastore on GCP; afaik the underlying code treats it as "s3", and this is the config I use, but it's a GCP bucket.
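Roughly, the relevant part of the config looks like this (a sketch, not copied verbatim; the bucket name is a placeholder and the endpoint mechanism is my understanding of the docs):

```yaml
# Node config sketch (bucket name is a placeholder). GCS is reached through
# Quickwit's S3-compatible storage layer; the endpoint is set separately
# (e.g. QW_S3_ENDPOINT=https://storage.googleapis.com, if I read the docs right).
metastore_uri: s3://my-gcs-bucket/indexes
default_index_root_uri: s3://my-gcs-bucket/indexes
```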
I've just reproduced this again: I added more memory to the metastore pod, which of course resulted in the creation of a new pod, and I "lost" all traces that were visible moments before adding more memory.
I obviously don't know anything about the source code, so I'm making a bunch of guesses here - maybe it's not an issue with the metastore per se, but with the fact that I need to restart the indexer manually when the metastore crashes? I don't have a persistent volume for the indexer, because it's turned off by default in the helm chart values. Should I have a persistent volume for the indexer? Maybe this is causing the "loss" of data?
Ok, this is definitely a bug and the metastore should not behave like this.
How many splits do you have? What is your memory limit on the metastore pod?
The problem with the file-backed metastore is that it works with only one writer, which should be the metastore pod. It's possible to get some unexpected behavior when you have two metastore pods writing to the same metastore file (which should be at indexes/my-index/metastore.json). For example, a possible scenario: one metastore pod adds a new split to metastore.json and makes a PUT to the object storage, but the other metastore pod could do the same thing... so you can lose a new split, for example, or, if a merge operation has just finished, you can lose that too.
I can try to reproduce this behavior on our EKS cluster.
Should I have a persistent volume for the indexer?
You can lose some data, yes. But it's data that is not committed to the metastore. For the trace index, we set a variable called commit_timeout to 5s, which means that every 5 seconds, incoming data is packaged and published to the metastore (this phase can take a few seconds). So you can lose 10 to 20 seconds of data.
But you should not lose already committed data (= splits present in the metastore).
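For context, for a user-defined index the same knob is commit_timeout_secs under indexing_settings; a rough sketch, with the index id as a placeholder and field names as I understand the 0.6 index config format (the built-in otel trace index is managed by Quickwit itself):

```yaml
version: 0.6                  # config format version (assumed)
index_id: my-traces           # placeholder; not the built-in otel-traces index
doc_mapping:
  field_mappings:
    - name: body
      type: text
indexing_settings:
  commit_timeout_secs: 5      # data received within this window is not yet committed,
                              # so it can be lost if the indexer restarts
```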
If you can share the metastore.json in a DM on Discord, I can investigate more to understand why your metastore is taking so much memory.
How many splits do you have? What is your memory limit on the metastore pod?
Around 5k+ splits (objects in the bucket). The memory limit was 1Gi and it was getting exceeded; it is now bumped to 3Gi.
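For reference, the bump was done through the chart values; something along these lines (the exact key layout is an assumption about the quickwit helm chart, check its default values.yaml):

```yaml
# values.yaml excerpt; key names assumed, not copied from the chart
metastore:
  resources:
    requests:
      memory: 1Gi
    limits:
      memory: 3Gi
```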
It's possible to get some unexpected behavior when you have two metastore pods that are writing to the same metastore file (which should be at indexes/my-index/metastore.json).
If there's an issue with two metastore pods coexisting, maybe we should switch the k8s Deployment strategy.type from RollingUpdate to Recreate, or use a StatefulSet instead, to ensure only one pod exists at a time? As it is right now with RollingUpdate in my current setup (the default in the helm chart), this can cause an issue every time the podTemplate changes for any reason, I guess.
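For reference, a minimal sketch of the relevant Deployment fields (the resource name here is illustrative; the real manifest comes from the helm chart):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: quickwit-metastore    # illustrative name
spec:
  replicas: 1
  strategy:
    type: Recreate            # terminate the old pod before starting the new one,
                              # so two metastore writers never overlap during a rollout
```

The trade-off is a short window with no metastore pod during each rollout, which seems acceptable given that a single writer is a hard requirement for the file-backed metastore.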
If you can share the metastore.json in a DM on Discord, I can investigate more to understand why your metastore is taking so much memory.
I'll try to do that, thank you! 🙏🏼 In the meantime, I did notice that the vast majority of that metastore.json file has split_state=MarkedForDeletion - only two entries are Published. This probably relates to https://github.com/quickwit-oss/quickwit/issues/3258? Could it be that I shouldn't even see the "old data" because it is already MarkedForDeletion, but I see it by accident because the janitor cannot delete stuff from the GCP bucket? Although, there are only about 1k splits in that file, and there are 5k+ files in the bucket.
@Tyrion85 I tried to reproduce your issue with the 0.6 version and thousands of splits. The memory taken by the metastore remains very low (under 200MB).
0.6 should also fix the MarkedForDeletion splits that were not garbage collected.
To avoid two file-backed metastore writers, we need to fix the helm chart; I will open an issue on the helm repo.
Can you try to update to 0.6 and monitor the metastore memory usage to see if it's stable?
@fmassot sorry for the late response, I finally got some time to devote to this 🙏
Since the upgrade to 0.6, memory usage for the metastore hovers around 9MB! 🎉 The improvement is clearly visible in the attached memory-usage chart.
I also switched from the file-backed metastore to the postgres-backed metastore, and things have been much more stable since!
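The switch itself boils down to a different metastore_uri in the node config; a sketch with placeholder credentials and bucket:

```yaml
# PostgreSQL-backed metastore; host, credentials and database are placeholders
metastore_uri: postgres://quickwit:change-me@postgres.default.svc.cluster.local:5432/quickwit_metastore
# split files still live on object storage
default_index_root_uri: s3://my-gcs-bucket/indexes
```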
However, there are still errors like this one:
Failed to fetch hotcache and footer from s3://.../indexes/otel-traces-v0_6 for split `01H6F9VP5YM9GGQG6FXW7MCMXR``., split_id: 01H6F9VP5YM9GGQG6FXW7MCMXR), (Internal error: `Failed to fetch hotcache and footer from s3://.../indexes/otel-traces-v0_6 for split `01H6FYNZQKDKGFCBTTSDPZWCYD``., split_id: 01H6FYNZQKDKGFCBTTSDPZWCYD), (Internal error: `Failed to fetch hotcache and footer from s3://.../indexes/otel-traces-v0_6 for split `01H6FEK42XJAJ7VVK6QVRT75VT``., split_id: 01H6FEK42XJAJ7VVK6QVRT75VT)`.", "level":"error", "msg":"HTTP handler, Internal Server Error", "stacktrace":"github.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).handleError
github.com/jaegertracing/jaeger/cmd/query/app/http_handler.go:494
github.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).getOperationsLegacy
github.com/jaegertracing/jaeger/cmd/query/app/http_handler.go:187
net/http.HandlerFunc.ServeHTTP
net/http/server.go:2122
github.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5
github.com/opentracing-contrib/go-stdlib@v1.0.0/nethttp/server.go:154
net/http.HandlerFunc.ServeHTTP
net/http/server.go:2122
net/http.HandlerFunc.ServeHTTP
net/http/server.go:2122
github.com/gorilla/mux.(*Router).ServeHTTP
github.com/gorilla/mux@v1.8.0/mux.go:210
github.com/jaegertracing/jaeger/cmd/query/app.additionalHeadersHandler.func1
github.com/jaegertracing/jaeger/cmd/query/app/additional_headers_handler.go:28
net/http.HandlerFunc.ServeHTTP
net/http/server.go:2122
github.com/gorilla/handlers.CompressHandlerLevel.func1
github.com/gorilla/handlers@v1.5.1/compress.go:141
net/http.HandlerFunc.ServeHTTP
net/http/server.go:2122
github.com/gorilla/handlers.recoveryHandler.ServeHTTP
github.com/gorilla/handlers@v1.5.1/recovery.go:78
net/http.serverHandler.ServeHTTP
net/http/server.go:2936
net/http.(*conn).serve
net/http/server.go:1995", "ts":1.6907950797421956E9}
which results in the same outcome (no traces visible in Jaeger, only the error is shown). I still need to deep dive and investigate this - at first glance, it seems that the splits mentioned in the error log are not present in the GCP bucket. There are no errors visible in the janitor (it seems like it's deleting files properly from GCS).
Any hint on what I could check? This might also be a completely different issue altogether.
Any hint on what I could check? This might also be a completely different issue altogether.
The bucket had a retention policy by default 🤦🏼 Of course the split is not found! 😄
So in my mind, Quickwit works perfectly fine atm! I'll remove the retention policy and let it run for a couple of days & report back!
@Tyrion85, thanks for the feedback. It could be the PostgreSQL metastore that improves the memory usage. I have run a lot of experiments with the file-backed metastore since 0.6 and never saw such memory usage, though...
Closing.
Describe the bug
This is regarding Quickwit as storage for Jaeger traces and the OTEL Collector.
Setup is in Kubernetes, installed via the helm chart. The metastore can sometimes get overwhelmed and crash, for example it can get OOMKilled. At that point, after the metastore recovers (more memory thrown at it) and the indexer pod is manually restarted, all "previous" traces searched in Jaeger are lost. The data still exists in the bucket.
Not sure why this happens, or how to mitigate it. Is this expected behaviour? A known issue? A quirk of my setup? Although, I really don't have anything custom - just a bog-standard installation of Quickwit via the helm chart, the OTEL Collector installed via its helm chart, and Jaeger also installed via its helm chart with pretty much defaults, except for connecting them to each other.
Steps to reproduce (if applicable)
Some of the logs that can be observed:
ERROR quickwit_actors::spawn_builder: actor-failure cause=An IO error occurred: 'Directory kill switch was activated.' exit_status=Failure(An IO error occurred: 'Directory kill switch was activated.')
2023-05-30T10:10:40.700Z ERROR get_services:root_list_terms: quickwit_search::root: Leaf search response contains at least one failed split. failed_splits=[SplitSearchError { error: "Internal error: `Failed to fetch hotcache and footer from s3://quickwit/indexes/otel-trace-v0 for split `01H1EKCGK65MKBKXAD3NYBEVZ9``.", split_id: "01H1EKCGK65MKBKXAD3NYBEVZ9", retryable_error: true }]