quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

When Metastore crashes, historical traces are no longer visible in Jaeger #3442

Closed Tyrion85 closed 1 year ago

Tyrion85 commented 1 year ago

**Describe the bug**
This is about Quickwit used as the storage backend for Jaeger traces, with traces shipped through the OTEL Collector.

The setup is in Kubernetes, installed via the helm chart. The Metastore can sometimes get overwhelmed and crash, for example it gets OOMKilled. After the Metastore recovers (with more memory thrown at it) and the Indexer pod is manually restarted, all "previous" traces are no longer searchable in Jaeger. The data still exists in the bucket.

I'm not sure why this happens, or how to mitigate it. Is this expected behaviour? A known issue? A quirk of my setup? I really don't have anything custom: just a bog-standard installation of Quickwit via its helm chart, the OTEL Collector installed via its helm chart, and Jaeger also installed via its helm chart with pretty much defaults, except for the settings that connect them to each other.

**Steps to reproduce (if applicable)**
Steps to reproduce the behavior:

  1. Install the Quickwit helm chart on k8s, install the OTEL Collector, install Jaeger
  2. Use default values for Quickwit
  3. Relevant OTEL Collector config:
    exporters:
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
      otlp/quickwit:
        endpoint: quickwit-indexer:7281
        tls:
          insecure: true
    service:
      extensions: [ health_check ]
      pipelines:
        traces:
          receivers: [ otlp,opencensus,zipkin,jaeger ]
          processors: [ batch ]
          exporters: [ jaeger,otlp/quickwit ]
  4. Relevant Jaeger config:
    storage:
      type: grpc-plugin
    query:
      extraEnv:
        - name: GRPC_STORAGE_SERVER
          value: quickwit-searcher:7281
    collector:
      extraEnv:
        - name: GRPC_STORAGE_SERVER
          value: quickwit-searcher:7281
  5. Make sure there are some traces stored in a bucket (I use GCP). Make sure traces are searchable in Jaeger
  6. Make the metastore crash. Recover it, and restart the indexer (for some reason, in my setup it cannot recover on its own and needs manual intervention)
  7. Search for "old" traces. In my case, they are no longer available, even though the data is still present in the bucket.

Some of the logs that can be observed:

ERROR quickwit_actors::spawn_builder: actor-failure cause=An IO error occurred: 'Directory kill switch was activated.' exit_status=Failure(An IO error occurred: 'Directory kill switch was activated.')


Searcher:

2023-05-30T10:10:40.700Z ERROR get_services:root_list_terms: quickwit_search::root: Leaf search response contains at least one failed split. failed_splits=[SplitSearchError { error: "Internal error: Failed to fetch hotcache and footer from s3://quickwit/indexes/otel-trace-v0 for split `01H1EKCGK65MKBKXAD3NYBEVZ9`.", split_id: "01H1EKCGK65MKBKXAD3NYBEVZ9", retryable_error: true }]



**Expected behavior**
The metastore can crash from time to time; this is normal in the cloud-native world. The expected behaviour is for old traces to remain searchable, as they were before the metastore crashed.

**Configuration:**
Given above

Quickwit helm chart **0.3.4**
Image: **quickwit/quickwit:v0.5.0** 
Let me know if I can provide any more detail here 🙏🏼 
fmassot commented 1 year ago

Thanks for the report @Tyrion85

Are you using a file-backed metastore (on s3?) or a postgresql metastore?
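For context, the two options differ only in the `metastore_uri` field of the Quickwit node config. A minimal sketch with placeholder URIs (everything else omitted):

    # File-backed metastore: index state lives as metastore.json files in the bucket.
    metastore_uri: s3://my-bucket/indexes

    # PostgreSQL metastore (the alternative):
    # metastore_uri: postgres://user:password@postgres:5432/quickwit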

Tyrion85 commented 1 year ago

Sorry, I should've mentioned that part, my bad. It's a file-backed metastore on GCP; afaik the underlying code treats it as "s3" and that's the config I use, but it's a GCP bucket.

I've just reproduced this again: I added more memory to the Metastore pod, which of course resulted in the creation of a new pod, and I "lost" all traces that were visible moments before adding more memory.

Tyrion85 commented 1 year ago

I obviously don't know anything about the source code, so I'm making a bunch of guesses here - maybe it's not an issue with the Metastore per se, but with the fact that I need to restart the indexer manually when the Metastore crashes? I don't have a persistent volume for the indexer, because it's turned off by default in the helm chart values. Should I have a persistent volume for the indexer? Maybe this is causing the "loss" of data?
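For reference, enabling indexer persistence in the chart values would look roughly like the sketch below; the key names here are a guess and should be checked against the chart's values.yaml:

    indexer:
      persistentVolume:    # hypothetical key names, verify against the chart
        enabled: true
        size: 10Gi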

fmassot commented 1 year ago

Ok, this is definitely a bug and the metastore should not behave like this.

How many splits do you have? What is your memory limit on the metastore pod?

The problem with the file-backed metastore is that... it works with only one writer, which should be the metastore pod. It's possible to get unexpected behavior when two metastore pods write to the same metastore file (which should be at indexes/my-index/metastore.json). For example, a possible scenario: one metastore pod adds a new split to metastore.json and puts it to the object storage, but the other metastore pod does the same thing... so you can lose a new split, for example, or, if a merge operation has just finished, you can lose that too.

I can try to reproduce this behavior on our EKS cluster.

Should I have a persistent volume for indexer?

You can lose some data, yes, but only data that is not yet committed to the metastore. For the trace index, we set a variable called commit_timeout to 5s, which means that every 5 seconds incoming data is packaged and published to the metastore (this phase can take a few seconds). So you can lose 10 to 20 seconds of data.
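For reference, that knob lives in the index configuration under indexing_settings; a minimal sketch (the otel trace index ships with a built-in config, so this is only illustrative and the values are placeholders):

    version: 0.6
    index_id: otel-traces-v0_6
    # doc_mapping omitted for brevity
    indexing_settings:
      commit_timeout_secs: 5   # incoming data is packaged into a split and published roughly every 5 seconds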

But you should not lose already committed data (= splits present in the metastore).

If you can share the metastore.json in a DM on discord I can investigate more on this to understand why your metastore is taking too much memory.

Tyrion85 commented 1 year ago

How many splits do you have? What is your memory limit on the metastore pod?

Around 5k+ splits (objects in the bucket). The memory limit was 1Gi and it was getting exceeded; it is now bumped to 3Gi.
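In chart values, that bump corresponds to a standard Kubernetes resources block for the metastore pod; whether the chart exposes it exactly under this key is an assumption:

    metastore:
      resources:           # assumed values key; the nested fields are a plain Kubernetes resource spec
        requests:
          memory: 1Gi
        limits:
          memory: 3Gi      # raised from 1Gi after the OOMKills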

It's possible to get some unexpected behavior when you have two metastore pods that are writing to the same metastore file (which should be at indexes/my-index/metastore.json).

If there's an issue with two Metastore pods coexisting, maybe we should switch the k8s Deployment strategy.type from RollingUpdate to Recreate, or use a StatefulSet instead, to ensure only one pod exists (see the sketch below)? As it is right now with RollingUpdate in my current setup (the default in the helm chart), this can cause an issue every time the podTemplate changes for any reason, I guess.
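The plain Kubernetes fields for that idea would be roughly the following; how (or whether) the helm chart exposes them is an assumption to verify:

    # Deployment spec fragment: Recreate terminates the old metastore pod
    # before the new one starts, so two writers never overlap.
    spec:
      replicas: 1
      strategy:
        type: Recreate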

If you can share the metastore.json in a DM on discord I can investigate more on this to understand why your metastore is taking too much memory.

I'll try to do that, thank you! 🙏🏼 In the meantime, I did notice that the vast majority of entries in that metastore.json file have split_state=MarkedForDeletion; only two entries are Published. This probably relates to https://github.com/quickwit-oss/quickwit/issues/3258? Could it be that I shouldn't even see the "old data" because it is already MarkedForDeletion, and I only see it by accident because the janitor cannot delete objects from the GCP bucket? Although, there are only about 1k splits in that file, while there are 5k+ files in the bucket.

fmassot commented 1 year ago

@Tyrion85 I tried to reproduce your issue with the 0.6 version and thousands of splits. The memory taken by the metastore remains very low (under 200MB). 0.6 should also fix the MarkedForDeletion splits that were not garbage collected.

To avoid two file-backed metastore writers, we need to fix the helm chart; I will open an issue on the helm repo.

Can you try to update to 0.6 and monitor the metastore memory usage to see if it's stable?

Tyrion85 commented 1 year ago

@fmassot sorry for the late response, finally got some time to devote to this 🙏

Since the upgrade to 0.6, memory usage for the metastore hovers around 9MB! 🎉 The improvement is visible in this chart:

(Screenshot 2023-07-31 at 12:56: metastore memory usage chart)

I also switched from file-backed metastore to postgres-backed metastore, and things have been much more stable since!

However, there are still errors like this one:

Failed to fetch hotcache and footer from s3://.../indexes/otel-traces-v0_6 for split `01H6F9VP5YM9GGQG6FXW7MCMXR``., split_id: 01H6F9VP5YM9GGQG6FXW7MCMXR), (Internal error: `Failed to fetch hotcache and footer from s3://.../indexes/otel-traces-v0_6 for split `01H6FYNZQKDKGFCBTTSDPZWCYD``., split_id: 01H6FYNZQKDKGFCBTTSDPZWCYD), (Internal error: `Failed to fetch hotcache and footer from s3://.../indexes/otel-traces-v0_6 for split `01H6FEK42XJAJ7VVK6QVRT75VT``., split_id: 01H6FEK42XJAJ7VVK6QVRT75VT)`.", "level":"error", "msg":"HTTP handler, Internal Server Error", "stacktrace":"github.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).handleError
    github.com/jaegertracing/jaeger/cmd/query/app/http_handler.go:494
github.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).getOperationsLegacy
    github.com/jaegertracing/jaeger/cmd/query/app/http_handler.go:187
net/http.HandlerFunc.ServeHTTP
    net/http/server.go:2122
github.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5
    github.com/opentracing-contrib/go-stdlib@v1.0.0/nethttp/server.go:154
net/http.HandlerFunc.ServeHTTP
    net/http/server.go:2122
net/http.HandlerFunc.ServeHTTP
    net/http/server.go:2122
github.com/gorilla/mux.(*Router).ServeHTTP
    github.com/gorilla/mux@v1.8.0/mux.go:210
github.com/jaegertracing/jaeger/cmd/query/app.additionalHeadersHandler.func1
    github.com/jaegertracing/jaeger/cmd/query/app/additional_headers_handler.go:28
net/http.HandlerFunc.ServeHTTP
    net/http/server.go:2122
github.com/gorilla/handlers.CompressHandlerLevel.func1
    github.com/gorilla/handlers@v1.5.1/compress.go:141
net/http.HandlerFunc.ServeHTTP
    net/http/server.go:2122
github.com/gorilla/handlers.recoveryHandler.ServeHTTP
    github.com/gorilla/handlers@v1.5.1/recovery.go:78
net/http.serverHandler.ServeHTTP
    net/http/server.go:2936
net/http.(*conn).serve
    net/http/server.go:1995", "ts":1.6907950797421956E9}

which results in the same outcome (no traces visible in Jaeger, only the error is shown). I still need to dig in and investigate this; at first glance, it seems that the splits mentioned in the error log are not present in the GCP bucket. There are no errors visible in the Janitor (it seems like it's deleting files from GCS properly).

Any hint on what I could check? This might also be a completely different issue altogether.

Tyrion85 commented 1 year ago

Any hint on what I could check? This might also be a completely different issue altogether.

The bucket had a retention policy by default 🤦🏼 Of course the split is not found! 😄

So in my mind, Quickwit works perfectly fine atm! I'll remove the retention policy and let it run for a couple of days & report back!

fmassot commented 1 year ago

@Tyrion85, thanks for the feedback. It could be the PostgreSQL metastore that improves the memory usage; I've run a lot of experiments with the file-backed metastore since 0.6 and never saw memory usage that low, though...

fmassot commented 1 year ago

Closing.