@hperl @zepatrik
Increasing the memory didn't work. In fact, we got the worst result yet after it: 13 failures out of 25 attempts in a single day.
Do you have any other information why you reach a network timeout? What does your webhook target say?
Thanks for the questions.
What does your webhook target say?
We will have to add a logger to the endpoint first. After that, I will post our findings here.
It is now deployed in our production environment; let's see. I will write here as soon as I have relevant info.
So far, the webhook is not failing. To get there, we increased the memory. When I reported this issue, our Kratos instance was running with 256Mi. It worked fine before the upgrade, even though that was not much memory. Then we increased it to 320Mi and the problem was still there. Now we have 400Mi and it seems to work.
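For context, the memory changes above are just the container resource limits in our Helm values; a rough sketch (the exact key path depends on the chart, so take the snippet as illustrative rather than our literal file):

```yaml
# Illustrative sketch of the memory bump in our Helm values.
deployment:
  resources:
    requests:
      memory: 256Mi
    limits:
      memory: 400Mi   # previously 256Mi, then 320Mi
```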
No matter how much memory it has, we can still see such peaks in the memory graph. The spike started at 8:45 and stopped at 8:50, matching the 5-minute timeout, even though I couldn't find the webhook being triggered in the logs.
It seems there is a memory leak in Kratos; could it be related to this timeout?
There is currently a memory leak in Ory Kratos. It will be resolved with the next release!
I wanted to share a recent and critical issue we've been experiencing that may be related to this memory leak issue.
We have a service that makes requests to the Ory Kratos sessions/whoami route. Both our service (let's call it "System") and Kratos are running on the same node in a Kubernetes cluster. We've bumped into an issue where System starts to throw what look like packet-parsing errors, and only in the context of handling the requests that come back from Kratos.
Here's what our System logs look like during such an error:
[2023-06-01T20:24:30Z INFO middleware::authentication] Making request to Ory at "https://********/sessions/whoami"
[2023-06-01T20:24:30Z DEBUG ureq::stream] connecting to ********:443 at *.*.*.*:443
[2023-06-01T20:24:30Z DEBUG rustls::client::hs] Resuming session
[2023-06-01T20:24:30Z DEBUG rustls::client::hs] ALPN protocol is None
[2023-06-01T20:24:30Z DEBUG rustls::client::hs] Using ciphersuite TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
[2023-06-01T20:24:30Z DEBUG rustls::client::tls12::server_hello] Server agreed to resume
[2023-06-01T20:24:30Z WARN rustls::conn] Sending fatal alert BadRecordMac
[2023-06-01T20:24:30Z WARN middleware::authentication] Could not make request to Ory
[2023-06-01T20:24:30Z DEBUG ureq::stream] dropping stream: Stream(RustlsStream)
Note that we make a request over public DNS, even though Kratos and System are running on the same node, because we have only exposed the "public" Kratos API publicly.
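For concreteness, the request from System is essentially a plain GET that forwards the caller's session cookie; a minimal sketch with a placeholder URL (not our exact middleware code) would look like this:

```rust
// Minimal sketch of System's call to Kratos (ureq + rustls, which ureq uses
// by default). The URL is a placeholder and error handling is simplified.
fn whoami(session_cookie: &str) -> Result<String, Box<dyn std::error::Error>> {
    let response = ureq::get("https://auth.example.org/sessions/whoami") // placeholder host
        .set("Accept", "application/json")
        .set("Cookie", session_cookie) // forward the caller's ory_kratos_session cookie
        .call()?; // non-2xx responses surface as ureq::Error::Status
    Ok(response.into_string()?) // session JSON on 200 OK
}
```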
Once this happens, all responses from Kratos to System yield the same error, and at that point we just restart System, which typically alleviates the issue right away. We've done some digging into our side of things: for instance, we added code to also make a GET request to Google every time System makes the above GET to Kratos. The requests to Google keep working OK, even when System is unable to process the requests coming back from Kratos (i.e. System is definitely able, even in these moments, to handle at least some HTTP responses). Note that, even when the Kratos<>System communication is failing, we can engage Kratos from other clients.
Here are some logs from Kratos during one of these erroneous requests -- everything looks fine, which is weird!
time=2023-06-01T16:26:17Z level=info msg="completed handling request" http_request=map[headers:map[accept:application/json, text/plain, */* cookie:Value is sensitive and has been redacted. To see the value set config key "log.leak_sensitive_values = true" or environment variable "LOG_LEAK_SENSITIVE_VALUES=true". user-agent:axios/*.*.* x-forwarded-for:*.*.*.* x-forwarded-host:******** x-forwarded-port:80 x-forwarded-proto:http x-forwarded-scheme:http x-real-ip:*.*.*.* x-request-id:******** x-scheme:http] host:******** method:GET path:/sessions/whoami query:<nil> remote:*.*.*.*:58522 scheme:http] http_response=map[headers:map[cache-control:private, no-cache, no-store, must-revalidate content-type:application/json; charset=utf-8 vary:Origin x-kratos-authenticated-identity-id:********] size:1446 status:200 text_status:OK took:40.459424ms]
A few things are making us think there could be something going wrong in Kratos in these moments, including the "Sending fatal alert BadRecordMac" error itself, what we saw when purposefully memory-constraining the node that System and Kratos run on, and the fact that System opens a new HTTP connection on every request to Kratos. Thinking on that last point, we're wondering if it's possible that when Kratos is memory-constrained, it starts leaking connections. To express the idea in a bit more detail: maybe when Kratos is under memory duress, it stops closing certain connections properly, which eventually leaves it with too many open connections to a particular client; at that point, responses to that client start going out garbled.
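One client-side mitigation we may experiment with, sketched below with placeholder URL and timeout values: sharing a single ureq Agent so connections to Kratos are pooled and reused instead of being opened fresh on every request.

```rust
use std::time::Duration;

// Sketch only: keep one shared Agent so ureq reuses pooled connections to
// Kratos rather than opening a new TLS session per request. The timeouts and
// URL are illustrative, not tuned recommendations.
fn build_agent() -> ureq::Agent {
    ureq::AgentBuilder::new()
        .timeout_connect(Duration::from_secs(5)) // fail fast if Kratos is unreachable
        .timeout(Duration::from_secs(10))        // overall per-request timeout
        .build()
}

fn whoami_via(agent: &ureq::Agent, session_cookie: &str) -> Result<String, Box<dyn std::error::Error>> {
    let response = agent
        .get("https://auth.example.org/sessions/whoami") // placeholder host
        .set("Cookie", session_cookie)
        .call()?;
    Ok(response.into_string()?)
}
```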
I wish we could have gotten to a bit more confidence in our assessment, but this has been vexing us for a couple of days now. Given that, we figured it was worth posting here in case it's relevant - we'd greatly appreciate any help you can offer!
Thank you for the report! The memory leak is already, I believe, fixed on master. We'll release a patch for it soon!
no problem, and thanks @aeneasr - that's great to hear
Here are some more memory leak issues we're seeing on our end.
If you can deploy from master, the mem leak is gone there.
Which commit fixes the mem leak? Deploying from master feels like it comes with the risk of other regressions, so cherry-picking would be a safer option.
Leak should be fixed in kratos v1.0. Please reopen if this still happens 👍
Hi, I can't reopen the issue, but it is still happening.
Hey. We'd need a minimal reproducible example with Kratos 1.1 to figure out if there is a problem 👍
We are also experiencing problems in v1.1 after upgrading from v1.0, @alnr, but this is a different problem. See https://github.com/ory/kratos/issues/3923
Preflight checklist
Describe the bug
When our Kratos was deployed from this custom commit, the post-registration webhook never failed: from 12 December 2022 until 8 April, it worked on all 1461 occasions.
Since 23 April we have been using version 0.13.0, and the webhook has failed 2 times out of 60.
For more info see https://github.com/serlo/infrastructure-modules-shared/issues/43
Maybe it is a side effect of https://github.com/ory/kratos/pull/3200
Reproducing the bug
It is hard to reproduce, although some actions could be taken, like reverting to the previous behavior without a timeout. Also, does the webhook retry?
Relevant log output
Both failed attempts hit the timeout. Nevertheless, I doubt that the response from the webhook endpoint actually took that long.
Relevant configuration
See https://github.com/serlo/infrastructure-modules-shared/blob/main/kratos/values.yaml#L99-L115
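The linked values.yaml contains the actual configuration; for readers of this issue, a rough sketch of the relevant post-registration web_hook section (placeholder URL and elided body, not our literal file):

```yaml
# Rough sketch of a post-registration web_hook entry; see the linked
# values.yaml for the real configuration.
selfservice:
  flows:
    registration:
      after:
        password:
          hooks:
            - hook: web_hook
              config:
                url: https://example.org/kratos/after-registration   # placeholder
                method: POST
                body: base64://...   # jsonnet payload, elided here
                # some versions also support response.ignore to make the hook non-blocking
```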
Version
0.13.0
In which environment are you deploying?
Kubernetes with Helm