ory / kratos

The most scalable and customizable identity server on the market. Replace your homegrown solution, Auth0, Okta, or Firebase with better UX and DX. Has all the table stakes: Passkeys, Social Sign In, Multi-Factor Auth, SMS, SAML, TOTP, and more. Written in Go, cloud native, headless, API-first. Available as a service on Ory Network and for self-hosters.

Webhook failing with timeout while it never failed before #3258

Closed · hugotiburtino closed this issue 1 year ago

hugotiburtino commented 1 year ago

Preflight checklist

Describe the bug

While our Kratos deployment ran on this custom commit, the post-registration webhook never failed: from 12 December 2022 until 8 April it was triggered 1461 times and succeeded every single time.

Since 23 April we have been running version 0.13.0, and the webhook has already failed 2 times out of 60.

For more info see https://github.com/serlo/infrastructure-modules-shared/issues/43

This may be a side effect of https://github.com/ory/kratos/pull/3200.

Reproducing the bug

It is hard to reproduce, although a few things could be tried, such as reverting to the previous behavior without a timeout. Also, does the webhook retry on failure?

Relevant log output

Both failures were timeouts:

"time=2023-04-24T14:29:35Z level=error msg=Webhook request failed audience=application duration=5m0.00090309s error=map[message:context deadline exceeded] otel=map[span_id:0000000000000000 trace_id:00000000000000000000000000000000] service_name=Ory Kratos service_version=v0.13.0"

Nevertheless, I doubt that the webhook endpoint actually took that long to respond.
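
As far as I understand, "context deadline exceeded" is simply the error Go reports when a request's own deadline fires, so the log above tells us that the 5-minute client-side timeout was hit, not that the endpoint really needed 5 minutes to answer. A minimal illustration (plain Go, not Kratos code, with a deliberately tiny deadline):

    // deadline_demo.go: "context deadline exceeded" is what Go reports when a
    // client-side deadline fires, regardless of what the server would have said.
    package main

    import (
        "context"
        "fmt"
        "net/http"
        "time"
    )

    func main() {
        // Tiny deadline so the demo fails fast; the webhook request in the log
        // above hit the same kind of deadline after 5 minutes.
        ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
        defer cancel()

        req, _ := http.NewRequestWithContext(ctx, http.MethodGet, "https://example.com", nil)
        _, err := http.DefaultClient.Do(req)
        fmt.Println(err) // Get "https://example.com": context deadline exceeded
    }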

Relevant configuration

      registration:
          lifespan: 10m
          ui_url: https://serlo-staging.dev/auth/registration
          after:
            hooks:
              - hook: web_hook
                config:
                  url: https://api.serlo-staging.dev/kratos/register
                  method: POST
                  # function (ctx) { userId: ctx.identity.id }
                  body: base64://ZnVuY3Rpb24gKGN0eCkgeyB1c2VySWQ6IGN0eC5pZGVudGl0eS5pZCB9Cg==
                  auth:
                    type: api_key
                    config:
                      name: x-kratos-key
                      value: ${kratos_secret}
                      in: header

See https://github.com/serlo/infrastructure-modules-shared/blob/main/kratos/values.yaml#L99-L115
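
Side note for readers: the body value is just the Jsonnet snippet from the comment above, base64-encoded behind the base64:// prefix. A quick way to double-check what Kratos will actually evaluate, sketched with nothing but the Go standard library:

    // decode_webhook_body.go: verify what the base64-encoded webhook body contains.
    package main

    import (
        "encoding/base64"
        "fmt"
        "log"
    )

    func main() {
        // The value from the config above, with the "base64://" prefix stripped.
        const encoded = "ZnVuY3Rpb24gKGN0eCkgeyB1c2VySWQ6IGN0eC5pZGVudGl0eS5pZCB9Cg=="

        decoded, err := base64.StdEncoding.DecodeString(encoded)
        if err != nil {
            log.Fatal(err)
        }
        // Prints: function (ctx) { userId: ctx.identity.id }
        fmt.Print(string(decoded))
    }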

Version

0.13.0

In which environment are you deploying?

Kubernetes with Helm

aeneasr commented 1 year ago

@hperl @zepatrik

hugotiburtino commented 1 year ago

Increasing the memory didn't help. In fact, we got the worst result so far right after doing so: 13 failures out of 25 attempts in a single day.

aeneasr commented 1 year ago

Do you have any other information on why you are hitting a network timeout? What does your webhook target say?

hugotiburtino commented 1 year ago

Thanks for the questions.

What does your webhook target say?

We will have to add a logger to the endpoint first. After that, I will post our findings here.

hugotiburtino commented 1 year ago

It is now deployed in our production environment; let's see. I will post here as soon as I have relevant information.

hugotiburtino commented 1 year ago

So far, the webhook is not failing. To get there, we increased the memory. When I reported this issue, our Kratos instance was running with 256Mi, which had been fine before the upgrade even though it is not much memory. We then increased it to 320Mi and the problem was still there. Now we are at 400Mi and it seems to work.

No matter how much memory it has, we can still see peaks like this:

[memory usage graph]

As you can see, the curve starts climbing at 8:45 and stops at 8:50, which matches the 5-minute timeout, even though I couldn't find the webhook being triggered in the logs.

It seems there is a memory leak in Kratos. Could it be related to this timeout?

aeneasr commented 1 year ago

There is currently a memory leak in Ory Kratos. It will be resolved with the next release!

lermana commented 1 year ago

I wanted to share a recent and critical issue we've been experiencing that may be related to this memory leak.

We have a service that makes requests to the Ory Kratos sessions/whoami route. Both our service (let's call it "System") and Kratos are running on the same node in a Kubernetes cluster. We've bumped into an issue where System starts to throw what look like packet-parsing errors, and only in the context of handling the requests that come back from Kratos.
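
Stripped of our middleware, the call itself is trivial; it is roughly the following, sketched here in Go purely for illustration (System is actually written in Rust and uses ureq, and the URL and cookie value are placeholders, assuming the default Ory session cookie name):

    // whoami_sketch.go: roughly the request System makes for every call it
    // authenticates; Go is used only for illustration, the real service is Rust.
    package main

    import (
        "fmt"
        "io"
        "log"
        "net/http"
    )

    func main() {
        // Placeholder URL; in our setup this goes through the public DNS name.
        req, err := http.NewRequest(http.MethodGet, "https://kratos.example.com/sessions/whoami", nil)
        if err != nil {
            log.Fatal(err)
        }
        // Forward the caller's session cookie so Kratos can resolve the session
        // (assuming the default cookie name, ory_kratos_session).
        req.Header.Set("Cookie", "ory_kratos_session=<session cookie>")
        req.Header.Set("Accept", "application/json")

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            // In our case the failures surface here, as TLS-level errors (BadRecordMac).
            log.Fatal(err)
        }
        defer resp.Body.Close()

        body, _ := io.ReadAll(resp.Body)
        fmt.Println(resp.Status, string(body))
    }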

Here's what our System logs look like during such an error:

[2023-06-01T20:24:30Z INFO  middleware::authentication] Making request to Ory at "https://********/sessions/whoami"
[2023-06-01T20:24:30Z DEBUG ureq::stream] connecting to ********:443 at *.*.*.*:443
[2023-06-01T20:24:30Z DEBUG rustls::client::hs] Resuming session
[2023-06-01T20:24:30Z DEBUG rustls::client::hs] ALPN protocol is None
[2023-06-01T20:24:30Z DEBUG rustls::client::hs] Using ciphersuite TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
[2023-06-01T20:24:30Z DEBUG rustls::client::tls12::server_hello] Server agreed to resume
[2023-06-01T20:24:30Z WARN  rustls::conn] Sending fatal alert BadRecordMac
[2023-06-01T20:24:30Z WARN  middleware::authentication] Could not make request to Ory
[2023-06-01T20:24:30Z DEBUG ureq::stream] dropping stream: Stream(RustlsStream)

Note that we make a request over public DNS, even though Kratos and System are running on the same node, because we have only exposed the "public" Kratos API publicly.

Once this happens, all responses from Kratos to System yield the same error, and at that point we just restart System, which typically alleviates the issue right away. We've done some digging into our side of things: for instance, we added code to also make a GET request to Google every time System makes the above GET to Kratos; the requests to Google keep working ok, even when System is unable to process the requests coming back from Kratos (i.e. System is definitely able, even in these moments, to handle at least some HTTP responses). Note that, even when the Kratos<>System communication is failing, we can engage Kratos from other clients.

Here are some logs from Kratos during one of these erroneous requests -- everything looks fine, which is weird!

time=2023-06-01T16:26:17Z
level=info
msg=completed handling request
http_request=map[headers:map[accept:application/json, text/plain, */* cookie:Value is sensitive and has been redacted. To see the value set config key "log.leak_sensitive_values = true" or environment variable "LOG_LEAK_SENSITIVE_VALUES=true". user-agent:axios/*.*.* x-forwarded-for:*.*.*.* x-forwarded-host:******** x-forwarded-port:80 x-forwarded-proto:http x-forwarded-scheme:http x-real-ip:*.*.*.* x-request-id:******** x-scheme:http] host:******** method:GET path:/sessions/whoami query:<nil> remote:*.*.*.*:58522 scheme:http] http_response=map[headers:map[cache-control:private, no-cache, no-store, must-revalidate content-type:application/json; charset=utf-8 vary:Origin x-kratos-authenticated-identity-id:********] size:1446 status:200 text_status:OK took:40.459424ms]

A few things are making us think there could be something going wrong in Kratos in these moments, including:

  1. as mentioned, System can make external requests elsewhere just fine, even while it can't handle the Kratos responses
  2. we've noticed that Kratos is hogging A LOT of memory on the node -- for context, System (which does some complex work around access control and encryption) is taking about 800MB, while Kratos at last check was taking 6GB
  3. we can force the above Sending fatal alert BadRecordMac error by purposefully memory-constraining the node that System and Kratos run on
  4. System is written in Rust and the HTTP request to Kratos is happening entirely within a single function call, i.e. new connection every call / very simple setup; while bugs are always possible, Rust has strict destructor implementations and we're leveraging a widely used HTTP library to make these calls. We have considered a DNS-caching or dangling-connection issue on our end (b/c there is some weird, ephemeral cloud-network stuff that happens in this cluster) but we are feeling at least somewhat confident that System is not the cause of the issue.

Thinking about point 4 -- particularly that System opens a new HTTP connection on every request to Kratos -- we're wondering whether it's possible that Kratos starts leaking connections when it is memory-constrained. To express the idea in a bit more detail: maybe when Kratos is under memory duress it stops closing certain connections properly, which eventually leaves it with too many open connections to a particular client; at that point, responses to that client start going out garbled.
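
To make point 4 concrete, here is what the two client patterns look like, sketched in Go purely for illustration (again, System itself is Rust + ureq, and the URL is a placeholder). Today System effectively uses the first pattern, a throwaway client and connection per call; a shared, pooling client would be the alternative:

    // conn_patterns.go: the two client patterns discussed in point 4, in Go for
    // illustration only.
    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    // perCallClient mimics what System does today: each whoami check gets its own
    // client and therefore its own TCP/TLS connection, torn down after the call.
    func perCallClient() *http.Client {
        return &http.Client{
            Transport: &http.Transport{DisableKeepAlives: true},
            Timeout:   10 * time.Second,
        }
    }

    // pooledClient is the alternative: one shared client whose transport keeps
    // idle connections to Kratos around and reuses them across calls.
    var pooledClient = &http.Client{
        Transport: &http.Transport{
            MaxIdleConnsPerHost: 10,
            IdleConnTimeout:     90 * time.Second,
        },
        Timeout: 10 * time.Second,
    }

    func main() {
        const url = "https://kratos.example.com/sessions/whoami" // placeholder

        if resp, err := perCallClient().Get(url); err == nil {
            resp.Body.Close()
        }
        if resp, err := pooledClient.Get(url); err == nil {
            resp.Body.Close()
        }
        fmt.Println("issued one request with each client pattern")
    }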

I wish we could have gotten to a bit more confidence in our assessment, but this has been vexing us for a couple of days now. Given that, we figured it was worth posting here in case it's relevant. We would greatly appreciate any help you can offer!

aeneasr commented 1 year ago

Thank you for the report! The memory leak is already, I believe, fixed on master. We'll release a patch for it soon!

lermana commented 1 year ago

no problem, and thanks @aeneasr - that's great to hear

nicolasburtey commented 1 year ago

Some more memory leak issues we're seeing on our end:

[memory usage graph]

aeneasr commented 1 year ago

If you can deploy from master, the mem leak is gone there.

nicolasburtey commented 1 year ago

Which commit fixes the memory leak?

Deploying from master feels like it comes with the risk of other regressions, so cherry-picking would be the safer option.

alnr commented 1 year ago

The leak should be fixed in Kratos v1.0. Please reopen if this still happens 👍

hugotiburtino commented 7 months ago

Hi, I can't reopen the issue, but it is still happening.

alnr commented 7 months ago

Hey. We'd need a minimal reproducible example with Kratos 1.1 to figure out if there is a problem 👍

Sytten commented 6 months ago

We are also experiencing problems on v1.1 after upgrading from v1, @alnr, but it is a different problem. See https://github.com/ory/kratos/issues/3923