ory / kratos

The most scalable and customizable identity server on the market. Replace your Homegrown, Auth0, Okta, Firebase with better UX and DX. Has all the tablestakes: Passkeys, Social Sign In, Multi-Factor Auth, SMS, SAML, TOTP, and more. Written in Go, cloud native, headless, API-first. Available as a service on Ory Network and for self-hosters.
https://www.ory.sh/?utm_source=github&utm_medium=banner&utm_campaign=kratos
Apache License 2.0

Secure Jsonnet Pool context deadline exceeded #3923

Closed Sytten closed 3 months ago

Sytten commented 4 months ago

Preflight checklist

Ory Network Project

No response

Describe the bug

We are trying to upgrade from commit https://github.com/ory/kratos/tree/0c5ea9bf735a67ef35011ba41d7f98afc6f8e118 to v1.1.

We get this error:

time=2024-05-15T17:00:51Z level=error msg=ExecutePostRegistrationPostPersistHook hook failed with an error. audience=application error=map[message:jsonnetsecure: eval: context deadline exceeded] executor=*hook.WebHook executor_position=1 executors=[*hook.Verifier *hook.WebHook] flow_method=password

The timeout is odd because it fires after a bit more than one second, and our jsonnet is very simple:

function(ctx) { identity: ctx.identity }

It didn't fail locally on macOS, but it failed on our cloud provider, which suggests there is something weird going on with the new pool/VM system that was introduced in https://github.com/caido/kratos/compare/master...ory%3Akratos%3Amaster#diff-ab967ab1a2f3a1b769106eeb7bfe892ef0e81d1d27811fa15be08e6749feee1fR58-R77

The cloud machine is small, but I would not expect the jsonnet to take 1s to resolve. So maybe the timeout is warranted but somehow the pool is stuck.
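For reference, here is a rough sketch of how cheap that snippet is to evaluate when run in-process with github.com/google/go-jsonnet (this is not the path kratos actually takes, and the ctx payload below is made up); the point is just that the program itself should need nowhere near 1s:

```go
// Hedged illustration only: evaluate the webhook snippet in-process and time
// it. Kratos itself runs jsonnet through the jsonnetsecure process pool, not
// like this; this only shows the program is trivial to evaluate.
package main

import (
	"fmt"
	"time"

	"github.com/google/go-jsonnet"
)

func main() {
	vm := jsonnet.MakeVM()
	snippet := `function(ctx) { identity: ctx.identity }`
	// Hypothetical ctx payload, passed as a top-level argument.
	vm.TLACode("ctx", `{ identity: { id: "00000000-0000-0000-0000-000000000000" } }`)

	start := time.Now()
	out, err := vm.EvaluateAnonymousSnippet("webhook.jsonnet", snippet)
	if err != nil {
		panic(err)
	}
	fmt.Printf("evaluated in %s:\n%s\n", time.Since(start), out)
}
```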

Reproducing the bug

  1. Use Docker on Linux
  2. Set up kratos with a registration webhook using function(ctx) { identity: ctx.identity }
  3. Try to register a new user

Relevant log output

time=2024-05-15T17:00:51Z level=error msg=ExecutePostRegistrationPostPersistHook hook failed with an error. audience=application error=map[message:jsonnetsecure: eval: context deadline exceeded] executor=*hook.WebHook executor_position=1 executors=[*hook.Verifier *hook.WebHook] flow_method=password

Relevant configuration

No response

Version

v1.1

On which operating system are you observing this issue?

Linux

In which environment are you deploying?

Docker

Additional Context

No response

Sytten commented 4 months ago

So this 1s timeout seems to be hardcoded: https://github.com/ory/x/pull/755/files#diff-e8a7ac45033a2903ea00a56a5e7957e57ea7b60a3e9e7d4f23666f375fe2eec5R185
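Roughly the failure mode as I understand it (a simplified sketch, not the actual ory/x jsonnetsecure code; the worker invocation and wiring here are assumptions):

```go
// Simplified stand-in for a hardcoded per-evaluation deadline around a
// subprocess worker. The real ory/x jsonnetsecure implementation is
// structured differently; this only shows how a starved worker surfaces
// as "context deadline exceeded".
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func evalViaWorker(parent context.Context, snippet string) (string, error) {
	// Hypothetical hardcoded 1s deadline, mirroring the value linked above.
	ctx, cancel := context.WithTimeout(parent, time.Second)
	defer cancel()

	// Hypothetical worker invocation. On a CPU-starved container the worker
	// may not be scheduled (or the pool may not hand one out) in time.
	cmd := exec.CommandContext(ctx, "kratos", "jsonnet", "-0")
	cmd.Stdin = strings.NewReader(snippet)

	out, err := cmd.Output()
	if errors.Is(ctx.Err(), context.DeadlineExceeded) {
		// Bubbles up as: jsonnetsecure: eval: context deadline exceeded
		return "", fmt.Errorf("jsonnetsecure: eval: %w", ctx.Err())
	}
	return string(out), err
}

func main() {
	out, err := evalViaWorker(context.Background(), `function(ctx) { identity: ctx.identity }`)
	fmt.Println(out, err)
}
```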

Now, why the machine takes 1s to process it is kind of the question. Looking at my logs from the previous version, the whole thing doesn't take 1s to process:

time=2024-05-15T17:54:39Z level=info msg=A new identity has registered using self-service registration. audience=audit 
time=2024-05-15T17:54:39Z level=info msg=Dispatching webhook audience=application host:cloud-backend:6000 method:POST path:/webhook/kratos/registration query:<nil> remote: scheme:http]
time=2024-05-15T17:54:39Z level=info msg=[DEBUG] POST http://cloud-backend:6000/webhook/kratos/registration audience=application
time=2024-05-15T17:54:39Z level=info msg=completed handling request size:1043 status:200 text_status:OK took:575.330798ms]

Sytten commented 4 months ago

I think it would be nice to have the option to disable the pool and revert to the older in-process VM. @alnr, if you agree, I am happy to push a PR to add that setting.
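Roughly what I have in mind, purely as a sketch; none of these names exist in kratos or ory/x today:

```go
// Purely illustrative sketch of the proposed setting. The names
// (JsonnetPoolDisabled, newPerCallVM, newPooledVM, Evaluator) are made up
// for this comment and are not the existing kratos/ory/x API.
package config

import "context"

// Evaluator stands in for whatever interface the jsonnet VMs satisfy.
type Evaluator interface {
	EvaluateAnonymousSnippet(ctx context.Context, filename, snippet string) (string, error)
}

type Config struct {
	// JsonnetPoolDisabled would fall back to the pre-pool behaviour.
	JsonnetPoolDisabled bool
}

func (c *Config) JsonnetVM() Evaluator {
	if c.JsonnetPoolDisabled {
		return newPerCallVM() // old behaviour: a fresh VM/process per evaluation
	}
	return newPooledVM() // current behaviour: shared worker pool with the 1s deadline
}

func newPerCallVM() Evaluator { return nil /* hypothetical */ }
func newPooledVM() Evaluator  { return nil /* hypothetical */ }
```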

Sytten commented 4 months ago

It's weird, because before it was still spawning a new process per execution, so in theory a pool should only improve performance, and it discards the idea that the platform is somehow unable to spawn processes. So this leaves a couple of ideas:

I think that option 3 is likely since this was happening in staging without any load (only one user trying to register). I am not sure how to get logs for this worker process though.

alnr commented 4 months ago

Not sure the process pool is actually active in the OSS release. Are you seeing persistent kratos jsonnet -0 processes?

alnr commented 4 months ago

We've been running the process pool in production for a while without any regressions but with 100x better performance.

alnr commented 4 months ago

Are you running your process with GOMAXPROCS=1 or similar perhaps?

Sytten commented 4 months ago

Not sure the process pool is actually active in the OSS release. Are you seeing persistent kratos jsonnet -0 processes?

I reverted, but I will take a look. I am sure it is active, since the "spawn process every time" option doesn't have a timeout AFAIK.

We've been running the process pool in production for a while without any regressions but with 100x better performance.

Interesting, it was very easy to trigger for us, but it's a super small machine since it's a staging thing. What I am not sure about is why it would work with a spawned process but hang for 1s with the pool. Unless before it was running in-process?

Are you running your process with GOMAXPROCS=1 or similar perhaps?

No, but it's a 0.25 CPU machine on render.com, so similar outcome I am guessing.
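For what it's worth, a quick way to check what the Go runtime actually sees inside the container; GOMAXPROCS defaults to runtime.NumCPU(), which reflects the host's logical cores and ignores cgroup CPU quotas like that 0.25 limit, so the runtime may assume far more parallelism than it actually gets:

```go
// Print what the Go runtime sees. On a fractional-CPU container, NumCPU
// typically still reports the host's core count, so GOMAXPROCS can be much
// higher than the CPU time the cgroup quota actually allows.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println("NumCPU:    ", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // 0 queries without changing it
}
```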