Closed: Sytten closed this issue 3 months ago
So this 1s timeout seems hardcoded: https://github.com/ory/x/pull/755/files#diff-e8a7ac45033a2903ea00a56a5e7957e57ea7b60a3e9e7d4f23666f375fe2eec5R185
Now, why the machine takes 1s to process it is kind of the question. Looking at my logs from the previous version, it doesn't take 1s to process the whole thing:
```
time=2024-05-15T17:54:39Z level=info msg=A new identity has registered using self-service registration. audience=audit
time=2024-05-15T17:54:39Z level=info msg=Dispatching webhook audience=application [host:cloud-backend:6000 method:POST path:/webhook/kratos/registration query:<nil> remote: scheme:http]
time=2024-05-15T17:54:39Z level=info msg=[DEBUG] POST http://cloud-backend:6000/webhook/kratos/registration audience=application
time=2024-05-15T17:54:39Z level=info msg=completed handling request [size:1043 status:200 text_status:OK took:575.330798ms]
```
I think it would be nice to have the option to disable the pool and revert to the older in-process VM. @alnr, if you agree, I am happy to push a PR to add that setting.
It's weird, because before it was still spawning a new process per execution, so in theory a pool should only improve performance; it also discards the idea that the platform is somehow unable to spawn processes. So this leaves a couple of ideas:
1. The timeout includes the call to `Acquire` (https://github.com/ory/x/pull/755/files#diff-e8a7ac45033a2903ea00a56a5e7957e57ea7b60a3e9e7d4f23666f375fe2eec5R198), which means that if the pool is already in use for some reason, it is very easy to go over the timeout (say one render takes 200ms and you have only one worker: it only takes 5 queued requests to exceed the timeout).
2. `Acquire` is not fair, which happens often in pools that are not properly implemented; some tasks end up waiting much longer than others.

I think that option 3 is likely, since this was happening in staging without any load (only one user trying to register). I am not sure how to get logs for this worker process, though.
Not sure the process pool is actually active in the OSS release. Are you seeing persistent `kratos jsonnet -0` processes?
We've been running the process pool in production for a while without any regressions but with 100x better performance.
Are you running your process with `GOMAXPROCS=1` or similar, perhaps?
> Not sure the process pool is actually active in the OSS release. Are you seeing persistent `kratos jsonnet -0` processes?
I reverted, but I will take a look. I am sure it is active, since the "spawn a process every time" option doesn't have a timeout AFAIK.
> We've been running the process pool in production for a while without any regressions but with 100x better performance.
Interesting. It was very easy to trigger for us, but it's a very small machine since it's a staging environment. What I am not sure about is why it would work with a spawned process but hang for 1s with the pool. Unless before it was running in-process?
> Are you running your process with `GOMAXPROCS=1` or similar perhaps?
No, but it's a 0.25 CPU machine on render.com, so a similar outcome, I am guessing.
Preflight checklist
Ory Network Project
No response
Describe the bug
We are trying to upgrade from commit https://github.com/ory/kratos/tree/0c5ea9bf735a67ef35011ba41d7f98afc6f8e118 to v1.1.
We get this error:
The timeout is weird because it times out after a bit more than one second, and our jsonnet is very simple:
It didn't fail locally on macOS, but it failed on our cloud provider, which suggests there is something weird going on with the new pool/VM system that was introduced in https://github.com/caido/kratos/compare/master...ory%3Akratos%3Amaster#diff-ab967ab1a2f3a1b769106eeb7bfe892ef0e81d1d27811fa15be08e6749feee1fR58-R77
The cloud machine is small, but I would not expect the jsonnet to take 1s to resolve. So maybe the timeout is warranted, but somehow the pool is stuck.
Reproducing the bug
```jsonnet
function(ctx) { identity: ctx.identity }
```
Relevant log output
Relevant configuration
No response
Version
v1.1
On which operating system are you observing this issue?
Linux
In which environment are you deploying?
Docker
Additional Context
No response