Closed: dev-boyenn closed this issue 9 months ago
Hi @dev-boyenn, thanks for reporting this. We will make this configurable. In the meantime, you can check here: https://github.com/orkes-io/orkes-conductor-community/blob/60325ef7b196a96d1062ddfecf924c4be7866309/server/src/main/java/io/orkes/conductor/server/service/OrkesWorkflowSweeper.java#L188 You can set the workflow offset timeout to a higher value so that workflows won't be swept as frequently.
Hi @manan164, thanks for the quick response!
What does the workflow offset timeout actually mean?
I tried setting conductor.app.workflowOffsetTimeout=300 as a quick test, and I'm not seeing any considerable reduction in how often workflows are being swept.
Hi @dev-boyenn , Are all the running workflows being updated in intervals of 60 milliseconds? Let's chat here in real-time
Is there a pull request to change this to 60_000L?
I found the problem: the high CPU usage is not caused by Redis. It happens because the sweeper uses a while (true) {...} loop, so it runs constantly even when the @Component is not started. I will raise a PR for this.
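For readers following along, here is a minimal sketch of one common way to keep a polling loop from spinning: block on a queue with a timeout and check a stop flag on every pass. It is not the fix from the PR mentioned above and not the actual OrkesWorkflowSweeper code; the class name, the in-memory queue, and the 100 ms timeout are all assumptions made for illustration.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: none of these names come from Conductor.
final class PollingSweeperSketch implements Runnable {
    private final BlockingQueue<String> workflowIds = new LinkedBlockingQueue<>();
    private final AtomicBoolean running = new AtomicBoolean(true);
    private final AtomicInteger swept = new AtomicInteger();

    void submit(String workflowId) { workflowIds.add(workflowId); }
    void stop() { running.set(false); }
    int sweptCount() { return swept.get(); }

    @Override
    public void run() {
        // The flag replaces while (true), so the loop can be shut down cleanly.
        while (running.get()) {
            try {
                // poll() parks the thread for up to 100 ms when the queue is
                // empty instead of burning CPU in a tight loop.
                String id = workflowIds.poll(100, TimeUnit.MILLISECONDS);
                if (id != null) {
                    swept.incrementAndGet(); // stand-in for the real sweep work
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    // Submits two ids, waits until both are processed, then stops the worker.
    static int runDemo() {
        PollingSweeperSketch sweeper = new PollingSweeperSketch();
        sweeper.submit("wf-1");
        sweeper.submit("wf-2");
        Thread worker = new Thread(sweeper, "sweeper-sketch");
        worker.start();
        try {
            while (sweeper.sweptCount() < 2) {
                Thread.sleep(10);
            }
            sweeper.stop();
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sweeper.sweptCount();
    }

    public static void main(String[] args) {
        System.out.println("swept: " + runDemo());
    }
}
```

With an empty queue, this loop wakes roughly ten times per second rather than spinning continuously, and it exits within one timeout period after stop() is called.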
In our production use case, we often have long-running workflows that wait on human tasks. Because we want to track human tasks in our own back-office systems, we created a subworkflow that creates and tracks human tasks for us and ends with a HUMAN task in Conductor. We noticed an absurd load on Redis, even when every workflow that has not yet completed is idling on a subworkflow that is itself idling on a HUMAN task. Looking into it further, we noticed that our logs are being spammed with
INFO [sweeper-thread-1] io.orkes.conductor.server.service.OrkesWorkflowSweeper: Running sweeper for workflow ***
The sweeper constantly fetches each workflow and its tasks, and there currently seems to be no way to slow this process down. The code and its comment contradict each other here: https://github.com/orkes-io/orkes-conductor-community/blob/60325ef7b196a96d1062ddfecf924c4be7866309/server/src/main/java/io/orkes/conductor/server/service/OrkesWorkflowSweeper.java#L152C4-L152C4 (the comment says 60 seconds, the code uses 60 milliseconds). I'm worried that a mistake was made in the implementation of the sweeper service and that workflows are being checked far more often than they should be.
I believe this to be a root cause of our production systems failing under relatively light load. Is there any way to slow down the sweeper without disabling it completely, or does a bug need to be fixed?
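To put the suspected 60 seconds versus 60 milliseconds mismatch in numbers, here is a small standalone sketch. It does not reproduce the sweeper's code; the class and constant names are hypothetical, and it assumes a single workflow that is re-checked once per interval. It also shows how carrying the unit in a Duration makes this kind of slip harder to miss.

```java
import java.time.Duration;

// Hypothetical illustration: not taken from OrkesWorkflowSweeper.
final class SweepIntervalDemo {
    // A bare literal: nothing in the type says whether this means ms or s.
    static final long AMBIGUOUS_INTERVAL = 60L;

    // The unit is part of the value, so code and comment cannot disagree.
    static final Duration SWEEP_INTERVAL = Duration.ofSeconds(60);

    // How many times one workflow is checked per hour at a given interval.
    static long sweepsPerHour(Duration interval) {
        return Duration.ofHours(1).toMillis() / interval.toMillis();
    }

    public static void main(String[] args) {
        // If the bare 60 is read as milliseconds:
        System.out.println(sweepsPerHour(Duration.ofMillis(AMBIGUOUS_INTERVAL))); // 60000
        // If the interval is 60 seconds, as the comment describes:
        System.out.println(sweepsPerHour(SWEEP_INTERVAL)); // 60
    }
}
```

Under these assumptions, the millisecond reading produces a thousand times as many checks per workflow, which would be consistent with the log spam and Redis load described above.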