When it doesn't succeed, this test causes a hang for 60 seconds, after which I see Poll wft timeout printed three times in console, and then a WFT get polled and executed. Describing the TQ while the Worker is stuck shows no poller.
I believe this is due to the fact that the Resource Based Tuner is not aware of sticky vs non-sticky pollers distribution, and therefore, that it is possible to get in a situation where only sticky pollers receives permits at start. Given the default sticky to non-sticky ratio of 0.2, and three pollers (not sure why three, isn't the default two?), we get a probability of polling at least one task from the non-sticky TQ of 48.8% (i.e.: (80%^0 * 20%) + (80%^1 * 20%) + (80%^2 * 20%)), matching the observations.
Any of the following changes fixes (or at least improve) the test:
Increasing minimumSlots on workflowTaskSlotOptions;
Setting maxCachedWorkflows to 0;
Setting nonStickyToStickyPollRatio to 0.9;
Removing tunner from the Worker options.
Side note
We should maybe consider never to allow more permits for sticky TQ than there are cached workflow executions. In general production environment, that should not make a difference, but in tests and at start time, that might help workers get up to speed more quickly.
Describe the bug
This test from the TS SDK repo is very flaky (success rate of approx 50%):
When it doesn't succeed, this test causes a hang for 60 seconds, after which I see
Poll wft timeout
printed three times in console, and then a WFT get polled and executed. Describing the TQ while the Worker is stuck shows no poller.I believe this is due to the fact that the Resource Based Tuner is not aware of sticky vs non-sticky pollers distribution, and therefore, that it is possible to get in a situation where only sticky pollers receives permits at start. Given the default sticky to non-sticky ratio of 0.2, and three pollers (not sure why three, isn't the default two?), we get a probability of polling at least one task from the non-sticky TQ of 48.8% (i.e.:
(80%^0 * 20%) + (80%^1 * 20%) + (80%^2 * 20%)
), matching the observations.Any of the following changes fixes (or at least improve) the test:
minimumSlots
onworkflowTaskSlotOptions
;maxCachedWorkflows
to 0;nonStickyToStickyPollRatio
to0.9
;tunner
from the Worker options.Side note
We should maybe consider never to allow more permits for sticky TQ than there are cached workflow executions. In general production environment, that should not make a difference, but in tests and at start time, that might help workers get up to speed more quickly.