temporalio / sdk-core

Core Temporal SDK that can be used as a base for language specific Temporal SDKs
MIT License
266 stars 70 forks source link

[Bug] Resource Based Tuner is flaky - Does Tuner properly deal with sticky vs non-sticky pollers? #775

Open mjameswh opened 2 months ago

mjameswh commented 2 months ago

Describe the bug

This test from the TS SDK repo is very flaky (success rate of approx 50%):

test('Worker can run with resource based tuner', async (t) => {
  const { createWorker, executeWorkflow } = helpers(t);
  const resourceBasedTunerOptions: ResourceBasedTunerOptions = {
    targetCpuUsage: 0.6,
    targetMemoryUsage: 0.6,
  };
  const worker = await createWorker({
    tuner: {
      tunerOptions: resourceBasedTunerOptions,
      activityTaskSlotOptions: {
        minimumSlots: 2,
        maximumSlots: 10,
        rampThrottle: 20,
      },
    },
  });
  const result = await worker.runUntil(executeWorkflow(successString));
  t.is(result, 'success');
});

When it doesn't succeed, this test causes a hang for 60 seconds, after which I see Poll wft timeout printed three times in console, and then a WFT get polled and executed. Describing the TQ while the Worker is stuck shows no poller.

I believe this is due to the fact that the Resource Based Tuner is not aware of sticky vs non-sticky pollers distribution, and therefore, that it is possible to get in a situation where only sticky pollers receives permits at start. Given the default sticky to non-sticky ratio of 0.2, and three pollers (not sure why three, isn't the default two?), we get a probability of polling at least one task from the non-sticky TQ of 48.8% (i.e.: (80%^0 * 20%) + (80%^1 * 20%) + (80%^2 * 20%)), matching the observations.

Any of the following changes fixes (or at least improve) the test:

Side note

We should maybe consider never to allow more permits for sticky TQ than there are cached workflow executions. In general production environment, that should not make a difference, but in tests and at start time, that might help workers get up to speed more quickly.

Sushisource commented 2 months ago

Thanks @mjameswh , very useful info. Will incorporate as soon as I get a chance.