temporalio / sdk-java

Temporal Java SDK
https://temporal.io
Apache License 2.0
211 stars 142 forks source link

Querying a running workflow makes it spam previously logged messages #2076

Closed rorueda closed 4 months ago

rorueda commented 4 months ago

Expected Behavior

Querying a workflow doesn't generate additional logs.

Actual Behavior

Querying a workflow generates additional logs.

Steps to Reproduce the Problem

I added an additional test to DirectQueryReplaysDontSpamLogWithWorkflowExecutionExceptionsTest.java to reproduce the issue.

@Rule
public SDKTestWorkflowRule testWorkflowRule =
    SDKTestWorkflowRule.newBuilder()
        .setWorkflowTypes(TestWorkflowNonRetryableFlag.class, LogAndKeepRunningWorkflow.class)
        .setActivityImplementations(new TestActivities.TestActivitiesImpl())
        .build();

@Test
public void queriedWorkflowFailureDoesntProduceAdditionalLogsWhenWorkflowIsNotCompleted()
    throws InterruptedException {
  TestWorkflows.QueryableWorkflow workflow =
      testWorkflowRule.newWorkflowStub(TestWorkflows.QueryableWorkflow.class);

  WorkflowExecution execution = WorkflowClient.start(workflow::execute);
  Thread.sleep(500);
  assertEquals(
      "Workflow execution exception should be logged",
      1,
      workflowExecuteRunnableLoggerAppender.list.size());

  TestWorkflows.QueryableWorkflow queryStub =
      testWorkflowRule
          .getWorkflowClient()
          .newWorkflowStub(TestWorkflows.QueryableWorkflow.class, execution.getWorkflowId());
  assertEquals("my-state", queryStub.getState());
  assertEquals("There was only one execution.", 1, workflowCodeExecutionCount.get());
  assertEquals(
      "Only the original exception should be logged.",
      1,
      workflowExecuteRunnableLoggerAppender.list.size());

  testWorkflowRule.invalidateWorkflowCache();
  assertEquals("my-state", queryStub.getState());
  assertEquals(
      "There was two executions - one original and one full replay for query.",
      2,
      workflowCodeExecutionCount.get());
  assertEquals(
      "Only the original exception should be logged.",
      1,
      workflowExecuteRunnableLoggerAppender.list.size());

  queryStub.mySignal("exit");
  assertEquals("my-state", queryStub.getState());
  assertEquals(
      "There was three executions - one original and two full replays for query.",
      3,
      workflowCodeExecutionCount.get());
  assertEquals(
      "Only the original exception should be logged.",
      1,
      workflowExecuteRunnableLoggerAppender.list.size());
}

public static class LogAndKeepRunningWorkflow implements TestWorkflows.QueryableWorkflow {
  private final org.slf4j.Logger logger;
  private final TestActivities.VariousTestActivities activities;
  private boolean exit;

  public LogAndKeepRunningWorkflow() {
    activities =
        Workflow.newActivityStub(
            TestActivities.VariousTestActivities.class,
            ActivityOptions.newBuilder()
                .setStartToCloseTimeout(Duration.ofSeconds(10))
                .setRetryOptions(RetryOptions.newBuilder().setMaximumAttempts(1).build())
                .build());
    logger = Workflow.getLogger("io.temporal.internal.sync.WorkflowExecutionHandler");
  }

  @Override
  public String execute() {
    workflowCodeExecutionCount.incrementAndGet();
    while (true) {
      try {
        activities.throwIO();
      } catch (ActivityFailure e) {
        logger.error("Unexpected error on activity", e);
        Workflow.await(() -> exit);
        if (exit) {
          return "exit";
        }
      }
    }
  }

  @Override
  public String getState() {
    return "my-state";
  }

  @Override
  public void mySignal(String value) {
    exit = true;
  }
}

I found that the logs are generated because replaying is being set to false here: https://github.com/temporalio/sdk-java/blob/5e5cf0bf2db695cba46d1855959e02b4cdfa3cff/temporal-sdk/src/main/java/io/temporal/internal/statemachines/WorkflowStateMachines.java#L456-L462

I saw that there were changes to those lines recently, but I've tested with 1.23.1, 1.23.2 and master and the issue is present in all of them. I first noticed the issue in production when upgrading from 1.19.1 to 1.23.1.

Specifications

Quinn-With-Two-Ns commented 4 months ago

This is already fixed by https://github.com/temporalio/sdk-java/commit/a41c64ece3876f584ebde70903668b817c5455d8 and will be included in the next SDK release

rorueda commented 4 months ago

I can't rule out my test being wrong, but I have just tested on top of the current last commit https://github.com/temporalio/sdk-java/commit/5e5cf0bf2db695cba46d1855959e02b4cdfa3cff and it still fails. Any idea why?

Quinn-With-Two-Ns commented 4 months ago

hm let me try to get your test running on my machine