uber / cadence-java-client

Java framework for Cadence Workflow Service
https://cadenceworkflow.io
Other
143 stars 106 forks source link

Activity hanging in case of dataconverter failed #375

Closed szaluzhskiy closed 5 years ago

szaluzhskiy commented 5 years ago

I've faced with strange behaviour. Custom data converter is used. And this converter throws exception when deserializing data between workflow and activity. Then workflow stucked in blocked state and waiting for "nothing".

Exception I see in cadence-web UI

workflow-root: (BLOCKED on Feature.get)
com.uber.cadence.internal.sync.CompletablePromiseImpl.get(CompletablePromiseImpl.java:71)
com.uber.cadence.internal.sync.ChildWorkflowStubImpl.execute(ChildWorkflowStubImpl.java:77)
com.uber.cadence.internal.sync.ChildWorkflowInvocationHandler.invoke(ChildWorkflowInvocationHandler.java:73)
com.sun.proxy.$Proxy225.process(Unknown Source)
com.example.common.claim.service.impl.FbClaimServiceImpl.executeWorkflow(FbClaimServiceImpl.java:46)
com.example.common.claim.service.impl.FbClaimServiceImpl.handle(FbClaimServiceImpl.java:36)
com.example.common.claim.config.cadence.ProcessCommonClaimWorkflowImpl.process(ProcessCommonClaimWorkflowImpl.java:22)
sun.reflect.GeneratedMethodAccessor176.invoke(Unknown Source)

Exception in worker

2019.09.04 14:34:56.387 [null PROPERTY_CLAIM : 27] ERRORc.u.c.i.w.PollerOptions-uncaught exception
java.lang.RuntimeException: Failure processing activity task. WorkflowID=6cce066f-e908-3491-85eb-b65b2543ea7c, RunID=28bac3e9-0b82-493c-9ace-41ecde666e62, ActivityType=CommonClaimActivities::processClaim, ActivityID=0
    at com.uber.cadence.internal.worker.ActivityWorker$TaskHandlerImpl.wrapFailure(ActivityWorker.java:218)
    at com.uber.cadence.internal.worker.ActivityWorker$TaskHandlerImpl.wrapFailure(ActivityWorker.java:158)
    at com.uber.cadence.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:76)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.uber.cadence.converter.DataConverterException: when parsing:"{"mismatches":[],"accessToken":null,"insurerId":"9999","requestDate":null,"requestId":"c29b3663-5573-41d3-a87d-38f0e6c96581","contractId":"124361","claimInfo":{"insuranceType":"PROPERTY","documentType":"PAYMENT","operationType":"CREATION","insurerClaimId"" into following types: [class com.example.api.common_claim.model.TransientCommonClaim]
    at com.uber.cadence.converter.JsonDataConverter.fromDataArray(JsonDataConverter.java:209)
    at com.uber.cadence.internal.sync.POJOActivityTaskHandler$POJOActivityImplementation.execute(POJOActivityTaskHandler.java:187)
    at com.uber.cadence.internal.sync.POJOActivityTaskHandler.handle(POJOActivityTaskHandler.java:169)
    at com.uber.cadence.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:183)
    at com.uber.cadence.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:158)
    at com.uber.cadence.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:71)
    ... 3 common frames omitted
Caused by: java.lang.UnsupportedOperationException: JsonObject
    at com.google.gson.JsonElement.getAsString(JsonElement.java:192)
    at com.example.util.gsonconverter.LocalDateConverter.deserialize(LocalDateConverter.java:61)
    at com.example
.util.gsonconverter.LocalDateConverter.deserialize(LocalDateConverter.java:18)
    at com.google.gson.internal.bind.TreeTypeAdapter.read(TreeTypeAdapter.java:69)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:131)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:222)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:131)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:222)
    at com.google.gson.Gson.fromJson(Gson.java:927)
    at com.google.gson.Gson.fromJson(Gson.java:994)
    at com.google.gson.Gson.fromJson(Gson.java:967)
    at com.example.cadence.adapter.ClaimInfoDeserializer.getClaimBody(ClaimInfoDeserializer.java:41)
    at com.example.cadence.adapter.ClaimInfoDeserializer.deserialize(ClaimInfoDeserializer.java:29)
    at com.example.cadence.adapter.ClaimInfoDeserializer.deserialize(ClaimInfoDeserializer.java:15)
    at com.google.gson.internal.bind.TreeTypeAdapter.read(TreeTypeAdapter.java:69)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:131)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:222)
    at com.google.gson.Gson.fromJson(Gson.java:927)
    at com.google.gson.Gson.fromJson(Gson.java:892)
    at com.google.gson.Gson.fromJson(Gson.java:841)
    at com.uber.cadence.converter.JsonDataConverter.fromDataArray(JsonDataConverter.java:178)
    ... 8 common frames omitted
mfateev commented 5 years ago

The first trace shows that the workflow is waiting on a child workflow. I'm not sure how it is related to an activity invocation. The failure trace shows that the activity execution is failing due to the exception thrown by the data converter. Do you have a retry policy specified for the activity? The this activity would be retried up to the retry policy expiration. Have you looked at the trace/history of the child workflow that supposedly invoked the failing activity?

kalibek commented 5 years ago

Hi @mfateev, we have retry policy and execution timeout set. But the thing is that workflow is set to 'blocked' state when it is clearly failed. So we are basically asking for fail fast option for exceptions on data converter. Here's the minimal reproducible project https://github.com/kalibek/cadence-deserialization-bug

mfateev commented 5 years ago

Specify DataConverterExceptions as non retryable in your activity retry options. Otherwise the activity is going to be retried up to the retry policy expiration interval.

DmitriyLamzin commented 5 years ago

I tried your suggestion but this is not the issue. The issue is that the child workflow doesn't respond in this case to the parent one. And while the child workflow is ready to accept new data the parent workflow hangs on waiting response.

The thrown Exception handled in the PollTaskExecutor

with the code snippet:

@Override
  public void process(T task) {
    taskExecutor.execute(
        () -> {
          MDC.put(LoggerTag.DOMAIN, domain);
          MDC.put(LoggerTag.TASK_LIST, taskList);
          try {
            handler.handle(task);
          } catch (Throwable ee) {
            options
                .getPollerOptions()
                .getUncaughtExceptionHandler()
                .uncaughtException(Thread.currentThread(), handler.wrapFailure(task, ee)); <----
          } finally {
            MDC.remove(LoggerTag.DOMAIN);
            MDC.remove(LoggerTag.TASK_LIST);
          }
        });
  }

And that's it. No answer goes to the parent flow. And it waits for timeout expiration.

Even if I set custom exception handler I can do little about it.

DmitriyLamzin commented 5 years ago

I think this is because in com.uber.cadence.internal.sync.POJOActivityTaskHandler.POJOActivityImplementation#execute

the dataConverter is out of the try-catch block:

public ActivityTaskHandler.Result execute(
        IWorkflowService service, String domain, ActivityTaskImpl task, Scope metricsScope) {
      ActivityExecutionContext context =
          new ActivityExecutionContextImpl(service, domain, task, dataConverter, heartbeatExecutor);
      byte[] input = task.getInput();
-->Object[] args = dataConverter.fromDataArray(input, method.getGenericParameterTypes()); 
      CurrentActivityExecutionContext.set(context);
-->try {
        Object result = method.invoke(activity, args);
        RespondActivityTaskCompletedRequest request = new RespondActivityTaskCompletedRequest();
        if (context.isDoNotCompleteOnReturn()) {
          return new ActivityTaskHandler.Result(null, null, null, null);
        }
        if (method.getReturnType() != Void.TYPE) {
          request.setResult(dataConverter.toData(result));
        }
        return new ActivityTaskHandler.Result(request, null, null, null);
      } catch (RuntimeException | IllegalAccessException e) {
        return mapToActivityFailure(task.getActivityType(), e, metricsScope);
      } catch (InvocationTargetException e) {
        return mapToActivityFailure(task.getActivityType(), e.getTargetException(), metricsScope);
      } finally {
        CurrentActivityExecutionContext.unset();
      }
    }
DmitriyLamzin commented 5 years ago

the issue is the same as #374

mfateev commented 5 years ago

I see. I've got confused about relationship of activity failure and the child workflow. #374 makes sense to me. Let me fix the issue.

mfateev commented 5 years ago

Fix: https://github.com/uber/cadence-java-client/pull/381