imixs / imixs-workflow

The open source technology for business process management
http://www.imixs.org/
GNU General Public License v3.0
389 stars 71 forks source link

Concurrency Bug in Imixs-Workflow + Imixs-Microservices #922

Closed hectorfausm closed 6 days ago

hectorfausm commented 1 month ago

Describe the bug I have implemented a method to test how imixs-workflow + imixs-microservices handle concurrent load. When I execute a load with 5 simultaneous threads making requests, I encounter the following error:

Caused by: java.lang.NullPointerException: Cannot assign field "fChildIndex" because "this.fNodeListCache" is null
        at java.xml/com.sun.org.apache.xerces.internal.dom.ParentNode.nodeListItem(ParentNode.java:783)
        at java.xml/com.sun.org.apache.xerces.internal.dom.ParentNode.item(ParentNode.java:797)
        at deployment.imixs-microservice.war//org.openbpmn.bpmn.BPMNModel.findChildNodesByName(BPMNModel.java:1464)
        at deployment.imixs-microservice.war//org.openbpmn.bpmn.BPMNModel.findChildNodeByName(BPMNModel.java:1487)
        at deployment.imixs-microservice.war//org.imixs.workflow.bpmn.BPMNEntityBuilder.build(BPMNEntityBuilder.java:100)
        at deployment.imixs-microservice.war//org.imixs.workflow.WorkflowKernel.eval(WorkflowKernel.java:634)
        at deployment.imixs-microservice.war//org.imixs.workflow.engine.WorkflowService.evalNextTask(WorkflowService.java:1160)
        at deployment.imixs-microservice.war//org.imixs.workflow.engine.plugins.ApplicationPlugin.run(ApplicationPlugin.java:102)
        at deployment.imixs-microservice.war//org.imixs.workflow.WorkflowKernel.runPlugins(WorkflowKernel.java:1190)
        at deployment.imixs-microservice.war//org.imixs.workflow.WorkflowKernel.executeMicroKernels(WorkflowKernel.java:787)
        at deployment.imixs-microservice.war//org.imixs.workflow.WorkflowKernel.processEvent(WorkflowKernel.java:443)
        at deployment.imixs-microservice.war//org.imixs.workflow.WorkflowKernel.process(WorkflowKernel.java:410)
        at deployment.imixs-microservice.war//org.imixs.workflow.engine.WorkflowService.processWorkItem(WorkflowService.java:603)
        at jdk.internal.reflect.GeneratedMethodAccessor141.invoke(Unknown Source)

SEVERE [org.imixs.workflow.jaxrs.WorkflowRestService] (default task-5) INTERNAL ERROR: Cannot assign field "fChildIndex" because "this.fNodeListCache" is null

To Reproduce Steps to reproduce the behavior:

  1. Launch [imixs-microservices + imixs-workflow] + postgresql:13.11 (I also tested with 17.4) + imixs-admin using docker-compose.
  2. Execute more than 5 (the more threads, the higher the frequency) simultaneous threads using a small application that calls the Imixs REST services to start a process via Melman:
public String initProcess() {
    WorkflowClient workflowCLient = new WorkflowClient("http://xxxxxx:8080/api/");
    BasicAuthenticator basicAuth = new BasicAuthenticator("admin", "adminadmin");
    workflowCLient.registerClientRequestFilter(basicAuth);
    ItemCollection workitem=new ItemCollection();
    workitem.model("X.X.X").task(100).event(20);
    workitem.replaceItemValue("type", "workitem");
    Map<String, String> formData = new HashMap<>();
    formData.put("userId", "C0001");
    formData.put("dni", "123456789Z");
    workitem.appendItemValue("form_data", formData);
    try {
        workitem=workflowCLient.processWorkitem(workitem);
    } catch (RestAPIException e) {
        e.printStackTrace();
    }
    String inputId = workitem.getUniqueID();
    System.out.println("Init: "+inputId);
    return inputId;
}
  1. Check the logs for the error. Test client log:
org.imixs.melman.RestAPIException: 406: Not Acceptable
        at org.imixs.melman.DocumentClient.postXMLDocument(DocumentClient.java:464)
        at org.imixs.melman.WorkflowClient.processWorkitem(WorkflowClient.java:78)
        at es.home.imixs.docker.rest_test.model.OperationsRunner.initProcess(OperationsRunner.java:43)
        at es.home.imixs.docker.rest_test.model.OperationsRunner.lambda$1(OperationsRunner.java:67)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)
=== TEST REPORT ===
Total operations: 5
Successfully started: 4
Average init time (ms): 41
Total test time (ms): 57

Expected behavior Imixs should be able to synchronize concurrent events and avoid synchronization issues with Xerces objects.

Additional context I investigated a bit and it appears to be a synchronization issue. I attempted to wrap the method that retrieves the cached BPMNModel to return a new copy every time an object is requested from the cache, but I was not successful.

rsoika commented 1 month ago

Hi @hectorfausm , thanks for your finding! Yes I analyzed the code now also in deep and it sees to be all related to the new BPMN ModelManager used by the EJB ModelService.

The root problem is that the XML parsers (xerces in this case) are not thread save. So operations on the Dom tree will fail.

Do you have the chance to test a small change in the class org.imixs.workflow.engine.ModelService ?

Just change the class header:

@DeclareRoles({ "org.imixs.ACCESSLEVEL.NOACCESS", "org.imixs.ACCESSLEVEL.READERACCESS",
        "org.imixs.ACCESSLEVEL.AUTHORACCESS", "org.imixs.ACCESSLEVEL.EDITORACCESS",
        "org.imixs.ACCESSLEVEL.MANAGERACCESS" })
@RolesAllowed({ "org.imixs.ACCESSLEVEL.NOACCESS", "org.imixs.ACCESSLEVEL.READERACCESS",
        "org.imixs.ACCESSLEVEL.AUTHORACCESS", "org.imixs.ACCESSLEVEL.EDITORACCESS",
        "org.imixs.ACCESSLEVEL.MANAGERACCESS" })
@Singleton
@Lock(LockType.WRITE)  // This is the key change
public class ModelService {
    // ... existing implementation ...
}

This should have the following effect:

  1. Removes @ConcurrencyManagement (which was redundant since it's the default)
  2. Adds @Lock(LockType.WRITE) which means:
    • All methods will be serialized (one thread at a time)
    • No concurrent access to the ModelManager or its BPMNModels
    • Complete thread safety for your DOM operations

I hope this solves the issue.

rsoika commented 1 month ago

Ok, I tested it myself. This suggested change does not solve the issue. I need to investigate in more detail. Maybe we need a change in the Open-BPMN Metamodel Implementation.

The problem is the DOM tree hold by the org.openbpmn.bpmn.BPMNModel class in the instance variable doc which is not thread save. Let's stay connected....

hectorfausm commented 1 month ago

Indeed, I just tried the change you mentioned, and the error still persists. Thank you so much for the quick response and for the project. Just trying to help: maybe creating a deep copy of the object each time it is returned? I gave it a try, but the object is very complex and I haven’t quite managed to get it working.

Thanks again!

rsoika commented 1 month ago

Yes I try to avoid a deep copy. I think I can implement a pool mechanism. Currently I am doing some refactoring. Looks good so far...

hectorfausm commented 1 month ago

I tried to compile the branch and got some compilation errors. I assume it's still a work in progress. If I can help with anything, feel free to let me know.

Best regards!

rsoika commented 1 month ago

Yes the branch is under heavy development ;-) - lot of refactoring (mostly junit tests ;-) I am near to the end. Yes you can help me with the critical method to create the thread save version of the BPMNModel object later. But first I need to complete the refactoring part....

rsoika commented 1 month ago

Hi @hectorfausm - good news it looks like this issue is fixed with the latest snapshot version 6.2.2-SNAPSHOT (master branch)

The surprising thing is, that even my initially naive approach, of simply cloning the BPMNModel, shows hardly any noticeable effects on performance. Overall, however, there was a lot of refactoring in the WorkflowKernel, ModelManager and ModelService.

The ModelService now only manages the model metadata and returns a new exclusive BPMN model on each request - and no longer a shared version of the BPMNModel (with its embedded DOM tree).

The relevant method for this part you can see in the method getBPMModel

I expected that constructing a complete new BPMNModel from the BPMN raw data has more impact on performance. However I also implemented a lot of new local caching in the WorkflowKernel. So even in very complex processing cycles the performance should be still high.

In the Imixs-Microservice project I also added a new junit test to run a load test (Integration Test) when the service is running: https://github.com/imixs/imixs-microservice/blob/master/src/test/java/org/imixs/workflow/rest/LoadTest.java

I would greatly appreciate your feedback on this topic!

hectorfausm commented 1 month ago

Hello, I’ve been testing batch executions following the same procedure I used when I encountered the error, and so far I have the following results:

=== TEST REPORT === Total operations: 30 Successfully started: 30 Average init time (ms): 133 Total test time (ms): 339


=== TEST REPORT === Total operations: 50 Successfully started: 50 Average init time (ms): 174 Total test time (ms): 329


=== TEST REPORT === Total operations: 100 Successfully started: 100 Average init time (ms): 319 Total test time (ms): 552

I stopped testing here because running more than 100 threads at once is excessive.

rsoika commented 1 month ago

Thanks for your feedback. How is your overall impression now?

hectorfausm commented 1 month ago

I said "excessive" in the sense that it's "more than enough." The error has disappeared, and as you rightly said, it's a bit slower, but I still think the response times are acceptable—especially considering that, even with 100 threads launched all at once, they're still responding in under half a second.