OndroMih opened 3 years ago
The reason for the exception is that the `JobOperator.start()` method creates an instance of the job and persists it within the transaction that is created for running the timer method in the `JobScheduler` EJB. The `JobOperator` then runs the first batch step asynchronously in a new thread, which tries to persist the checkpoint after that step. The new thread uses autocommit, so its SQL INSERT statement is applied immediately. Because the timer method sleeps for 1 second, the first transaction commits a full second after the new thread tries to persist the checkpoint. The new thread assumes that the job data is already saved, but the first transaction, which saves the job, isn't committed yet. If the sleep instruction is removed, the job runs OK most of the time, but sometimes the background thread is still faster and tries to autocommit before the first thread commits, which results in the same exception.
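For illustration, the failing sequence corresponds to a timer bean along these lines. This is only a minimal sketch with hypothetical names (bean contents, job name, and schedule), assuming the `javax.*` APIs as on Payara 5:

```java
import java.util.Properties;

import javax.batch.operations.JobOperator;
import javax.batch.runtime.BatchRuntime;
import javax.ejb.Schedule;
import javax.ejb.Stateless;

@Stateless
public class JobScheduler {

    // Runs in a container-managed transaction (the EJB default, REQUIRED).
    @Schedule(hour = "*", minute = "*", second = "*/10", persistent = false)
    public void runJob() {
        JobOperator jobOperator = BatchRuntime.getJobOperator();

        // The job instance row is INSERTed inside the timer's transaction,
        // so it is not yet visible to other connections.
        jobOperator.start("my-job", new Properties());

        // The first step already runs in another thread with autocommit;
        // keeping the timer's transaction open for one more second makes
        // the step's checkpoint INSERT reliably miss the job row.
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```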
Everything works as expected if transaction management of the `JobScheduler` EJB is set to BEAN: `@TransactionManagement(TransactionManagementType.BEAN)`. This makes the timer method run without a transaction, so the `JobOperator` uses autocommit and commits immediately. This ensures that the job data is committed before the step is executed in a new thread, and the job data is already available when the step is persisted.
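A minimal sketch of that workaround (bean, job, and schedule names are illustrative; `javax.*` APIs as on Payara 5):

```java
import java.util.Properties;

import javax.batch.runtime.BatchRuntime;
import javax.ejb.Schedule;
import javax.ejb.Stateless;
import javax.ejb.TransactionManagement;
import javax.ejb.TransactionManagementType;

@Stateless
@TransactionManagement(TransactionManagementType.BEAN)
public class JobScheduler {

    @Schedule(hour = "*", minute = "*", second = "*/10", persistent = false)
    public void runJob() {
        // No container-managed transaction is active here and no
        // UserTransaction is started, so the JobOperator's INSERTs
        // autocommit immediately and the job row is visible before
        // the first step runs in its own thread.
        BatchRuntime.getJobOperator().start("my-job", new Properties());
    }
}
```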
A possible fix is for `JobOperator.start()` to run all its SQL statements in a nested transaction, or to persist both the job and the first step on a new thread instead of persisting the job in the current thread and the step in a new thread. It's important to ensure that the transaction that writes the job data commits (or autocommits) before the step data is written.
I think this should be raised at the Jakarta Batch spec project. I'm not sure this is a bug: if there's a global TX around the creation of a job, what is the developer expecting? This is similar to, say, a global TX around submitting a Jakarta Concurrency task or enqueuing a message onto a JMS queue.
I agree we could have some better error messages around this.
Hi @smillidge, I'm pretty sure it's a bug, but I agree that it should be raised at the upstream JBatch RI project, and ideally at the Jakarta Batch project to standardize the behavior. The Jakarta Batch spec doesn't say anything about running a job in a global transaction.
The reason why I believe it's a bug is that, in a global transaction, I would expect the job to be scheduled according to one of the following scenarios:
Since the spec mandates that the job should reuse the existing global transaction (although this isn't clarified enough in the spec), I believe the best solution is to defer the job execution until the global transaction is committed. This also makes the most sense to me, because:
`JobOperator.start()` should execute asynchronously, which implies that it's not tied to the transaction, so it's OK if it's executed after the transaction commits. There's no reason why the job should be started within the transaction, or why it should have any impact on whether the transaction commits or not. Argument 2 applies to all options, but argument 1 applies only to the first option. That's why I believe the first option is semantically the most correct.
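To illustrate the first option, a deferred start could hook into the transaction lifecycle roughly like this. This is only a sketch of the idea, not how JBatch is implemented; the class and method names are hypothetical, and `javax.*` APIs are assumed:

```java
import java.util.Properties;

import javax.annotation.Resource;
import javax.batch.runtime.BatchRuntime;
import javax.ejb.Stateless;
import javax.transaction.Status;
import javax.transaction.Synchronization;
import javax.transaction.TransactionSynchronizationRegistry;

@Stateless
public class DeferredJobStarter {

    @Resource
    TransactionSynchronizationRegistry txRegistry;

    public void startAfterCommit(final String jobName, final Properties params) {
        txRegistry.registerInterposedSynchronization(new Synchronization() {
            @Override
            public void beforeCompletion() {
                // nothing to do before commit
            }

            @Override
            public void afterCompletion(int status) {
                // Runs outside the global transaction, so the job
                // repository writes autocommit. Start the job only if
                // the caller's transaction really committed.
                if (status == Status.STATUS_COMMITTED) {
                    BatchRuntime.getJobOperator().start(jobName, params);
                }
            }
        });
    }
}
```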
I'll raise this with the JBatch RI and also the Jakarta Batch project to get their opinion.
Reproduced and created FISH-1179 to keep track of this issue.
It looks like OpenLiberty provides its own JobOperator that suspends the current transaction and commits the changes directly: https://github.com/OpenLiberty/open-liberty/blob/integration/dev/com.ibm.jbatch.container/src/com/ibm/jbatch/container/api/impl/JobOperatorImplSuspendTran.java
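The gist of that approach, as a hedged sketch (the real OpenLiberty class wraps the full `JobOperator` API; the names below are illustrative, and how to obtain the `TransactionManager` is server-specific):

```java
import java.util.Properties;

import javax.batch.operations.JobOperator;
import javax.transaction.Transaction;
import javax.transaction.TransactionManager;

public class SuspendTranJobStart {

    static long startSuspended(TransactionManager tm, JobOperator delegate,
                               String jobName, Properties params) throws Exception {
        // Suspend the caller's global transaction so that the job
        // repository writes commit immediately on their own connection...
        Transaction suspended = tm.suspend();
        try {
            return delegate.start(jobName, params);
        } finally {
            // ...and then hand the caller's transaction back untouched.
            if (suspended != null) {
                tm.resume(suspended);
            }
        }
    }
}
```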
The JBatch `JobOperator` will be enhanced in version 2.1.0 with this PR: https://github.com/WASdev/standards.jsr352.jbatch/pull/77. Although I haven't tested it yet (there's no way to test it with Payara 5 because it's not compatible), I'm pretty sure it fixes this issue, because it fixes a very similar issue in GlassFish and the Jakarta Batch 2.1 TCK. Therefore this issue will most probably be fixed in Payara 6 once it integrates JBatch 2.1.0.
Description
When a batch job is started from a managed transaction (e.g. from a stateless EJB), the job sometimes (not always) fails to execute.
This happens with the default configuration, where the Batch engine uses the default H2 datasource. I haven't tested with other databases.
When the job fails, there's a NullPointerException in the server log. It turns out that this isn't the original exception; it merely masks the original PersistenceException, which is swallowed. The debugger shows that the original PersistenceException says:
But when I check the data in the database, the EXECUTIONINSTANCEDATA table contains the JOBEXECID that is about to be inserted into STEPEXECUTIONINSTANCEDATA. This probably means that it wasn't in the table at the time the INSERT into STEPEXECUTIONINSTANCEDATA was executed, but was added later.
Expected Outcome
A timer defined on a Stateless EJB starts a Batch job.
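The actual code is in the attached reproducer; the following is only an illustrative sketch of the essential call (bean, job, and schedule names are hypothetical; `javax.*` APIs as on Payara 5):

```java
import java.util.Properties;

import javax.batch.runtime.BatchRuntime;
import javax.ejb.Schedule;
import javax.ejb.Stateless;

@Stateless
public class JobScheduler {

    // Runs in a container-managed transaction (the EJB default).
    @Schedule(hour = "*", minute = "*", second = "*/10", persistent = false)
    public void startJob() {
        // "my-job" stands in for the job XML name used by the reproducer.
        BatchRuntime.getJobOperator().start("my-job", new Properties());
    }
}
```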
The job runs successfully as defined in the XML, without any exceptions.
Current Outcome
Most of the time, the batch job runs successfully as expected. Sometimes (roughly 1 in 10 times), it fails with the following stacktrace:
nullpointer_stacktrace.txt
According to the source code, the exception is raised from a catch block that is already handling another exception. The original exception is swallowed, but the debugger shows that this is the original exception:
persistenceexc_stacktrace.txt
Steps to reproduce (Only for bug reports)
Just build the attached application batch-issue.zip with `mvn install` and deploy it to Payara Server.

Context (Optional)
This issue was originally reported in the Payara Forum for the CargoTracker project.
Environment