payara / Payara

Payara Server is an open source middleware platform that supports reliable and secure deployments of Java EE (Jakarta EE) and MicroProfile applications in any environment: on premise, in the cloud or hybrid.
http://www.payara.fish

A batch Job started from a managed transaction sometimes leads to a failed execution and NullPointerException / FISH-1179 #5133

Open OndroMih opened 3 years ago

OndroMih commented 3 years ago

Description


When a batch job is started from a managed transaction (e.g. from a stateless EJB), it sometimes (not always) fails to execute the job.

This is with the default configuration when the Batch engine uses the default H2 datasource. I haven't tested with other databases.

When the job fails, there's a NullPointerException in the server log. It turns out that this isn't the original exception; it only masks the original PersistenceException, which is swallowed. Information from the debugger shows that the original PersistenceException says:

Referential integrity constraint violation: "JOBEXEC_STEPEXEC_FK: PUBLIC.STEPEXECUTIONINSTANCEDATA FOREIGN KEY(JOBEXECID) REFERENCES PUBLIC.EXECUTIONINSTANCEDATA(JOBEXECID) (2)"; SQL statement: INSERT INTO STEPEXECUTIONINSTANCEDATA (jobexecid, batchstatus, exitstatus, stepname, readcount,writecount, commitcount, rollbackcount, readskipcount, processskipcount, filtercount, writeskipcount, starttime,endtime, persistentdata) VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) [23506-196]

But when I check the data in the database, the table EXECUTIONINSTANCEDATA contains the JOBEXECID that is being inserted into STEPEXECUTIONINSTANCEDATA. This probably means that it wasn't in the table at the time the INSERT into STEPEXECUTIONINSTANCEDATA was executed but was added later.

Expected Outcome

A timer defined on a Stateless EJB starts a Batch job with the following code:

BatchRuntime.getJobOperator().start("EventFilesProcessorJob", null);  // the EventFilesProcessorJob job is defined as a Batch job in a XML file at the expected location

The job runs successfully as defined in the XML, without any exceptions.

Current Outcome

Most of the time, the batch job runs successfully as expected. Sometimes (roughly 1 in 10 times), it fails with the following stacktrace:

nullpointer_stacktrace.txt

According to the source code, the exception is raised from a catch block that is already handling an exception. The original exception is swallowed, but the debugger shows that this is the original exception:

persistenceexc_stacktrace.txt

Steps to reproduce (Only for bug reports)

Just build the attached application batch-issue.zip with mvn install and deploy to Payara Server.

Context (Optional)

This issue was originally reported in Payara Forum for the CargoTracker project.

Environment

OndroMih commented 3 years ago

The reason for the exception is that the JobOperator.start() method creates an instance of the job and persists it within the transaction that is created for running the timer method in the JobScheduler EJB. The JobOperator then runs the first batch step asynchronously in a new thread, where it tries to persist the checkpoint after that step. The new thread uses autocommit, so its SQL INSERT statement is applied immediately.

The timer method sleeps for 1 second, so the first transaction commits 1 second after the new thread tries to persist the checkpoint. The checkpoint INSERT assumes that the job data is already saved, but the first transaction, which saves the job, isn't committed yet. If the sleep instruction is removed, the job runs OK most of the time, but sometimes the background thread is faster and tries to autocommit before the first thread commits, which results in the same exception.
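For illustration, here is a minimal sketch of the scenario described above (the bean and job names come from this report; the schedule, the logger, and the interrupt handling are assumptions). The container-managed transaction around the timer method commits only after the method returns, so the sleep widens the window in which the asynchronous batch thread races ahead of the commit:

    import java.util.logging.Logger;
    import javax.batch.runtime.BatchRuntime;
    import javax.ejb.Schedule;
    import javax.ejb.Stateless;

    @Stateless
    public class JobScheduler {

        private static final Logger LOG = Logger.getLogger(JobScheduler.class.getName());

        // Runs inside a container-managed transaction (the EJB default).
        @Schedule(hour = "*", minute = "*", second = "*/10", persistent = false)
        public void runJob() {
            // Persists the job instance within the still-open timer transaction.
            long executionId = BatchRuntime.getJobOperator().start("EventFilesProcessorJob", null);
            LOG.info("Started batch job execution " + executionId);
            try {
                // The transaction commits only after this method returns, i.e. about
                // 1 second after the batch thread has already tried to INSERT its
                // step checkpoint with autocommit.
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }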

Everything works as expected if transaction management of the JobScheduler EJB is set to BEAN: @TransactionManagement(TransactionManagementType.BEAN). This causes the timer method to run without a transaction, so the JobOperator uses autocommit to commit immediately. This ensures that the job data is committed before the step is executed in a new thread, and the job data is therefore already available when the step is persisted.
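A minimal sketch of this workaround, using the same illustrative bean as above:

    import javax.batch.runtime.BatchRuntime;
    import javax.ejb.Schedule;
    import javax.ejb.Stateless;
    import javax.ejb.TransactionManagement;
    import javax.ejb.TransactionManagementType;

    @Stateless
    @TransactionManagement(TransactionManagementType.BEAN) // no container-managed transaction
    public class JobScheduler {

        @Schedule(hour = "*", minute = "*", second = "*/10", persistent = false)
        public void runJob() {
            // No global transaction is active here, so the JobOperator's writes
            // autocommit and the job row is committed before the step thread runs.
            BatchRuntime.getJobOperator().start("EventFilesProcessorJob", null);
        }
    }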

A possible fix is for JobOperator.start() to run all its SQL statements in a nested transaction, or to persist both the job and the first step on a new thread instead of persisting the job in the current thread and the step in a new thread. It's important to ensure that the transaction that writes the job data commits (autocommits) before the step data is written; one way to achieve this is sketched below.
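To make the first idea concrete, here is a hypothetical sketch (not Payara's actual code; the helper class is invented for illustration) that starts the job outside the caller's global transaction by suspending it via the JTA TransactionManager. The JNDI name used here is GlassFish/Payara-specific:

    import java.util.Properties;
    import javax.batch.operations.JobOperator;
    import javax.naming.InitialContext;
    import javax.transaction.Transaction;
    import javax.transaction.TransactionManager;

    public final class SuspendTranJobStarter {

        public static long startOutsideCallerTransaction(JobOperator operator,
                String jobXmlName, Properties params) throws Exception {
            // GlassFish/Payara expose the JTA TransactionManager under this JNDI name.
            TransactionManager tm = (TransactionManager)
                    new InitialContext().lookup("java:appserver/TransactionManager");
            Transaction suspended = tm.suspend(); // returns null if no transaction is active
            try {
                // With no transaction active, the batch repository writes autocommit,
                // so the job row is visible before the step thread inserts its checkpoint.
                return operator.start(jobXmlName, params);
            } finally {
                if (suspended != null) {
                    tm.resume(suspended);
                }
            }
        }
    }

Suspending and resuming keeps the caller's transaction semantics intact while letting the batch repository commit its own writes immediately.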

smillidge commented 3 years ago

I think this should be raised at the Jakarta Batch spec project. Not sure this is a bug. If there's a global TX around the creation of a job, what is the developer expecting? This is similar to, say, a global TX around submitting a Jakarta Concurrency task or enqueuing a message onto a JMS queue.

I agree we could have some better error messages around this.

OndroMih commented 3 years ago

Hi @smillidge , I'm pretty sure it's a bug but I agree that it should be raised at the upstream JBatch RI project, and ideally at the Jakarta Batch project to standardize the behavior. The Jakarta Batch spec doesn't say anything about running a job in a global transaction.

The reason why I believe it's a bug is that I would expect a job started in a global transaction to be handled according to one of the following scenarios:

Since the spec mandates that the job should reuse the existing global transaction (although this isn't clarified enough in the spec), I believe the best solution is to delay the job execution until the global transaction has committed. This also makes the most sense to me, because:

  1. If a job is created in a transaction and the transaction rolls back, the job shouldn't be executed or should be rolled back. This solution simply wouldn't execute the job; reverting a custom job is complicated, if not impossible, in a compliant way.
  2. JobOperator.start() should execute asynchronously, which implies that it's not connected to the transaction. So it's OK if it's executed after the transaction commits. There's no reason why the job should be started within the transaction or have any impact on whether the transaction commits.

Argument 2 applies to all options, but argument 1 applies only to the first option. That's why I believe the first option is semantically the most correct; a sketch of that approach follows.
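A hypothetical sketch of the first option (names invented for illustration, not proposed Payara code): defer the job start until the surrounding transaction commits by registering a JTA Synchronization, so a rolled-back transaction simply never starts the job:

    import java.util.Properties;
    import javax.batch.runtime.BatchRuntime;
    import javax.naming.InitialContext;
    import javax.transaction.Status;
    import javax.transaction.Synchronization;
    import javax.transaction.TransactionSynchronizationRegistry;

    public final class AfterCommitJobStarter {

        // Must be called while a global transaction is active on the current thread.
        public static void startAfterCommit(String jobXmlName, Properties params) throws Exception {
            // Standard JNDI name defined by the JTA specification.
            TransactionSynchronizationRegistry registry = (TransactionSynchronizationRegistry)
                    new InitialContext().lookup("java:comp/TransactionSynchronizationRegistry");
            registry.registerInterposedSynchronization(new Synchronization() {
                @Override
                public void beforeCompletion() {
                    // Nothing to do before the commit.
                }

                @Override
                public void afterCompletion(int status) {
                    // Start the job only if the surrounding transaction committed;
                    // on rollback the job is simply never executed.
                    if (status == Status.STATUS_COMMITTED) {
                        BatchRuntime.getJobOperator().start(jobXmlName, params);
                    }
                }
            });
        }
    }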

I'll raise this with the JBatch RI and also the Jakarta Batch project to get their opinion.

AlanRoth commented 3 years ago

Reproduced and created FISH-1179 to keep track of this issue.

OndroMih commented 2 years ago

It looks like OpenLiberty provides its own JobOperator that suspends the current transaction and commits the changes directly: https://github.com/OpenLiberty/open-liberty/blob/integration/dev/com.ibm.jbatch.container/src/com/ibm/jbatch/container/api/impl/JobOperatorImplSuspendTran.java

OndroMih commented 2 years ago

The JBatch JobOperator will be enhanced in version 2.1.0 with this PR: https://github.com/WASdev/standards.jsr352.jbatch/pull/77. Although I haven't tested it yet (there's no way to test it with Payara 5 because it's not compatible), I'm pretty sure it fixes this issue, because it fixes a very similar issue in GlassFish and the Jakarta Batch 2.1 TCK. Therefore this issue will most probably be fixed in Payara 6 once it integrates JBatch 2.1.0.