quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io
Apache License 2.0
13.35k stars 2.55k forks source link

Unexpected commit of transaction using Oracle #36265

Open segersb opened 9 months ago

segersb commented 9 months ago

Describe the bug

The Oracle JDBC driver performs a commit when a connection is closed without an explicit commit or rollback. When Quarkus is shut down gracefully, connections are closed properly, and this causes unfinished transactions to be committed.

Consider the following simple rest endpoint. We insert a record in the database, wait 2 minutes and then throw an exception.

@PermitAll
@Path("/test")
public class TestResource {
    @Inject
    EntityManager entityManager;

    @GET
    @Transactional
    @SneakyThrows
    public Response test() {
        entityManager.createNativeQuery("insert into TEST_TABLE (TEST_COLUMN) values (1)").executeUpdate();

        System.out.println("shut down app now");
        Thread.sleep(120000);

        if (true) {
            throw new RuntimeException("app was not shut down, throwing exception to avoid transaction commit");
        }

        return Response.ok().build();
    }
}

Expected behavior

I would expect that nothing is inserted in the database.

Actual behavior

When doing a graceful shutdown during the wait period, the insert is committed.

How to Reproduce?

See code above for the simple rest endpoint.

To perform a graceful shutdown locally we need to start quarkus-dev via a java command, and not mvn quarkus:dev. Fortunately when I execute quarkus-dev via IntelliJ, the java command is the first line of the output. This allows us to use CTRL-C, which will do a graceful shutdown.

  1. run the java command to start quarkus:dev
  2. call the rest endpoint
  3. wait for the quarkus console to output "shut down app now"
  4. press CTRL-C and wait for the Quarkus to shut down
  5. check the database

Output of uname -a or ver

No response

Output of java -version

No response

GraalVM version (if different from Java)

No response

Quarkus version or git rev

No response

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response

quarkus-bot[bot] commented 9 months ago

/cc @DavideD (hibernate-reactive), @Sanne (hibernate-orm,hibernate-reactive), @gavinking (hibernate-reactive), @gsmet (hibernate-orm), @mswatosh (db2), @yrodiere (hibernate-orm)

yrodiere commented 8 months ago

I didn't try this locally yet (have to download an Oracle container on a low-bandwidth connection...), but IIUC the main problem is that the interceptor doesn't trigger a rollback before Agroal gets stopped (and connections get closed).

I wonder if the root cause isn't simply that your thread gets interrupted, leading to Thread.sleep throwing an InterruptedException, which, crucially, isn't a RuntimeException and thus doesn't trigger a rollback...

See https://jakarta.ee/specifications/transactions/2.0/jakarta-transactions-spec-2.0#transactional-annotation:

By default checked exceptions do not result in the transactional interceptor marking the transaction for rollback and instances of RuntimeException and its subclasses do.

(And yes I agree it's a crazily dangerous default, but that's the spec.)

Could you try annotating your method with this instead, and tell me if it solves your problem?

@Transactional(rollbackOn={Exception.class})
segersb commented 8 months ago

Thanks for your reply!

The result is still the same. I used Thread.sleep in a rest endpoint here to make a simple example. We noticed the issue with longer running processes combined with redeployment on Openshift.

I also verified this "commit on close" behavior with basic JDBC. If I don't explicitly roll back, then the insert is committed when closing the connection. My guess is that Oracle is the only database provider that behaves like this.

        try {
            Class.forName("oracle.jdbc.driver.OracleDriver");
            conn = DriverManager.getConnection(jdbcUrl, username, password);
            conn.setAutoCommit(false);

            String insertSQL = "INSERT INTO D00SREFTPB01.TEST_TABLE (TEST_COLUMN) VALUES (?)";
            try (PreparedStatement pstmt = conn.prepareStatement(insertSQL)) {
                pstmt.setLong(1, System.currentTimeMillis());
                pstmt.executeUpdate();
                System.out.println("Dummy data inserted into TEST_TABLE");
            }

//            conn.rollback();
        } catch (SQLException | ClassNotFoundException e) {
            e.printStackTrace();
        } finally {
            if (conn != null) {
                try {
                    conn.close();
                } catch (SQLException e) {
                    e.printStackTrace();
                }
            }
        }
yrodiere commented 8 months ago

My guess is that Oracle is the only database provider that behaves like this.

That's for sure, and from what I can see it's well-known to Oracle users; found this in particular.

I'm not too worried about Oracle's behavior though, more about the fact that the transaction wasn't rolled back by the interceptor (so, before the connection was closed), especially with @Transactional(rollbackOn={Exception.class}).

There must be something wrong on shutdown, maybe Agroal stopping before application threads, I don't know. Needs further investigation.

segersb commented 8 months ago

Note I did fix the issue with a workaround by using a AgroalDataSourceListener.

The code is dirty as there is no out-of-the-box solution to set a custom AgroalDataSourceListener in Quarkus. I also created a feature request to allow for that (https://github.com/quarkusio/quarkus/issues/36263)

But in essence I just do a rollback in beforeConnectionDestroy Normally if the transaction was successful, Quarkus should have performed the commit and there would be nothing to roll back.

@ApplicationScoped
public class CustomOracleRollback implements StartupInitializer, AgroalDataSourceListener {
    @Inject
    DataSources dataSources;

    @Override
    @SneakyThrows
    @SuppressWarnings("all")
    public void initialize() {
        DataSource dataSource = (DataSource) dataSources.getDataSource("<default>");

        Field connectionPoolField = DataSource.class.getDeclaredField("connectionPool");
        connectionPoolField.setAccessible(true);
        ConnectionPool connectionPool = (ConnectionPool) connectionPoolField.get(dataSource);

        Field listenersField = ConnectionPool.class.getDeclaredField("listeners");
        listenersField.setAccessible(true);
        AgroalDataSourceListener[] listeners = (AgroalDataSourceListener[]) listenersField.get(connectionPool);

        List<AgroalDataSourceListener> newListeners = new ArrayList<>(Arrays.asList(listeners));
        newListeners.add(this);

        Field theUnsafeField = Unsafe.class.getDeclaredField("theUnsafe");
        theUnsafeField.setAccessible(true);
        Unsafe unsafe = (Unsafe) theUnsafeField.get(null);

        long offset = unsafe.objectFieldOffset(listenersField);
        AgroalDataSourceListener[] newListenersArray = newListeners.toArray(new AgroalDataSourceListener[0]);
        unsafe.putObject(connectionPool, offset, newListenersArray);
    }

    @Override
    public void beforeConnectionDestroy(Connection connection) {
        try {
            connection.rollback();
        } catch (SQLException e) {
            // ignore
        }
    }
}
segersb commented 8 months ago

Note the issue is only happening on graceful shutdowns. I'm not sure how this is handled by Quarkus, but my guess is that the interceptor you are referring to never gets executed. Maybe the executor thread pool is just shut down with a timeout and threads are killed if they take too long to finish.

yrodiere commented 8 months ago

and threads are killed if they take too long to finish

That's the thing: if threads are killed, the call to Thread.sleep should throw an InterruptedException, and if you put @Transactional(rollbackOn={Exception.class}) on your method, that should trigger a rollback.

And all that should happen before the DB connection gets closed, because why on earth would Quarkus stop DB connection pools before killing application threads...

If it doesn't happen that way, there's something wrong...

Another theory would be that @SneakyThrows somehow confuses Quarkus and leads to the interceptor misbehaving. While unlikely, I suppose that's possible.

yrodiere commented 6 months ago

Unfortunately I didn't have time to make progress on this.

And all that should happen before the DB connection gets closed, because why on earth would Quarkus stop DB connection pools before killing application threads...

If it doesn't happen that way, there's something wrong...

I had to work on the Agroal code recently, and noticed this:

https://github.com/quarkusio/quarkus/blob/5a878c607699ddbd0660bc962b859a8e44672db2/extensions/agroal/runtime/src/main/java/io/quarkus/agroal/runtime/DataSources.java#L407-L415

If Vert.x threads are killed in the same way, there's indeed a chance the order is undetermined (vertx threads may be killed before or after agroal connection pools, depending on who knows what)

segersb commented 6 months ago

Maybe killing the applications threads don't cause an explicit rollback of the transaction?

I can't really comment on Quarkus internals, but the root cause of my issue is that Oracle commits on close. The dirty workaround I have now simply makes sure there is always an explicit rollback before the close. I believe the custom listener I made is triggered by the dataSource.close() in the code you mentioned.

I'm not sure what the best solution is for Quarkus. The "explicit rollback before close" solution seems pretty robust to me. But you could argue it's redundant and should be handled by the code behind @Transactional and the shutdown order.

yrodiere commented 6 months ago

Maybe killing the applications threads don't cause an explicit rollback of the transaction?

If you used @Transactional(rollbackOn={Exception.class}) (and you said earlier you did try that), then InterruptedException definitely should cause a rollback, at least that's how I understand the spec. But yes, maybe the bug is there, we'll need to check.

I can't really comment on Quarkus internals, but the root cause of my issue is that Oracle commits on close.

Yeah that... is really a dubious design choice. I guess we could override this behavior by default if there's a setting in the JDBC driver, but then if you had to resort to complicated listeners, I guess there isn't such a setting.

But you could argue it's redundant and should be handled by the code behind @transactional and the shutdown order.

Exactly. Because there might be other, similar problems caused by this (still hypothetical) wrong shutdown order.

czenczl commented 6 months ago

HI!

Probably, we also encountered the mentioned unexpected commit.

This commit occurred for us when we tried to shut down Wildfly(27) gracefully. At that time, we noticed that in some cases, only a part of the tables was committed to the Oracle database (we experienced the same on PostgreSQL).

see: https://github.com/ironjacamar/ironjacamar/blob/main/adapters/src/main/java/org/jboss/jca/adapters/jdbc/BaseWrapperManagedConnection.java#L174

There exists a property in version 1.5.9.Final ironjacamar jdbc where this can be set, and after that, the cleanup function handles the commit as expected for us (checked the main branch, its also looks good to me):

https://github.com/ironjacamar/ironjacamar/blob/1.5.9.Final/adapters/src/main/java/org/jboss/jca/adapters/jdbc/BaseWrapperManagedConnection.java#L362C10-L362C10 (JAVA_OPTS: "-Dironjacamar.jdbc.setautocommitoncleanup=false")

Without this, we also observed that records appeared in the database that shouldn't have. I think Quarkus also uses similar libraries; perhaps it will help.

Some tickets we started from: https://issues.redhat.com/browse/JBEAP-23764 https://issues.redhat.com/browse/JBJCA-1431

segersb commented 6 months ago

Thanks for the input!

I tried looking for some property to set before, but didn't find any. Maybe the Quarkus devs can work with this.

In sqlplus it's the same behavior and you can turn it off with set exitcommit off (it's on by default) However I was unable to find the same setting on the OJBDC driver.

yrodiere commented 5 months ago

Hey @segersb , while working on something else I stumbled upon this: https://quarkus.io/guides/lifecycle#graceful-shutdown

Could you confirm that when you talk about graceful shutdowns, you mean it in this specific sense (you set quarkus.shutdown.timeout in your app), not in a more general, technology-agnostic sense? Because apparently Quarkus shutdowns are not, by default, graceful.

This probably doesn't change the problem, but would be interesting to know.

segersb commented 4 months ago

Hello @yrodiere

We haven't set anything specific for that parameter, so we just use the default.

We detected the issue when running our Quarkus app on OpenShift. We have a batch with a long transaction that inserts records in a database. Of course we expect that either all the records or none are inserted.

When a redeploy is done in OpenShift while this batch is running, we end up having partial data.

When running the standard mvn quarkus-dev + CTRL-C locally, we could not replicate this issue. When pressing CTRL-C the process is terminated and the prompt is shown instantly.

When running the quarkus:dev via java.exe + CTRL-C locally, we were able to replicate the issue. When pressing CTRL-C it takes a while before the process is terminated and the prompt is shown.

This led us to the conclusion that the shutdown method is somehow involved. When the issue occurs it seems that the connection is explicitly closed, causing Oracle to commit on close. When the application is instantly terminated there is no issue since Oracle does no commit when the connection is lost.

ivu-mawi commented 4 months ago

Hi, we also have this exact issue. First observed in our Kubernetes-based test environment, where Kubernetes sends a SIGTERM to the Quarkus app, after which a partial transaction got committed to the Oracle DB.

I can reliably reproduce it using Quarkus dev mode and pressing CTRL+C or q during a large transaction with 100k+ inserts to multiple tables. Then, some thousands of table entries are visible, but not all.

Using the workaround from @segersb above (https://github.com/quarkusio/quarkus/issues/36265#issuecomment-1750652882) almost, but not quite, fixes the issue. The call to rollback happens without throwing an exception, but a SINGLE table entry is still inserted anyway.

One thing I've noted is that "Arjuna" also tries to rollback the transaction, but fails because the connection was already closed:

[co.ar.at.jta] (database-worker-saving-entities-1) ARJUNA016045: attempted rollback of < formatId=131077, gtrid_length=35, bqual_length=36, tx_uid=0:ffff7f000101:ac03:65c63ff6:2a, node_name=quarkus, branch_uid=0:ffff7f000101:ac03:65c63ff6:2d, subordinatenodename=null, eis_name=0 > (io.agroal.narayana.LocalXAResource@5158e0b4) failed with exception code XAException.XAER_RMERR: javax.transaction.xa.XAException: Error trying to transactionRollback local transaction: ORA-17008: Closed connection

I'd be very happy if Quarkus could reliably rollback all open transactions BEFORE closing the connections. We're currently using Quarkus 3.6.8, graceful shutdown timeout not enabled.

segersb commented 4 months ago

Hi @ivu-mawi,

Thanks for your input!

Does your project maybe have multiple datasources? I'm hoping that might explain why the workaround is not fully working on your end. It was specifically made for our use case with only the default datasource: DataSource dataSource = (DataSource) dataSources.getDataSource("<default>");

If not I'm a bit worried we'll also encounter issues in our projects.

ivu-mawi commented 4 months ago

Hi, also thanks for reporting this and the workaround.

I'm afraid we also use a single, default datasource.

The single inserted entry doesn't always occur. But I'm guessing it only doesn't occur when I press CTRL+C too quickly before hibernate has decided to actually send statements to the driver or something.

ivu-mawi commented 4 months ago

I think my workaround now will be to simply delay the shutdown until all active transactions are finished:

private final AtomicInteger activeTransactionCount = new AtomicInteger(0);

void delayShutdownUntilTransactionsFinished(@Observes ShutdownEvent event) {
    while (activeTransactionCount.get() > 0) {
        LOG.info("Waiting for {} active transaction(s) to finish before shutdown", activeTransactionCount.get());
        try {
            Thread.sleep(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

// ... call increment and decrement in code that wraps each transaction ...

This might mean waiting too long, so Kubernetes will SIGKILL our app. But that means the the oracle connection will not be closed properly as well, and even Oracle does a rollback then in my tests, same as CTRL+C on Windows as you mentioned.

But I don't know how this would interact with any other ShutdownEvent Observers, if we had any. Would this delay the other ones, preventing their graceful shutdown until SIGKILL happens? Not a good solution in that case.

segersb commented 4 months ago

That's a creative workaround!

Normally you should be able to use @Priority annotation to make sure other shutdown listeners go first. We used that in the past for startup listeners, so I guess it will also work for shutdown.

SIGKILL always worked for me as well because Oracle only commits when connections are closed properly. I guess a shutdown listener that sleeps forever could also work, effectively forcing kube/oc to kill the pods.

Another approach might be to combine the Quarkus property quarkus.shutdown.timeout and the kube property terminationGracePeriodSeconds quarkus.shutdown.timeout=30S would make sure Quarkus does not start closing connections after 30S to be sure terminationGracePeriodSeconds: 0 would cause kube to kill the pods

ivu-mawi commented 4 months ago

Thanks, @Priority works! Couldn't find any docs, but seems like the default priority is < 10000, so I'm using 10000.

One thing I've noted is that any @Disposes handlers are called after all @Observes ShutdownEvent handlers, so depending on what handlers an app has, that might need to be considered.

I agree with the rest of what you said.

I'll stick with my "dynamic" delay, because it allows a normal graceful shutdown if no transactions are running, or very short transactions, which I expect is the much more likely scenario for our app. So the SIGKILL from Kubernetes would only be a rare last resort kind of thing, if someone happens to trigger a restart during one of those rare long transactions.

Still, would be much nicer if Quarkus rolls back before close of course :)

segersb commented 4 months ago

Great it helped!

Interestingly the doc for using quarkus.shutdown.timeout says: "Setting this timeout will incur a small performance penalty, as it requires active requests to be tracked." I guess they have a similar code to yours to track the open transactions.

If you still have issues, you could also play with the values of those properties. terminationGracePeriodSeconds can be set to maybe 30 second or so while quarkus.shutdown.timeout could be set to a minute. In a "normal" redeploy Quarkus should be able to shut down within the grace period op kube.

segersb commented 4 months ago

@yrodiere @gsmet Since I'm not alone having this issue, would it be possible to have a solution in Quarkus? The fact that we risk having partial transactions committed when using Oracle is not reassuring. When updating Quarkus we always have doubts that out dirty workaround keeps working properly.

yrodiere commented 4 months ago

I agree it's a problem, and I agree it should be solved. It's on my todo-list, but that todo-list is long and does not include just Quarkus.

If you want this to be solved quicker, feel free to send a PR or report this through paid support (on RHBQ, ...).

Failing that, you can rely on workarounds (I think a few were mentioned?) and I guess you can hope that the upgrade of Agroal #39072 will change something.

Felk commented 1 month ago

Stumbling across this after days of investigating a swath of lost data that was a consequence of some intermittent state being erroneously committed, I am thankful to have found this discussion and being able to present the workaround originally presented by @segersb in https://github.com/quarkusio/quarkus/issues/36265#issuecomment-1750652882 with the new capabilities added by https://github.com/quarkusio/quarkus/issues/36263 incorporated to simplify it:

/**
 * Oracle has the weird behavior that it performs an implicit commit on normal connection closure (even with auto-commit disabled),
 * which happens on application shutdown. To prevent whatever intermittent state we have during shutdown from being committed,
 * we add an explicit rollback to each connection closure. If the connection has already received a COMMIT, the rollback will not work, which is fine.
 * <p>
 * The code unwraps the {@link Connection} so that we perform the rollback directly on the underlying database connection,
 * and not on e.g. Agroal's {@link ConnectionWrapper} which can prevent the rollback from actual being executed due to some safeguards.
 */
@ApplicationScoped
public class CustomOracleRollback implements AgroalPoolInterceptor {
    @Override
    public void onConnectionDestroy(Connection connection) {
        try {
            if (connection.isClosed() || connection.getAutoCommit()) {
                return;
            }
            connection.unwrap(Connection.class).rollback();
        } catch (SQLException e) {
            Log.tracef(e, "Ignoring exception during rollback on connection close");
        }
    }
}
segersb commented 1 month ago

Hello @Felk

I also did the same on our side, though I did have to add an extra check before doing the rollback. Not sure you need this, but mentioning it just in case. We also use flyway on our side, and if I always did the rollback, the migration failed.

The connection.isClosed() check was just to avoid exception logging. The connection.getAutoCommit() check was the important one that fixed the flyway migrations.

    @SneakyThrows
    @Override
    public void onConnectionDestroy(Connection connection) {
        if (connection.isClosed() || connection.getAutoCommit()) {
            return;
        }

        OracleConnection oracleConnection = connection.unwrap(OracleConnection.class);
        oracleConnection.rollback();
    }
Felk commented 1 month ago

Good point, thanks! I edited my comment to include those two conditions