next.jdbc + c3p0 deadlocks if you run a query on a different thread

camsaul commented 1 year ago

Root problem is next.jdbc is doing

(locking <Connection>
  ...)

here

https://github.com/seancorfield/next-jdbc/blob/dc240652fc5fda30fe6e289cc2a351ebfc008fd7/src/next/jdbc/transaction.clj#L119

c3p0 com.mchange.v2.c3p0.impl.NewProxyConnection (the underlying proxy class returned by c3p0) basically tries to lock on every single method

e.g.

https://github.com/sluk3r/c3p0/blob/df4ab37ee2fc2894e5d2bf44b3973ab7b87508bf/src/main/java/com/mchange/v2/c3p0/impl/NewProxyConnection.java#L42

and if this results in instant deadlocks if you run something from a different thread.

This definitely happens if you use with-transaction in combination with a Connection, because that's what gets locked; I haven't personally ran into issues using with-transaction with a DataSource, but I think you would run into the same problem with a DataSource implementation with a synchronized getConnection method.

I ran into this today when trying to figure out why our application started suddenly taking 6 minutes to launch, after digging most of the day I figured out that we had some code somewhere that was basically doing

(deref
 (future ...)
 5000
 ::timed-out)

and it was getting triggered inside of a transaction which lead to it hanging for 5 seconds, due to some other unrelated bugs it kept trying this over and over and disaster ensued.

To repro

(defn- user-counts [^java.sql.Connection conn]
  (let [counter (atom 0)
        user-count (fn []
                     (println "Try number" (swap! counter inc))
                     (with-open [stmt (.createStatement conn)]
                       (when (.execute stmt "SELECT count(*) FROM core_user;")
                         (with-open [rset (.getResultSet stmt)]
                           (when (.next rset)
                             (.getLong rset 1))))))]
    (println (user-count))
    (println (user-count))
    (println (deref
              (future
                (println "<on another thread>")
                (user-count))
              5000
              ::timed-out))))

(with-open [conn (.getConnection (com.mchange.v2.c3p0.DataSources/pooledDataSource
                                  (reify javax.sql.DataSource
                                    (getConnection [_this]
                                      (java.sql.DriverManager/getConnection "jdbc:h2:file:/home/cam/metabase/metabase.db")))))]
  (next.jdbc/with-transaction [conn conn]
    (user-counts conn)))

Try number 1
0
Try number 2
0
<on another thread>
Try number 3
::timed-out

I don't think c3p0 is doing anything that I would consider wrong here. It seems like any JDBC implementation could have synchronized methods in their concrete java.sql.Connection classes, which would trigger the same error, even with no connection pooling involved. I think it's risky to lock on any java.sql.Connection. Not sure locking on a DataSource is much better, I wouldn't be surprised if every single method in c3p0's DataSource implementations are synchronized as well, but that wasn't causing us any problems since we have our own DataSource impl that is less fussy.

I'm not really 100% sure I understand the purpose of the lock, otherwise I'd offer to submit a PR.

Also I would like to mention that we are considering switching from c3p0 to HikariCP at some point 😄

seancorfield commented 1 year ago

The assumption is that you should never reuse a Connection across multiple threads -- it simply isn't safe because Connection objects are mutable, and any code that changes, e.g., auto-commit, is going to affect (i.e., interfere with) any other threads using the same Connection.

camsaul commented 1 year ago

Right, we weren't trying to do anything kooky like pass that Connection around and run 10 different queries on 10 different threads with it or anything like that, we just had some code that ran something on a different thread only to enforce a timeout for something. If we could do that on the same thread we would, but I don't know of any way to do that. But the Connection is still being used the same way as if it was all on one Thread

seancorfield commented 1 year ago

The repro code passes the Connection to another thread and uses it there -- which is unsafe.

Can you provide repro code that exhibits the deadlock that does not involve accessing the (locked) Connection on another thread?

camsaul commented 1 year ago

No, as far as I know this is only an issue when you use the Connection in another thread.

I agree that sharing Connections across threads generally is generally not a good idea, but that's not exactly the same as an absolute rule. Example from the Oracle that says not to do it but then tells you what to do if you really must do it:

However, Oracle strongly discourages sharing a database connection among multiple threads. Avoid allowing multiple threads to access a connection simultaneously. If multiple threads must share a connection, use a disciplined begin-using/end-using protocol.

I don't think our use case is itself inheritably unsafe. These queries are happening serially, at no point are we ever running multiple queries at the same time. The fact that a separate thread is involved is really just an implementation detail, and the code is still executed in exactly the same order as if it was all done on a single thread. So it doesn't matter if different parts are running around changing .readOnly or the transaction isolation level, or anything like that. We're comfortable with Connection being mutable and are not doing anything where that's an issue.

Either way, you're the library author, so if you don't want to support people doing things they generally shouldn't be doing, that's understandable. I just want to point out that there is no way for someone to work around this issue other than very icky hacks, and the error situation here is not obvious at all and extremely hard to work out.

seancorfield commented 1 year ago

If you're just running queries why use with-transaction? If you don't use a transaction here, there won't be a lock on the Connection and then it's "caveat programmer". I'd have to dig quite a long way back in next.jdbc's history to provide the backstory on why it locks -- but it was doing it back before the nested TX stuff was added...

seancorfield commented 1 year ago

That was easier than I expected: https://github.com/seancorfield/next-jdbc/issues/106 -- April, 2020.

camsaul commented 1 year ago

Well, the query is getting triggered after we insert something something, but the way we have things set up it's all done inside of a transaction. Basically, after inserting Object X, then do Y. It probably doesn't have to be done this way in this case honestly.

Either way, I still think there are some real cases where you'd want to do something like this in real life.

Example: running tests inside of a transaction: open a transaction, insert a lot of rows, do something that fetches those rows and check some condition, then roll back the transaction.

seancorfield commented 1 year ago

Example: running tests inside of a transaction: open a transaction, insert a lot of rows, do something that fetches those rows and check some condition, then roll back the transaction.

Not on multiple threads. Not going to allow that. Going to close this now.

camsaul commented 1 year ago

Ehh, I think there are legitimate reasons another thread might have to be involved in the tests.

If you do something to enforce a timeout on something, like what I mentioned in the initial issue description
Testing REST API endpoints (depending on how you mock things out, you might have this pick a Thread from your Jetty thread pool or something like that)

Now that I think about it, I do have a real use case outside of tests:

Insert something
Fetch some info
Insert something else related to the thing(s) you inserted in step 1

Now it would make sense to do that all inside of a transaction I think. Maybe you switch steps 1 and 2 around, but that wouldn't be as natural as fetching the data when you actually need it. Doesn't seem like it would lead to particularly elegant code if you have to go out of your way to avoid doing any SELECTs inside of a transaction.

But I don't think the real question is whether mixing DML and DQL inside a single transaction is ever warranted. The question is whether you might ever legitimately need to run something on a separate thread, and I think the answer is a yes.

But I accept your decision, we'll figure out some way to work around it.

Would you be open to a PR to add this to the differences from clojure.java.jdbc documentation? I'm not sure we would have made the switch before having a workaround in mind if we had known about the implications

seancorfield commented 1 year ago

I might be open to a PR to not do the locking in some scenarios based on the binding of *nested-tx* so there was a way of opting out based on, essentially, opting in to known, unsafe nested TX behavior.

But, basically, this was added specifically -- per #106 -- to avoid a situation that c.j.j happened to allow that led to even harder-to-debug wrong behavior. So, yes, next.jdbc might prevent some legitimately "OK" stuff but it no longer allows the accidentally wrong-but-very-hard-to-figure-out stuff.

For example, I might consider not locking if *nested-tx* is bound to :ignore which was the c.j.j behavior... Would that satisfy you here?

camsaul commented 1 year ago

Sure, I'd be happy with if we had any workaround at all. A separate option called :unsafe-you-will-regret-this works for us too 😄

I wasn't suggesting you remove it entirely anyway. I'd like to keep benefits of locking there in most cases, just with an escape hatch to disable it in the edge cases where we have a legitimate reason for not wanting it.

Happy to open a PR if you let me know how you'd like it to be done

seancorfield / next-jdbc

next.jdbc + c3p0 deadlocks if you run a query on a different thread #244