seancorfield / next-jdbc

A modern low-level Clojure wrapper for JDBC-based access to databases.
https://cljdoc.org/d/com.github.seancorfield/next.jdbc/
Eclipse Public License 1.0

Sanction/document execute-batch! over plan #165

Closed · seancorfield closed this issue 3 years ago

seancorfield commented 3 years ago

From Slack:

;; "next" here aliases next.jdbc; insert-statement is the user's own SQL helper
(with-open [prepared (next/prepare db [(insert-statement target-table columns)])]
  (next/execute-batch!
    prepared
    (next/plan db [query-str (into-array String ids)] opts)))

This is safe by virtue of how set-parameters is implemented, rather than being inherently "safe". I'm going to add notes to execute-batch! and set-parameters so that this behavior is preserved -- being able to execute-batch! over plan to insert the results of a query is actually rather nice, so I'll document that as well.
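For context, next.jdbc.prepare/set-parameters takes a PreparedStatement and a sequence of parameter values and sets each one positionally. A simplified sketch of that behavior (the real version dispatches through the SettableParameter protocol rather than calling .setObject directly):

;; hypothetical simplified stand-in for next.jdbc.prepare/set-parameters:
(defn set-parameters-sketch [^java.sql.PreparedStatement ps params]
  (doseq [[ix v] (map-indexed vector params)]
    (.setObject ps (inc ix) v)) ; JDBC parameter indices are 1-based
  ps)

Since this only iterates the values of each row while the underlying ResultSet is positioned on it, reducing a plan result straight into a batch can work without ever realizing the rows as Clojure maps.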

I think this is a reasonable way to build the insert statement:

  (let [cols [:a :b :c :d :e]
        [sql] (next.jdbc.sql.builder/for-insert-multi :table cols [cols] {})]
    (with-open [ps (next.jdbc/prepare db [sql])]
      (next.jdbc/execute-batch! ps (next.jdbc/plan db [query etc] opts) opts)))
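The trick is that for-insert-multi returns the SQL string followed by the flattened parameter values, so passing cols as the single "row" is just a cheap way to get one ? placeholder per column; destructuring with [sql] then discards the dummy parameters. Roughly (exact whitespace in the generated SQL may differ):

(require '[next.jdbc.sql.builder :as builder])

(builder/for-insert-multi :table [:a :b :c] [[:a :b :c]] {})
;; => ["INSERT INTO table (a, b, c) VALUES (?,?,?)" :a :b :c]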

Needs some tests, then document this.

seancorfield commented 3 years ago

@Snurppa So it turns out there are actually a couple of problems with this approach, now that I've had time to run some tests against larger data sets.

The first "problem" is that you end up calling execute-batch! with just one parameter group -- which is effectively the entire output of the plan. That means that run! is called -- as you surmised -- on the reducible and that adds every single row of the plan query as a batch to the prepared statement, and then that single batched statement is dispatched. If your query result is large enough, that's going to hit the limits databases put on batch statement execution.

The second problem is that you cannot specify :batch-size as an option to execute-batch! in this scenario. That option uses partition-all to break the parameter groups up so that multiple executeBatch calls are made -- which is necessary if you hit the database limit -- but partition-all cannot be called on a reducible: the first thing it does is call seq, which fails because a reducible isn't seqable.
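The failure is easy to reproduce without a database: a plan result is an IReduceInit and nothing more, and any seq-based function chokes on that. A minimal stand-in (plain Clojure, not next.jdbc code):

;; a bare IReduceInit behaves like a plan result for this purpose:
(def reducible
  (reify clojure.lang.IReduceInit
    (reduce [_ f init]
      (reduce f init [[1 "a"] [2 "b"]]))))

(into [] reducible)                 ;; => [[1 "a"] [2 "b"]] -- reduce-based, works
(doall (partition-all 1 reducible)) ;; throws IllegalArgumentException:
                                    ;; Don't know how to create ISeq from ...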

I tested this by copying 20,000 rows from one table to another in MySQL. My first attempt "worked" because I'd forgotten to add the batch statement rewriting option to the db-spec -- so it actually sent multiple statements to the database(!). I repeated it with the option in place and got a 20,000-element vector of update counts back (my REPL didn't enjoy printing that, but it was the "correct" result). I was a bit surprised that MySQL returned -2 (java.sql.Statement/SUCCESS_NO_INFO) for every one of those counts, which means an additional method call would be needed to get the actual update counts. To avoid the shock of that large result vector, wrapping the call in count is probably reasonable, so here's the code I ended up using:

dev=> (let [cols [:id :name]
 #_=>       [sql] (builder/for-insert-multi :target_table cols [cols] {})]
 #_=>   (with-open [con (jdbc/get-connection ds)
 #_=>               ps  (jdbc/prepare con [sql])]
 #_=>     (count
 #_=>      (jdbc/execute-batch! ps (jdbc/plan ds ["select * from source_table"] 
 #_=>                                         {:builder-fn rs/as-unqualified-arrays})))))
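For reference, the batch rewriting option mentioned above is the MySQL driver's rewriteBatchedStatements connection property. next.jdbc passes unrecognized db-spec keys through to the driver as connection properties, so a db-spec along these lines should turn it on (the dbname/user/password values here are placeholders):

(def ds (jdbc/get-datasource
          {:dbtype   "mysql"
           :dbname   "example_db"    ; placeholder
           :user     "example_user"  ; placeholder
           :password "example_pass"  ; placeholder
           ;; MySQL-specific: rewrite batches into multi-row INSERTs
           :rewriteBatchedStatements true}))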

Because of these limitations, I don't feel comfortable actually sanctioning this in the documentation, so I'm going to close this out with no action.