seancorfield / next-jdbc

A modern low-level Clojure wrapper for JDBC-based access to databases.
https://cljdoc.org/d/com.github.seancorfield/next.jdbc/
Eclipse Public License 1.0
767 stars 90 forks source link

Version 1.3.874 is much slower than 1.2.772 #250

Closed ozangulle closed 1 year ago

ozangulle commented 1 year ago

Describe the bug Summary: I tried migrating from version 1.2.772 to 1.3.874 which increased the runtime of a batch process from around 80 seconds to longer than 15 minutes. I stopped execution each time after 15 minutes.

I have a batch process where I'm writing header and body information from a very large mbox file to a sqlite db. In order to save some time, I'm buffering the data in buffers of 1000 and 500 elements long respectively and then writing the data using the insert-multi! function. This worked exceptionally well in next-jdbc version 1.2.772. After migrating to version 1.3.874, writing to db becomes extremely slow.

To Reproduce The next-jdbc code I'm using is

(defn mail-map->vector [mail-map-vector]
  (pmap (fn [mail] 
    [(get mail :sender) (get mail :content-type) (get mail :recipients) (get mail :subject) (get mail :message-id) (get mail :date)]) 
    mail-map-vector)
  )

(defn save-emails
  [mails]
  (sql/insert-multi! ds :emails 
    [:sender :content_type :recipients :subject :message_id :date]
    (mail-map->vector mails)
    {:batch true}))

The rest of the program is very basic, essentially calling save-emails after the buffer is full.

Expected behavior The behavior of the program seems to be the same. It's just much slower.

Stacktraces n/a

project.clj/deps.edn I'm using Leiningen

[org.clojure/clojure "1.10.3"]
[org.clojure/java.data "1.0.95"]
[org.xerial/sqlite-jdbc "3.36.0.2"]
[com.github.seancorfield/next.jdbc "1.3.874"] ; used to be [com.github.seancorfield/next.jdbc "1.2.772"]
[commons-io/commons-io "2.11.0"]
[org.clojure/core.async "1.5.648"]

Environment (please complete the following information):

Additional context n/a

Edit on 2023-04-24: Changed the indentation in the examples to make them easier to read.

seancorfield commented 1 year ago

Can you test on 1.2.780 and 1.2.790? There were changes to insert-multi! in 1.2.790 so if you see the performance change between those two releases, that would likely point the finger at that change.

If so, you might want to try adding :batch true as an option in your call to insert-multi! and see if that makes a difference (in 1.2.790 or later).

ozangulle commented 1 year ago

I tested it on 1.2.780 and 1.2.790. The change in .790 seems to be what's causing this issue. 1.2.780 runs just fine.

I am already using {:batch true} in my calls; you can see it in the example under To Reproduce in the ticket. I assumed this is how you are supposed to pass the options according to the documentation. Am I using it wrong?

seancorfield commented 1 year ago

How were you calling insert-multi! in your original code (presumably, without :batch true since that was added in 1.2.790)?

seancorfield commented 1 year ago

Your mail-map->vector function takes a sequence of hash maps and turns it into a sequence of row values. Since this is a simple operation, I suspect pmap isn't buying you anything here -- the following would likely be faster and is simpler:

(mapv (juxt :sender :content_type :recipients :subject :message_id :date) mails)

Before you added :batch true, with 1.2.772, you would pass a sequence of row values, and it would construct a single INSERT statement with a placeholder for every value to send to the database.

Just upgrading to 1.2.790 (or 1.3.874) without adding :batch true should do the exact same thing and have the exact same performance.

When you add :batch true, however, you are telling next.jdbc that you want the rows batched up and each row added as batch parameter -- and the INSERT statement will contain only placeholders for one row. Per the docs for execute-batch!, not all databases support batch insertion and some of them require specialized connection URL parameters to enable the actual batch operation: so it's likely that you've gone from one insert for your entire mails data to an insert per row of that data, since it's possible SQLite doesn't support batch operations (I don't know -- you'd have to consult its documentation).

In addition, 1.2.790 introduced the ability to pass a sequence of hash maps directly, so you should be able to just do this:

(defn save-emails
  [mails]
  (sql/insert-multi! ds :emails mails))

This will pick apart the columns and row values automatically and perform a single INSERT by default.

ozangulle commented 1 year ago

Thank you, removing :batch true did the trick. I got the same performance in 1.2.790 as well as in 1.3.874. I will check whether it's possible to use batch operations in sqlite. In the meanwhile, just removing the option is working quite well.

Regarding passing the hash map directly; the column names in the db are in snake case whereas the parser returns the hash map keys in kebab case. Adjusting the parser to return the keys in snake case and letting next.jdbc handle it sounds more reasonable though, now that I can migrate to the latest version. Thanks again for your help!

seancorfield commented 1 year ago

Ah, I missed the snake/kebab issue. There's snake/kebab conversion options built into next.jdbc https://cljdoc.org/d/com.github.seancorfield/next.jdbc/1.3.847/api/next.jdbc#snake-kebab-opts