seancorfield / next-jdbc

A modern low-level Clojure wrapper for JDBC-based access to databases.
https://cljdoc.org/d/com.github.seancorfield/next.jdbc/
Eclipse Public License 1.0

Relax check on missing keys when doing a multi insert. #268

Closed: dharrigan closed this 9 months ago

dharrigan commented 9 months ago

Hi,

Using next.jdbc 1.3.909 on Clojure 1.11.1.

In next.jdbc, for insert-multi!, at these lines, there is a check to determine if the map keys are all the same before doing an insert. If they are not all the same (or if some keys are missing), an exception is thrown.
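For example (ds and :mytable here are hypothetical), a call like this fails the check because the second map lacks :b:

(require '[next.jdbc.sql :as sql])

;; hypothetical datasource and table: the second map is missing :b,
;; so next.jdbc throws an exception before touching the database
(sql/insert-multi! ds :mytable [{:a 1 :b 2}
                                {:a 3}])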

This is different from how clojure.java.jdbc handles it - in the sense that clojure.java.jdbc allows for multi-inserts even if the keys are different (or missing).

I believe this check in next.jdbc is too strict and should be relaxed. It is perfectly acceptable to have a collection of maps for insertion with some keys missing (due to previous logic in the application), knowing that certain columns in the table permit NULLs or default to known values, so the absence of keys shouldn't cause an exception. Perhaps let the database throw the exception instead, if an attempt is made to insert a row with missing values, as it's the source of truth :-)

Please could you consider removing this check in next.jdbc to permit maps to have missing or different keys.

Thank you.

-=david=-

seancorfield commented 9 months ago

One issue to consider here is that if I relax this check, because the vector-of-maps is turned into a vector-of-column-names and a sequence of vector-of-row data, any hash maps that are missing columns will get explicit nil (NULL) values inserted as opposed to being omitted from the INSERT and therefore getting whatever DEFAULT value the column has.
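A rough sketch of that transformation (not next.jdbc's actual implementation) shows where the explicit nils would come from if the check were relaxed:

;; sketch only: collapse heterogeneous maps into columns plus row vectors;
;; any key missing from a map becomes an explicit nil in its row
(let [rows [{:a 1} {:a 2 :b 3}]
      cols (vec (distinct (mapcat keys rows)))]
  {:cols cols
   :rows (mapv (fn [row] (mapv #(get row %) cols)) rows)})
;;=> {:cols [:a :b], :rows [[1 nil] [2 3]]}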

Some databases allow DEFAULT as a value in an INSERT, instead of nil (mapping to NULL), so that's a possible option here.

Do you know if your database(s) support DEFAULT as a value for an INSERT?

Concrete example:

(sql/insert-multi! ds :mytable [{:a 1} {:a 2 :b 3}])
;;=>
INSERT INTO mytable (a, b) VALUES (1, DEFAULT), (2, 3)

That's much harder to do and will not work with batch inserts, so I'd rather not go down that path. (This is something that HoneySQL supports for non-batch inserts, BTW.) I'm beginning to lean toward not implementing this at all within next.jdbc and just saying "use HoneySQL", because the friendly SQL functions are not meant to be very sophisticated...
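For reference, a sketch of the HoneySQL approach, assuming its :default value support (which renders as the SQL DEFAULT keyword); the output shown is approximate:

(require '[honey.sql :as hsql])

;; :default renders as DEFAULT; other values become ? parameters
(hsql/format {:insert-into :mytable
              :columns [:a :b]
              :values [[1 :default] [2 3]]})
;;=> roughly ["INSERT INTO mytable (a, b) VALUES (?, DEFAULT), (?, ?)" 1 2 3]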

dharrigan commented 9 months ago

Hi Sean,

The databases I'm most familiar with, PostgreSQL and MySQL (and, I suppose by extension, MariaDB), all support using DEFAULT for a column value if the value is missing from the insert.

I guess my feeling is that it is the responsibility of the programmer to deal with any (multi) insertions that don't include all the keys (perhaps they were stripped out during a previous filter step, or never supplied in the first place) and to figure out the best course of action (either ensure the keys are there, or ensure there are defaults on the columns). If an insert is attempted and a column is not nullable and the key (and value) are missing, the database would throw an error, and I would hope the programmer would catch that during their testing and do the appropriate thing.

In my case, for the moment, I'm doing a default-with-merge to work around the all-keys-must-be-present check in next.jdbc, e.g.,

(def default-columns {:a nil :b nil :c nil})

(->> (map #(merge default-columns %) coll) ;; fill in any missing keys with nil
     (map #(select-keys % [:a :b :c]))     ;; keep only the table's columns
     (map #(assoc % :x "x" :y "y"))        ;; add values common to every row
     (insert-multi! db :table))

Here, due to some logic beforehand, it's possible that :b or :c is missing for some items in the collection (but not all) going into the threading macro.

I would side with your reasoning that next.jdbc shouldn't do anything fancy.

seancorfield commented 9 months ago

"all support using DEFAULT for a column value if the value is missing from the insert."

That's not what I'm asking. Read the SQL code example that uses DEFAULT (I'm not talking about DEFAULT in DDL).

"If an insert is attempted and a column is not nullable and the key (and value) are missing, the database would throw an error, and I would hope the programmer would catch that during their testing and do the appropriate thing."

True. And I'm considering going that far -- I'm just giving the caveat that missing columns in a particular row's hash map will be inserted as NULL, so a nullable column with a default value would get NULL rather than the default, which might surprise some people.

I think that's a niche use case but it does exist.
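A minimal sketch of that niche case (ds, the table, and the column names are hypothetical):

(require '[next.jdbc :as jdbc])

;; hypothetical table: note is nullable AND has a default
(jdbc/execute! ds ["CREATE TABLE t (id INT, note TEXT DEFAULT 'n/a')"])
(jdbc/execute! ds ["INSERT INTO t (id) VALUES (1)"])             ;; note => 'n/a'
(jdbc/execute! ds ["INSERT INTO t (id, note) VALUES (2, NULL)"]) ;; note => NULL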

At this point, I think I would want to make the NULL-for-missing-columns behavior opt-in, since it could potentially be slightly dangerous.

dharrigan commented 9 months ago

Hi,

I'm familiar with DEFAULT; perhaps I worded my reply incorrectly:

foo=> create table bar (id smallint, firstname text not null default 'mizu');
CREATE TABLE
foo=> insert into bar (id, firstname) values (1, DEFAULT);
INSERT 0 1
foo=> select * from bar;
┌────┬───────────┐
│ id │ firstname │
├────┼───────────┤
│  1 │ mizu      │
└────┴───────────┘
(1 row)

Having an opt-in for multi-insert might be a good way forward as you suggest.

-=david=-

seancorfield commented 9 months ago

I've spent some more time looking at c.j.j's behavior and I realized that the "sequence of hash maps" behavior in c.j.j is not like next.jdbc.sql/insert-multi! at all.

In c.j.j, if you use insert-multi! with rows (hash maps), it performs a separate insert for each row -- so the hash maps do not have to be consistent, and missing columns will get whatever default value is defined for that column (not NULL).

In c.j.j, if you use insert-multi! with columns and rows (sequence of vectors), it checks they are all the same size -- no hash maps involved here -- and you get the batched insert behavior that next.jdbc.sql/insert-multi! forces on you.
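For reference, the two c.j.j call shapes side by side (db-spec assumed):

(require '[clojure.java.jdbc :as j])

;; rows as hash maps: one INSERT per map, and the keys may differ
(j/insert-multi! db-spec :mytable [{:a 1} {:a 2 :b 3}])

;; columns + rows as vectors: a single batched INSERT; every row must
;; supply exactly as many values as there are columns
(j/insert-multi! db-spec :mytable [:a :b] [[1 nil] [2 3]])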

Here's the docstring of c.j.j's insert-multi!:

  When inserting rows as a sequence of maps, the result is a sequence of the
  generated keys, if available (note: PostgreSQL returns the whole rows). A
  separate database operation is used for each row inserted. This may be slow
  if a large sequence of maps is provided.

  When inserting rows as a sequence of lists of column values, the result is
  a sequence of the counts of rows affected (a sequence of 1's), if available.
  Yes, that is singularly unhelpful. Thank you getUpdateCount and executeBatch!
  A single database operation should be used to insert all the rows at once.
  This may be much faster than inserting a sequence of rows (which performs an
  insert for each map in the sequence).

So, a "compatibility mode" here could detect the hash maps not all having the same keys and potentially resort to a sequence of single-row insert! calls instead -- with all the attendant performance trade-offs. I think I'd rather push that trade-off directly onto the programmer tho': making them either explicitly augment the rows so they all match, or explicitly doing row-by-row insert! operations.

It looks like you've already gone for the former approach -- which provides the batch insert performance (which you were not getting with c.j.j).

Thoughts?

dharrigan commented 9 months ago

Hi,

Thank you for the investigation. I follow what you're saying up to the last point - are you considering having an opt-in if the programmer knows (in advance) that sometimes their maps may contain missing keys?

If so, then I think that's totally fine, with (of course) documentation around the behaviour of n.j when the opt-in isn't enabled and the maps contain inconsistent keys, and likewise noting that when the opt-in is enabled and the maps are inconsistent, n.j would fall back to doing a per-row insert.

All good stuff! Thank you indeed.

-=david=-

seancorfield commented 9 months ago

"are you considering having an opt-in if the programmer knows (in advance) that sometimes their maps may contain missing keys?"

"I think I'd rather push that trade-off directly onto the programmer tho': making them either explicitly augment the rows so they all match, or explicitly doing row-by-row insert! operations."

In other words, keep the behavior exactly as-is but clearly document that insert-multi! requires the same keys in all hash maps in order to get the bulk INSERT, and that if you want to insert heterogeneous hash maps instead, you must explicitly mapv insert! over your sequence of hash maps.
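A minimal sketch of that row-by-row alternative (ds and :mytable assumed):

(require '[next.jdbc.sql :as sql])

;; heterogeneous hash maps: one INSERT (and one database round trip)
;; per row, with each row naming only the columns it actually has
(mapv #(sql/insert! ds :mytable %) [{:a 1} {:a 2 :b 3}])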

I don't think it's a good idea to offer an option that can silently change fast, bulk inserts into slow, sequential inserts just because one or more hash maps in a sequence do not have the same keys as all the others. Throwing an exception, and forcing the programmer to "do the right thing", is a safer approach here.

dharrigan commented 9 months ago

Sounds good to me! :-)

seancorfield commented 9 months ago

That commit contains the documentation updates, so you can see what I changed.