rmculpepper / sql

Embedding of some of SQL into Racket

Insert a dynamic number of rows #9

Closed LiberalArtist closed 5 years ago

LiberalArtist commented 5 years ago

I am about to have to modify some code that generates SQL statements like:

INSERT INTO somewhere (a, b, c)
VALUES ($1, $2, $3), ($4, $5, $6)

but where the number of rows to be inserted is only known at runtime.

I would love to use this library's unquote parameters to replace the painful manual process of keeping query parameters properly numbered and in sync with the actual argument list, but this library doesn't currently seem to support inserting a dynamic number of rows. (At least, I can't see a way to do it.) I'm very interested in trying to make a pull request to support this, but it isn't immediately clear to me what "The Right Thing" to do would be.

(I'm also interested in supporting queries like SELECT title FROM books WHERE id IN ($1, $2, $3) with a dynamic number of query parameters, though that's less urgent for me personally, and I can see better work-arounds in that case.)
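(For context, the single-row case already works nicely with the library's unquote parameters; a sketch, assuming the `insert`/`#:set` forms from the docs:)

```racket
#lang racket
(require sql db)

;; Single-row case today: ,n and ,d are turned into query parameters
;; and bound to the current values of the Racket variables, so the
;; parameter numbering stays in sync with the arguments automatically.
(define (insert-number! conn n d)
  (query-exec conn
              (insert #:into the_numbers #:set [n ,n] [d ,d])))
```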

I'm not deeply familiar with the SQL grammar, and the sources I've looked at (e.g. for PostgreSQL and SQLite) seem more focused on the concrete syntax than on explaining the structure in terms of nonterminals. (I don't have the Date and Darwen book and probably can't get it before I need to work on this code.)

I can see a few possible approaches, and I'm interested in which would be best to pursue:

  1. Personally, I'd most like to use an extended version of the #:set [column-ident scalar-expr] ... syntax: I much prefer addressing columns by name rather than by position, and I could imagine adding a check to make sure that all rows to be inserted declare the same columns. However, it seems like it might be better to implement this as a derived construct.
  2. The approach most consistent with the current design would seem to be adding an unquote-splicing case to, say, ScalarExpr:AST, but, as I said, I'm not totally clear on what the right grammar element would be. (It seems a lot like the $%row "special scalar expression," but they might just use the same concrete syntax.) Also, there are cases when a single occurrence of a grammar element is allowed but a splice would not be.
  3. There could be a function like make-values*-table-ast with the contract (-> (listof (listof scalar-expr-ast?)) table-expr-ast?).
  4. Based on https://github.com/rmculpepper/sql/issues/2#issuecomment-315628265, there could be a function named something like syntax->table-expr-ast or parse-table-expr-ast with the contract (-> syntax? table-expr-ast?).
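(To make (3) concrete, here is a hypothetical sketch. It assumes the proposed make-values*-table-expr-ast, the library's scalar-expr-qq, and a (TableExpr:AST ,ast) escape for splicing a dynamically built AST into the insert form; none of these details are settled.)

```racket
;; Hypothetical: build one VALUES table expression from runtime data,
;; then splice it into a plain INSERT ... #:from.
(define (rows->insert rows)  ; rows : (listof (list/c number? string?))
  (define values-ast
    (make-values*-table-expr-ast
     (for/list ([row (in-list rows)])
       ;; each unquoted value would become a query parameter
       (list (scalar-expr-qq ,(first row))
             (scalar-expr-qq ,(second row))))))
  (insert #:into the_numbers
          #:from (TableExpr:AST ,values-ast)))
```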
rmculpepper commented 5 years ago

You're right, this library doesn't provide a way of expressing INSERTs of a dynamically determined number of rows. Do you really need a single INSERT statement that inserts multiple rows, or could you just use multiple statements? If round-trip time is a critical issue, it might make more sense to add some sort of batch query operation to the db library.
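(For comparison, the multiple-statement workaround is straightforward to write today; a sketch using db's call-with-transaction so the loop at least commits once, though each row is still a separate round trip:)

```racket
#lang racket
(require db sql)

;; Sketch: one INSERT per row, wrapped in a single transaction.
;; Each row still costs a round trip to the server.
(define (insert-rows! conn rows)  ; rows : (listof (list/c number? string?))
  (call-with-transaction
   conn
   (lambda ()
     (for ([row (in-list rows)])
       (define n (first row))
       (define d (second row))
       (query-exec conn (insert #:into the_numbers #:set [n ,n] [d ,d]))))))
```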

Otherwise, I think approach (3) would be best. I'm not seeing how approach (1) would work; do you have an example of how you would expect an INSERT with a dynamically determined number of rows to be expressed? (2) would complicate the grammar; those are not ScalarExpressions, they're a separate thing that would need a name, an AST type, a predicate, etc. (4) is the most powerful and flexible in the long run, but it would require the most thought and effort to get working.

LiberalArtist commented 5 years ago

Yes, my concern is round-trip time, and a batch operation in db would be a more direct way to address that. The operations I'm about to deal with only happen when the long-running program starts up, so I could live with degraded performance for a while, but I insert roughly 35,000 rows, and my guess is that running 35,000 statements would be noticeably slower.

(Writing this reminded me that I've already been doing these in groups of no more than 1000 rows per INSERT statement because doing them all at once would give me errors like "query-exec: wrong number of parameters for query expected: 19494 given: 85030" for large numbers of rows. I assumed that I was hitting some kind of limit on the PostgreSQL side, but it just occurred to me that it could be something in the db library. If it would be useful, I can try to make a minimal example and open an issue on that repository.)

For approach (1), I had imagined writing an INSERT statement with multiple rows in a syntax like:

(insert #:into the_numbers
         #:set [n 0] [d "nothing"]
         #:set [n 1] [d "the loneliest number"]
         #:set [n 2] [d "company"]
         #:set [n 3] [d "a crowd"])

But, as you note, that doesn't by itself address supporting a dynamically determined number of rows: I saw it more as an extension that could be added in combination with (4) (i.e. a syntax->table-expr-ast function) or something else.

I am reasonably confident that I could implement (3), i.e. a make-values*-table-expr-ast function. It seems like (4) and a batch query operation at the db level would both be broadly useful, but more challenging to implement. For (4), I think I have enough of a sense of how this library is implemented to get started, but I'm guessing there are subtleties I haven't thought through. Adding a batch query operation on the db side seems most generally useful (i.e. not specific to this library), but also most difficult for me to try to implement. I've looked at the implementation of db, but I'm not very familiar with it, and I don't know if the various supported database systems provide some sort of notion of batch queries to build on. I can continue to look into this further.

rmculpepper commented 5 years ago

The PostgreSQL protocol imposes a limit of 2^16-1 parameters; when I tried preparing a statement with more than that, the resulting prepared statement claimed to take 0 parameters, and no warning was issued by the server. The db library further limits that to 2^15-1 because I read the parameter count as a signed integer; I'll fix that.
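(As a sanity check, that limit is consistent with the error message quoted earlier: the "expected" count is the true count wrapped modulo 2^16.)

```racket
;; The protocol carries the parameter count in 16 bits, so 85030
;; parameters wrap around to the count reported in the error message.
(modulo 85030 (expt 2 16))  ; => 19494, matching "expected: 19494 given: 85030"
```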

Ah, I see. Yes, (1) and (4) go together. I'd rather not do (4) now, though.

I just pushed a fix to the handling of unquoted scalar values and dynamic AST composition (e.g., (NT:AST ,ast-expr)). I also added a new module, private/dynamic.rkt, with a few dynamic AST constructors. If you want to make a PR adding `make-values*-table-expr-ast` there, it should be relatively straightforward now. Or if you want me to add it, I can do that too.

LiberalArtist commented 5 years ago

An update: I've done an initial rewrite of the code I was referring to using this library instead of wrangling SQL strings manually. I haven't done any formal measurements, but, as expected, making 35,000 round trips to the database does seem to be appreciably slower. Fortunately (sort of), this code was already fairly slow and, as I said, only affects startup time, so I can live with the bad behavior while I think about this further.

Adding make-values*-table-expr-ast seems straightforward, and I'll probably do that. Having now actually rewritten the SQL generation code, though, the naïve version is wonderfully clear and readable, and I'm loath to lose that. (That holds beyond these specific INSERT statements: the rewritten version, which adds some new functionality, is about the same number of lines of code as the string-wrangling version and is far more readable.) I'll probably try to address that with a macro on my side implemented over make-values*-table-expr-ast, but if I get ambitious I may look into adding a batch query operation to db.

Thanks for all the help with this!

LiberalArtist commented 5 years ago

While thinking about this, I noticed that (table-expr-qq (values* (1 2 3) (1 2))) doesn't raise a syntax error. Unless there is some obscure facet of SQL I don't know about, that shouldn't be a legal expression: it would render to VALUES (1, 2, 3), (1, 2), with rows of different arity. If I'm right, I'd be happy to make a PR adding a check that the lists are the same length.
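(A sketch of such a check over the parsed row lists; the name and error format are illustrative only:)

```racket
;; Sketch: raise an error unless every VALUES row has the same arity.
(define (check-rows-same-length rows)
  (unless (null? rows)
    (define len (length (first rows)))
    (for ([row (in-list rows)]
          [i (in-naturals 1)])
      (unless (= (length row) len)
        (error 'values* "row ~a has ~a columns; expected ~a"
               i (length row) len)))))
```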