WIP: Add copy mode - Githubissues

leamingrad commented 5 years ago

This is a (very rough) initial attempt at implementing the interfaces discussed in #26.

Making the attempt at implementing has thrown up a number of issues which need to be thought through. Specifically:

Keeping backwards compatibility on the Caqti_request.t type makes everything more complicated. I think it might be easier to drop that requirement, and then make the next release with these changes in a major release to allow people to cope with the API change. I think this would also be mitigated by making sure that the helper functions do not change - so that only people who have been implementing custom Caqti_request.t instances will be affected.
I think that having the drivers contain the logic for how to build their particular streaming queries makes sense, but I am not happy with the stream_in_builder API - is there a better way to do this? Potentially we could extend Caqti_request.query to contain an option to have stream data - and then just handle this differently in each driver
To get the same functionality for sqlite the pattern seems to be to just use a prepared insert query in a single transaction - how would that fit in the current request -> response model?

paurkedal commented 5 years ago

Thanks so far. I'll have a closer look and send you a proper review, but a few comments:

On your first question: Not sure what the issue is without a closer look, but two things I noticed: You provide a conversion from one to the other, which is unsafe as it drops the input stream obligation. And I think the multiplicity parameter should only apply to the output. The t4 type should coincide with t where applicable, rather than needing a conversion.

About your second question: There is no need to register implementations in the driver info. Queries (proper) in Caqti are functions from the driver info to Caqti_request.query, so this is where one can dispatch between different databases. The driver info should only contain only metadata.

About the third question: Yes, we should do that in a transaction, which means we need to start a transaction iff there is not transaction started, which may require keeping track if it ourselves if the C library does not inform us.

A heads up: I'll ask you to remove some unrelated changes from white space to the moving the query to the Caqti_sql module. (The Caqti_request.query is only coincidentally SQL, due to the fact that we only have drivers for SQL databases at the moment. The Caqti_sql module is the residue of the decision to keep SQL out of the queries, but as you can see it's more or less ready for removal.)

paurkedal commented 5 years ago

Another thing: I don't think we need the output direction in this PR, if that makes things easier. (I think now that the two directions need to be handled somewhat differently, anyway, and only the input provides a clear advantage over existing functionality.)

leamingrad commented 5 years ago

Firstly, thanks for the prompt review on this - its really helpful. I am reworking at the moment, but I have a couple of clarifications/points to continue with.

I think the multiplicity parameter should only apply to the output.

This seems sensible. Since we are adding an input type the the full requests, would it also be worth adding an input multiplicity while the modification is being done (even though this won't be used much for now)?

Queries (proper) in Caqti are functions from the driver info to Caqti_request.query, so this is where one can dispatch between different databases. The driver info should only contain only metadata.

I understand this goal, but I'm worried about putting too much logic in the query function. Thinking about the difference between postgres and sqlite, the required query structure is radically different.

To get the right queries, it seems you would either need query logic that is very tightly coupled to the driver inside Caqti_request, or a lot of very specific metadata.

paurkedal commented 5 years ago

This seems sensible. Since we are adding an input type the the full requests, would it also be worth adding an input multiplicity while the modification is being done (even though this won't be used much for now)?

I don't think we'll need it. It's use for the result rows is for type-safeing the helpers for no rows and single rows. In case of the input stream, the only important distinction is whether it's expected or not.

I understand this goal, but I'm worried about putting too much logic in the query function. Thinking about the difference between postgres and sqlite, the required query structure is radically different.

You can always create helper functions and call them, but... maybe you have a point if we consider the possibility of defining external drivers. Then it's better to register the correct query somehow than to rely on support in the core library. So, yes I think your way is better.

paurkedal commented 5 years ago

What if we also had a query_full:

type query_full =
  | Query of query
  | Copy_in of {table: string; columns string list}
  | Copy_out of {table: string; columns string list}
  | Copy_out_query of query

which was used for the create_full function? This would make it possible for the driver to distinguish queries producing a regular result from queries producing a stream without inspecting the text of the query string, and avoid functions in the driver info.

(A reason I am not fond of adding functions to the driver info is that it make it non-serialisable. Today the driver info could be loaded from a configuration file before linking in the driver, though I have no plans of putting that to use for now.)

Another direction would be to skip output streaming, as discussed, and then completely decouple the new functionality as a single function in Caqti_connection_sig.S:

val populate :
  table: string -> columns: string list ->
  'a Caqti_type.t -> 'a Caqti_stream.t ->
  (unit, [> Caqti_error.call_or_retrieve] as 'e) result future

leamingrad commented 5 years ago

Another direction would be to skip output streaming, as discussed

To be honest, I would be happy to add this, rather than adding all the new stuff for queries etc. My main worry is that this seems to cut across the other mechanisms inside caqti, but I don't see that as being a huge problem if you are ok with it.

We could potentially make output streaming work nicely with this if we were only concerned about the table name and columns case - drivers could (optionally) define populate and extract functions to stream in and out respectively in the most efficient way possible for that database type.

If you are definitely ok with this, I'll go ahead with it today, otherwise I'll make a second attempt at the "have a fuller request/query" approach.

paurkedal commented 5 years ago

Good, let's go with populate. Without benchmarking, this covers precisely the part of functionality where we expect the performance gain. And if we decide to change it later, and find the right generalization, then I am fairly sure that populate in this form can be turned into a helper function.

One thing to note is that Caqti_request.t has a function in indexing of prepared queries for a connection. This could be solved by

splitting populate into a Caqti_request and Caqti_connection_sig.S part, or
keeping populate in once place, hash-consing the populate-specs per connection, but
using one-shot queries should also be good enough for now.

I think the first option could still be turned into a helper function in a generalized version, as long as the new intermediate type has no manifest relation to Caqti_request.t.

leamingrad commented 5 years ago

Closing this as #28 has gone in

paurkedal / ocaml-caqti

WIP: Add copy mode #27