paurkedal / ocaml-caqti

Cooperative-threaded access to relational data
https://paurkedal.github.io/ocaml-caqti/index.html
GNU Lesser General Public License v3.0

Postgresql Copy Mode Support #26

Closed leamingrad closed 4 years ago

leamingrad commented 5 years ago

Further to #24, I am now trying to work out the best way to insert large amounts of data into a postgres database quickly. Based on the documentation here, the best way to do this is to make use of the COPY command to stream in the data.

I have run some experiments, and the use of COPY speeds up data insertion by roughly 10x in the case of my data, so I definitely want to make use of it (although I guess it is not always this much of a speed up).

Unfortunately, COPY is a postgres-specific extension (which is supported by postgresql-ocaml but unused by Caqti). It seems there is an equivalent for MariaDB, but I haven't been able to find a convincing equivalent for SQLite.

This leads to a couple of questions:

My opinion is that this would be a nice addition to Caqti. A good precedent here is transaction support, which triggers different connection code based on the can_transact parameter in the driver_info for a given driver. Maybe a can_copy parameter would let drivers declare copy methods, and an error would be raised if you tried to use them with a driver that did not support them? This would be combined with some new helper functions for CONNECTION objects.

I am happy to make a PR with an attempt at the changes if you think this is worth pursuing.

paurkedal commented 5 years ago

I think this would be a great addition, but it requires some thought. I'll think out loud now, but I would also like to hear any further thoughts you've had about the API.

First, I would prefer that Caqti do the encoding or decoding of data sent or received, with the corresponding type information included in the request type. Apart from type safety and convenience, this offers more abstraction from the RDBMS, which is one of the goals of Caqti. This means we can't just attach auxiliary input/output streaming to existing response-processing functions.

From my reading of the manual, the most general solution would allow parameters, result rows, and either an input or an output stream in the same request. This suggests we would need one or two more type parameters on Caqti_request.t to indicate the row type and direction. However, I think it would be reasonable to restrict ourselves to what can be done in a single statement, as this is already a limitation of prepared queries. In that case, we might reuse the type parameter indicating the regular result as the row type of the copy operation, i.e. something like

val copy_in :
  ?env: (Caqti_driver_info.t -> string -> query) ->
  ?oneshot: bool ->
  'a Caqti_type.t -> 'b Caqti_type.t ->
  string -> ('a, 'b, [> `Copy_in]) t

val copy_out :
  ?env: (Caqti_driver_info.t -> string -> query) ->
  ?oneshot: bool ->
  'a Caqti_type.t -> 'b Caqti_type.t ->
  string -> ('a, 'b, [> `Copy_out]) t

However, I think we could unify the latter with regular response processing. This would allow falling back to SELECT for databases without copy-out. We may need to look at the query string to determine which mode we use, unless we provide an alternative to Request.create which receives this information from its callback.

But there is another issue: there are various escape mechanisms indicated by the query string, which would be better chosen by the driver. So, maybe the user should just supply the table and column names instead of the full query. Then we can skip the potentially fragile inspection of the query string to determine the mode, and we could probably also simulate copy-in with INSERT if not available.

So, a revision of the above might look like:

val copy_in :
  ?env: (Caqti_driver_info.t -> string -> query) ->
  ?oneshot: bool ->
  'a Caqti_type.t -> 'b Caqti_type.t ->
  table: string -> ?columns: string list -> unit -> ('a, 'b, [> `Many_in]) t

val copy_out :
  ?env: (Caqti_driver_info.t -> string -> query) ->
  ?oneshot: bool ->
  'a Caqti_type.t -> 'b Caqti_type.t ->
  table: string -> ?columns: string list -> unit -> ('a, 'b, [> `Many]) t
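To make the escape-mechanism point above concrete: for PostgreSQL's text COPY format, the driver-side encoding is tab-separated fields with backslash escapes. A minimal sketch of such a driver-internal helper (hypothetical, not part of Caqti):

```ocaml
(* Encode one row for PostgreSQL's text COPY format: fields are
   tab-separated, rows are newline-terminated, and backslash, tab,
   newline, and carriage return must be escaped. *)
let escape_copy_field s =
  let buf = Buffer.create (String.length s) in
  String.iter
    (function
      | '\\' -> Buffer.add_string buf "\\\\"
      | '\t' -> Buffer.add_string buf "\\t"
      | '\n' -> Buffer.add_string buf "\\n"
      | '\r' -> Buffer.add_string buf "\\r"
      | c -> Buffer.add_char buf c)
    s;
  Buffer.contents buf

let encode_copy_row fields =
  String.concat "\t" (List.map escape_copy_field fields) ^ "\n"
```

The point being that this encoding belongs to the driver, not to the user-supplied query string.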

I am open to a copy-in only implementation, but I think it's useful to discuss both directions to see how it will fit into a full solution.

Before we start coding anything: the copy-related functions currently supported by the postgresql bindings are obsolete, so we may want to contribute support for PQputCopyData, PQputCopyEnd, and PQgetCopyData. We should also make sure the LOAD DATA support in the MariaDB C connector is compatible with the solution we choose.

paurkedal commented 5 years ago

I proposed the addition of the new COPY-handling functions in mmottl/postgresql-ocaml#31.

leamingrad commented 5 years ago

Firstly, thanks for handling the updates to postgresql-ocaml that will be needed for this; I hadn't noticed that the exact functions had changed.

I initially envisaged only adding copy-in functionality, but I agree working out an API that suits both is a good way to do this.

Some points on the above:

First, I would prefer that Caqti do the encoding or decoding of data sent or received, with the corresponding type information included in the request type.

I couldn't agree more - this is definitely the right approach.

But there is another issue: there are various escape mechanisms indicated by the query string, which would be better chosen by the driver. So, maybe the user should just supply the table and column names instead of the full query.

This probably needs broadening slightly, as it is possible to copy out from the result of an arbitrary query. That is not a huge issue here, though, as you could change copy_out to:

val copy_out :
  ?env: (Caqti_driver_info.t -> string -> query) ->
  ?oneshot: bool ->
  'a Caqti_type.t -> 'b Caqti_type.t ->
  ?table: string -> ?columns: string list -> ?query: string -> unit -> ('a, 'b, [> `Many]) t

and then validate that either a query or a table name and list of columns was passed in.
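That validation could be a small guard in the constructor, along these lines (hypothetical helper; the real function would go on to build the request):

```ocaml
(* Accept exactly one of ~table (with optional ~columns) or ~query;
   ~columns only makes sense together with ~table. *)
let check_copy_out_args ?table ?columns ?query () =
  match table, query with
  | Some _, Some _ ->
      invalid_arg "copy_out: ~table and ~query are mutually exclusive"
  | None, None ->
      invalid_arg "copy_out: either ~table or ~query is required"
  | None, Some _ when columns <> None ->
      invalid_arg "copy_out: ~columns requires ~table"
  | _ -> ()
```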

More generally, a quick question on naming: is it worth changing the names of these functions to something like stream_in and stream_out? copy_in and copy_out sound postgres-specific, and streaming is probably the closest OCaml concept to how this will actually be used.

Other than that, this looks good to me.

paurkedal commented 5 years ago

This probably needs broadening slightly as it is possible to copy out from the result of an arbitrary query,

Do you mean that one can use PQgetCopyData on non-COPY statements? Could you back that up with a link? What about the other direction?

The name changes are okay. I also considered populate for one direction, but can't think of a matching name for the other direction.

leamingrad commented 5 years ago

Do you mean that one can use PQgetCopyData on non-COPY statements?

Fortunately not - I just mean that the docs say that you can do COPY (query) FROM rather than just COPY (table_name) FROM where query must be:

A SELECT, VALUES, INSERT, UPDATE or DELETE command whose results are to be copied. Note that parentheses are required around the query.

paurkedal commented 5 years ago

You are right. But the variant with a general query only applies to COPY ... TO, so if we implement both directions in their generality, their interfaces will have to differ in the way the query is specified as well. I would prefer splitting the COPY ... TO case into two functions to avoid run-time checking of the parameters. But let's skip this direction unless we can come up with a good reason for supporting it. My guess would be that there is only a minor performance difference compared to a plain query, since the COPY variant puts no constraint on functionality, and since both deliver results row by row (assuming single-row mode) using a simple encoding.

There is one thing that has been bugging me about my proposed copy_in signature, namely that the second parameter of Caqti_request.t would be used for both input and output. If that type parameter, as it is today, had variance, it would lose the variance by being used for both encoding and decoding. So, we should probably add another type parameter.

paurkedal commented 5 years ago

About my last point on how to type the input stream, we could do something like:

type counit
(** The empty type. *)

type ('a, 'b, 'c, +'m) t4 constraint 'm = [< `Zero | `One | `Many]
(** The full request type, generalizing {!t} with the type parameter ['b], being
    the type of a row of an input stream used to populate a database table. *)

type ('a, 'c, 'm) t = ('a, counit, 'c, 'm) t4
(** ... *)
leamingrad commented 5 years ago

That makes sense, but could we just reuse the existing multiplicity type, as it seems like this would duplicate the behaviour?

paurkedal commented 5 years ago

I am not sure if I understand your question or suggestion, but the 'm parameter still types the multiplicity of the result rows ('c) only. I don't think we need a multiplicity for the input stream ('b), since it would be either `Zero or `Many, and the counit type enforces `Zero multiplicity by having no constructors. The 'm parameter is also mostly a type-safe way to support convenience functions for extracting zero or one rows, so we don't need it for the input stream.
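As a side note for readers: the no-constructors property can be expressed directly with an empty variant (available since OCaml 4.07), which also lets the compiler discharge matches on it via a refutation case:

```ocaml
(* An empty type: no constructors, so no value of type counit exists. *)
type counit = |

(* Since counit is uninhabited, a single refutation case convinces the
   compiler that this function is total despite returning any type: *)
let absurd (x : counit) : 'a = match x with _ -> .
```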

leamingrad commented 5 years ago

Sorry, ignore the last comment - I completely misread your suggestion.

Just to check I understand everything: what you are actually proposing is to change the type of Caqti_request.t to add a new 'b parameter representing the type of a row of the input stream (which pushes the old 'b parameter into the 'c position)? This would then collapse into the existing type by setting 'b to counit.

That sounds sensible to me - and I guess for the initial version of the copy in function, 'c would be set to counit as we would expect nothing to be returned by the database during the operation.
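To spell that understanding out (hypothetical aliases following the naming in this thread, with a stub in place of the real request type):

```ocaml
type counit = |

(* Stub of the proposed request type, just to type-check the aliases;
   the real t4 would carry the query and the parameter/row codecs. *)
type ('a, 'b, 'c, +'m) t4 = Dummy
  constraint 'm = [< `Zero | `One | `Many]

(* A plain request has no input stream, so 'b = counit: *)
type ('a, 'c, 'm) t = ('a, counit, 'c, 'm) t4

(* An initial copy-in request returns no result rows, so 'c = counit: *)
type ('a, 'b) copy_in = ('a, 'b, counit, [`Zero]) t4
```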

As a side question, would you be open to renaming the type parameters to make this a bit clearer at a glance in function signatures etc.?

I am going to attempt a first implementation of this over the next few days, as I think the design is now at a point where it should be possible to get something working. I'll focus on the copy-in behaviour on postgres first, as that should require all the new interfaces to be put in place.

paurkedal commented 5 years ago

I think it is good in the long run to rename t4 to t if that's what you mean, but I'd like to have a deprecation phase for the current t first, and only change the type in the next major version, which may be some time into the future yet.

leamingrad commented 5 years ago

Sorry for more confusion here - I actually mean making the type parameters more explicit, so the above becomes something like:

type counit
(** The empty type. *)

type ('params, 'input_row_type, 'output_row_type, +'multiplicity) t4
  constraint 'multiplicity = [< `Zero | `One | `Many]
(** The full request type, generalizing {!t} with the type parameter
    ['input_row_type], being the type of a row of an input stream used to
    populate a database table. *)

type ('params, 'row_type, 'multiplicity) t = ('params, counit, 'row_type, 'multiplicity) t4
(** ... *)

maybe something a bit less verbose than this, but you get the idea.

paurkedal commented 5 years ago

Yes, that may be a good idea, especially in the type definition.

leamingrad commented 5 years ago

I have put up a (very rough) first attempt at this for comments in #27, but I have run into quite a few issues when attempting to implement - I am happy to move the discussion there if it is easier.

paurkedal commented 5 years ago

As a slight improvement, populate now uses a prepared query. My first idea was to automatically deallocate requests using finalisers, but it's not ideal to use the GC to reclaim external resources, so I opted for adding an explicit deallocate function to the connection API instead.

I also changed populate to handle stream errors (after an adjustment to Stream).

I am planning to do a release soon, so this is a good time for any final comments before the populate and Stream APIs are frozen. I will maybe mark populate experimental, since we don't have any optimised implementation yet.
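For thread readers, a rough usage sketch of what calling populate looks like with the blocking API. This is based on my reading of the Caqti 1.x interface; the module paths, Stream.of_list, and the table/column names are assumptions and may differ between versions:

```ocaml
(* Hypothetical usage sketch: bulk-load rows through populate, which a
   driver may implement with COPY or fall back to row-wise INSERTs. *)
let load (module Db : Caqti_blocking.CONNECTION) rows =
  Db.populate
    ~table:"mytable"
    ~columns:["id"; "name"]
    Caqti_type.(tup2 int string)
    (Caqti_blocking.Stream.of_list rows)
```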

leamingrad commented 5 years ago

Thanks for the heads-up about this. I have a half-baked version that actually uses copy mode for populate, which I will try to finish ASAP, but that definitely should not block a release.

On the populate front, I think the changes are all good - and I am a big fan of the deduplication between the drivers. My main question: does the Make_populate functor need to be optional? If it were compulsory, all new drivers would have it by default, and they could just override the function if they have a better implementation. It's not a big deal either way.

For the Stream APIs, the current version looks fine - would it be worth adding a convenience version of to_stream to the connection signature?

paurkedal commented 5 years ago

Yes, we could include populate in Make_convenience and override it, but I think I'll keep it as is for now for simpler code layout. I marked the whole Caqti_connection module internal, so this can be changed between minor versions if needed.

The reason we didn't include a convenience version of to_stream is that it must be possible to delimit the lifetime of streams, since they use connection state. Each connection can typically only serve a single request-response at a time.

leamingrad commented 5 years ago

That all sounds sensible to me - can I just check I have understood something correctly?

The reason we didn't include a convenience version of to_stream is that it must be possible to delimit the lifetime of streams, since it uses connection state.

I don't quite understand why this is an issue. Is it that a convenience version would end up returning a stream that could be used later in the code, which would be a problem because we would no longer control exactly when the query that generates the stream is executed, whereas with the existing convenience functions like fold the query happens straight away?

If this is the issue, then could a potential convenience function just eagerly evaluate the query and then pass around the resulting objects?

paurkedal commented 5 years ago

Your assessment is right. Even if the underlying libraries supported detaching the result retrieval, I think it would be wrong to rely on GC to release the external resources.

Yes, we could copy the result, but then the user might as well use to_list (optionally followed by to_seq).

leamingrad commented 5 years ago

Cool - thanks for the clarification :+1: