paurkedal / ocaml-caqti

Cooperative-threaded access to relational data
https://paurkedal.github.io/ocaml-caqti/index.html
GNU Lesser General Public License v3.0

Retrieving large amounts of data from a database #24

Open leamingrad opened 5 years ago

leamingrad commented 5 years ago

Firstly, thanks for a great package, caqti has proved to be super-useful for handling database interactions in ocaml.

I'm currently trying to work out how to efficiently extract large amounts of data (multiple GB) from a postgresql database, which I then want to fold over using lwt. Unfortunately, naively using a single SELECT query and fold_s does not work: caqti loads all of the data into memory, which causes my machine to run out of RAM.

I have had a look through the postgresql documentation, and it seems like there are two ways to do this (if you can think of others I would be happy to hear them):

I could potentially implement the OFFSET/LIMIT behaviour in my application, but that would mean that I could no longer make use of the fold_s helper function, which is very useful IMO. Similarly, it probably does not make sense for caqti to implement this sort of batching, as the API does not provide enough detail about the SQL statements being executed to make sure that adding OFFSET and LIMIT parameters would not conflict.

Do you think it would be worth adding the single row mode behaviour to caqti when using the postgres driver - either by default or (more likely) as an option?

I would be happy to put in a PR to add this function.

paurkedal commented 5 years ago

I think your assessment is right. I am considering adding streaming retrieval for the PostgreSQL driver, but I haven't looked into the details yet. On the other hand, the OFFSET/LIMIT solution is best done in the application. It should be possible to compose it with fold_s if you first determine the number of batches:

let rec loop batch acc =
  if batch >= batch_count then Lwt.return_ok acc else
  Db.fold_s request f (params, batch) acc >>=? loop (batch + 1)
in
loop 0

where Db is the connection module and (>>=?) is the combined bind for lwt and result monads.
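A minimal sketch of such a combined bind, assuming Lwt (the operator name follows the comment above):

```ocaml
(* Combined bind for the Lwt and result monads: run the promise,
   continue with f on Ok, and short-circuit on Error. *)
let ( >>=? ) m f =
  Lwt.bind m (function
    | Ok x -> f x
    | Error _ as e -> Lwt.return e)
```

With this operator, each Db.fold_s (which returns a result inside an Lwt promise) chains directly into the recursive call for the next batch, stopping at the first error.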

leamingrad commented 5 years ago

Thanks for the prompt reply. I have had a go at implementing the batching solution in my application and it seems to work quite nicely.

I would still be happy to implement single row mode as it would be useful going forwards, particularly when using the stream functions added by #22.

I haven't looked into it in detail, but my guess would be that there are three main ways this could be added:

  1. Write a second postgresql driver that just uses single row mode for every query, and can then be selected at connection time.
  2. Provide a parameter when building a Caqti_request.t to indicate that it should be executed in single row mode.
  3. Provide parameters to the fold_s, iter_s and to_stream methods to indicate that single row mode should be used.

Do you have a preference for which way would be best? 1 seems like it might be overkill, but it would mean that the other drivers do not have to care about the single row mode config. I think that 2 and 3 would both work, but I think 2 might be a cleaner API.

paurkedal commented 5 years ago

A PR would be much appreciated, and discussing the details now is already useful even if I implement it myself later.

I agree that 1 is overkill, but see my option 5 below, which would provide the same functionality from the user's point of view.

Your option 2 offers selection per query, while 3 offers selection per query and per set of query parameters. Both get a bit invasive on the API, since the parameter must be duplicated across convenience functions, but I think we can find an acceptable solution. I have a slight preference for 3 due to the finer granularity, supplementing the Caqti_connection_sig API (call, iter_s, fold, fold_s) with an optional flag, but let's discuss the details if we go with this.

I think there are two other options for selecting single-row-mode, numbered to continue the list above:

  4. Always use single-row-mode when the multiplicity is `Many.
  5. Let the driver accept a query parameter in the URI which activates single-row-mode when the multiplicity is `Many.

These are nice from an API point of view, but if the segmentation of transfers implied by single-row-mode has a significant impact on performance, having better granularity would be good. Option 5 would allow exploring performance impact, as well as allowing the sysadmin to configure the mode according to DB size vs memory size.

So, I think one way forward would be to implement 5, and supplement with 2 or 3 only when/if needed, in which case the URI parameter would act as a default.

leamingrad commented 5 years ago

5 sounds sensible to me. Just to check: you mean that the user could pass a connection URL that looks like:

postgresql://localhost?single_row_mode=true

to enable single row mode for queries with `Many multiplicity. This seems fine from a usability point of view, although it would have to be documented carefully somewhere.

Looking at the documentation here, my understanding is that the parameter would be ignored by other clients, which would merely print a warning to stderr if the URL were accidentally passed somewhere else. I think this is acceptable behaviour.

paurkedal commented 5 years ago

Yes, this was my thought precisely. I see your point about the URI being reused; that could even happen if another part of a project uses the same configuration parameter to connect directly. But I think we can leave it up to the application to split the URI into two parts, e.g. like

db_uri: "postgresql://localhost/foo?connect_timeout=10"
db_caqti_options: "?single_row_mode=true"

and merge in the second query string before passing the URI to the Caqti connector.
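As a sketch of that merging step, assuming the uri library (the function name and labelled arguments are illustrative, mirroring the configuration keys above):

```ocaml
(* Sketch: merge an extra query string into a base URI before
   passing the result to the Caqti connector. Assumes the uri
   library; the values mirror the configuration example above. *)
let merge_caqti_options ~db_uri ~db_caqti_options =
  let base = Uri.of_string db_uri in
  let extra = Uri.of_string db_caqti_options in
  Uri.with_query base (Uri.query base @ Uri.query extra)

(* merge_caqti_options
     ~db_uri:"postgresql://localhost/foo?connect_timeout=10"
     ~db_caqti_options:"?single_row_mode=true"
   yields a URI carrying both query parameters. *)
```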

The sqlite3 driver already has some facilities to decode query parameters, like get_uri_bool, which could be moved to Caqti_driver_lib.
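For illustration, a get_uri_bool-style helper could look like the following; this is a hypothetical reconstruction, not the actual code from the sqlite3 driver, and assumes the uri library:

```ocaml
(* Hypothetical sketch of a get_uri_bool-style helper: read an
   optional boolean query parameter from an endpoint URI,
   rejecting values other than "true" and "false". *)
let get_uri_bool uri name =
  match Uri.get_query_param uri name with
  | None -> Ok None
  | Some "true" -> Ok (Some true)
  | Some "false" -> Ok (Some false)
  | Some bad ->
      Error (Printf.sprintf "Invalid value %S for boolean parameter %s"
               bad name)
```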

At the moment the URI documentation is in the mli files of the drivers, though it would be good to make it more visible. Candidate places to document it would be Caqti_connect_sig or a separate .mld file.

vog commented 4 years ago

I'd like to argue in favor of point "4.":

  4. Always use single-row-mode when the multiplicity is `Many.

The main concern is:

if the segmentation of transfers implied by single-row-mode has a significant impact on performance, having better granularity would be good.

Other database client libraries in other languages usually solve this with a batch mode. That is, they do perform streaming, but fetch, say, 100 rows at once before further processing (or 128 rows if you prefer round numbers).

This is usually configured not by enabling/disabling streaming completely, but by allowing the batch size to be specified on a per-query basis.

In practice I've never found a reason to adjust those defaults. I did perform experiments and measurements, but never found a batch size that performs significantly better than the defaults. However, in theory tuning might be desirable if you fetch huge rows, e.g. if every single row is already 100 MiB.

paurkedal commented 4 years ago

Do you know how batches are implemented in those client libraries? Using OFFSET and LIMIT? The single-row mode in PostgreSQL seems to be limited to sending a single tuple at a time. I agree 4 is an attractive option, and 5 is useful for experimenting with the performance impact; I think we want single-row mode by default in either case.

vog commented 4 years ago

Oh, I'm sorry, I just realized that I confused reading and writing. I was describing batch inserts, where the inserts are not executed one-by-one, but also not all at once, but instead in batches of around 100 rows. Sorry for the noise!

I'm curious whether streaming reads will actually have a performance impact or not. In interpreted languages they usually do, but that doesn't tell us anything about OCaml. In the best case there is no impact, and we can just fully switch to streaming in the future.

paurkedal commented 4 years ago

@vog Did they do any benchmarking to pin down the bottleneck? I would think, at least for compiled languages, the slowdown would be due to fragmenting the data into individual packets, rather than in the application itself. If so, all solutions based on the same client library would have this issue.

paurkedal commented 4 years ago

@vog BTW, we are also implementing streaming in the other direction in #26. So far, only generic support, but the libpq API, at least, supports buffering data independent of row-boundaries.

vog commented 4 years ago

@paurkedal Sorry, I did not pin down the bottleneck, I was just tuning parameters. But when I talked about other programming languages, I had mostly interpreted languages in mind (Python, Ruby, PHP, Perl), so those would probably not be good for comparison anyway.

paurkedal commented 4 years ago

What I found on the net about libpq single-row mode performance [1, 2] suggests a slowdown of a factor of 2, for presumably efficient C code running against localhost. I am not sure how this compares to the following benchmarks, since we don't have the absolute row rate from the C benchmarks.

This is less relevant to this issue, but I benchmarked the stream implementation against other options:

Throughputs for "List", "Seq", "Stream_blocking", "Stream_lwt" each running for at least 30 CPU seconds:
           List: 30.13 WALL (30.12 usr +  0.01 sys = 30.12 CPU) @ 3838.26/s (n=115626)
            Seq: 31.41 WALL (31.39 usr +  0.00 sys = 31.39 CPU) @ 1878.52/s (n=58965)
Stream_blocking: 31.62 WALL (31.61 usr +  0.00 sys = 31.61 CPU) @ 1190.19/s (n=37617)
     Stream_lwt: 31.60 WALL (31.58 usr +  0.00 sys = 31.58 CPU) @ 571.30/s (n=18042)

                  Rate    Stream_lwt Stream_blocking           Seq          List
     Stream_lwt  571/s            --            -52%          -70%          -85%
Stream_blocking 1190/s          108%              --          -37%          -69%
            Seq 1879/s          229%             58%            --          -51%
           List 3838/s          572%            222%          104%            --

I committed the code; use dune exec tests/bench_stream.exe to reproduce. The benchmark merely sums up 100 000 floats from a list, seq, or stream, so the computation done per element is minimal. As can be seen, the slowdown compared to iteration over a list is a factor of 6.7, of which a factor of 2 is due to concurrency and the remaining 3.3 is due to the stream. I expect the performance ratios to be smaller in the real case, since we are also encoding/decoding the tuples and often processing rows wider than a single float.

paurkedal commented 1 year ago

Commit 60f2e01736d9db87494d89da37e2142366a08419 implements single-row mode for multi-row requests when use_single_row_mode=true is passed to the endpoint URI. The performance isn't good though:

╭──────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name          │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├──────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  fetch-many  │          1345.6000 mjw/run│      17324808.6000 mnw/run│       68667869.0727 ns/run│
╰──────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

for normal mode vs

╭──────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name          │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├──────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  fetch-many  │          5347.0000 mjw/run│      36301415.0000 mnw/run│     6200377141.0000 ns/run│
╰──────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

for single-row mode, when using a local socket connection. The test bench_fetch_many.ml.txt can be compiled with

(executable
 (name bench_fetch_many)
 (modules bench_fetch_many)
 (libraries
    bechamel bechamel-notty
    caqti caqti.blocking caqti-driver-postgresql
    notty.unix))

paurkedal commented 1 year ago

It looks like the inefficiency is due to polling: with the following (incorrect) replacement, single-row mode is within a factor-2 slowdown compared to normal mode:

diff --git a/caqti-driver-postgresql/lib/caqti_driver_postgresql.ml b/caqti-driver-postgresql/lib/caqti_driver_postgresql.ml
index a2712b6..d2ece8b 100644
--- a/caqti-driver-postgresql/lib/caqti_driver_postgresql.ml
+++ b/caqti-driver-postgresql/lib/caqti_driver_postgresql.ml
@@ -381,18 +381,8 @@ module Connect_functor (System : Caqti_platform_unix.System_sig.S) = struct
        | exception Pg.Error msg -> return (Error msg)
        | socket -> Unix.wrap_fd aux (Obj.magic socket))

-    let get_next_result ~uri ~query db =
-      let rec retry fd =
-        db#consume_input;
-        if db#is_busy then
-          Unix.poll ~read:true fd >>= (fun _ -> retry fd)
-        else
-          return (Ok db#get_result)
-      in
-      try Unix.wrap_fd retry (Obj.magic db#socket)
-      with Pg.Error err ->
-        let msg = extract_communication_error db err in
-        return (Error (Caqti_error.request_failed ~uri ~query msg))
+    let get_next_result ~uri:_ ~query:_ db =
+      return (Ok db#get_result)

     let get_one_result ~uri ~query db =
       get_next_result ~uri ~query db >>=? function