saurabhnanda opened this issue 7 years ago
Hang on... pressed submit, by mistake.
Regarding the `pg_type` query, see:

It seems that this should be a once-per-connection query issued by postgresql-simple. Have you tried issuing the Opaleye query multiple times and taking the incremental time as your benchmark figure, as opposed to the one-off time?
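To separate that one-off cost from steady-state latency, one could time repeated executions on a single connection. A minimal sketch (the helper `timeRuns` is hypothetical, and it assumes a postgresql-simple connection `conn` and an Opaleye query `q` are in scope):

```haskell
import Control.Monad (forM_)
import Data.Time.Clock (diffUTCTime, getCurrentTime)

-- Run an action n times, printing the wall-clock time of each run.
-- The first run should include postgresql-simple's one-off pg_type
-- lookup; the later runs give the incremental, steady-state figure.
timeRuns :: Int -> IO a -> IO ()
timeRuns n act = forM_ [1 .. n] $ \i -> do
  t0 <- getCurrentTime
  _  <- act
  t1 <- getCurrentTime
  putStrLn ("run " ++ show i ++ ": " ++ show (diffUTCTime t1 t0))
```

Usage would then look something like `timeRuns 5 (runQuery conn q)`.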
Okay... I'm done with my monster post!
I should give you notice up front that unless you can convince me that issuing hundreds of thousands of consecutive individual queries with one-row results is something you really need to do then I'm not going to be very inclined to work on this performance issue!
Here are a few things you should check.
The Opaleye query is run through postgresql-simple. The latter seems to inject some additional bookkeeping queries, and you should strike that out of the time taken by Opaleye. We're looking at 14ms for Rails versus 36ms for Opaleye.
You should carefully benchmark the time that Rails and Opaleye take to generate the query string. Optimizing the query will take additional time and I don't want it to make the problem worse.
The single thing that will most encourage me to spend time optimizing Opaleye's query generation is a robust benchmark that I can run each iteration against. The benchmark should break down run time into "query preparation" (converting a `Query ...` into a `String`) and "query execution" (sending the `String` to the database) steps.
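Such a benchmark might be sketched as follows. This is a hedged sketch, not part of the thread: the criterion library usage and the names `connectToTestDb`, `findByPkQuery`, and `executeFindByPk` are illustrative assumptions; `showSql` is the Opaleye renderer used elsewhere in this thread.

```haskell
import Criterion.Main (bench, bgroup, defaultMain, nf, nfIO)

main :: IO ()
main = do
  conn <- connectToTestDb  -- hypothetical: obtain a postgresql-simple connection
  defaultMain
    [ bgroup "query preparation"
        -- nf fully evaluates the generated SQL string, so this times
        -- only the Query -> String step
        [ bench "findByPk" (nf showSql findByPkQuery) ]
    , bgroup "query execution"
        -- runs an already-generated query against the database, so this
        -- times only the round trip
        [ bench "findByPk" (nfIO (executeFindByPk conn)) ]
    ]
```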
> I should give you notice up front that unless you can convince me that issuing hundreds of thousands of consecutive individual queries with one-row results is something you really need to do then I'm not going to be very inclined to work on this performance issue!
Thank you for providing the context about your request-response cycle which does indeed convince me that low latency is important for your application.
At this point the most helpful thing you can do is write the benchmark alluded to in my previous comment.
> I should give you notice up front that unless you can convince me that issuing hundreds of thousands of consecutive individual queries with one-row results is something you really need to do then I'm not going to be very inclined to work on this performance issue!
We moved away from using JOINs via Opaleye because the nesting is more severe with them. However, even with simple lookups by PK, the queries are way slower. Here's the log from a single request/response cycle along with the reason for why those queries are being made:
-- Fetch the currently signed-in user (along with the tenant, in a multi-tenant app). This is a two table INNER JOIN.
5d857097-9c34-4970-950e-d8c02c706b2d | SQL | SELECT "id0_1" as "result1_3", | 2017-10-20 12:49:43.2178 UTC | 0.014055s (cpu: 7.663e-3) | Affected rows: 1
-- Fetch the top-most object required to construct the JSON response (essentially a findByPk)
5d857097-9c34-4970-950e-d8c02c706b2d | SQL | SELECT "id0_1" as "result1_2", | 2017-10-20 12:49:43.233332 UTC | 0.004322s (cpu: 1.313e-3) | Affected rows: 1
-- Fetch additional master data required to construct the JSON response. Again, a 2-table INNER JOIN.
5d857097-9c34-4970-950e-d8c02c706b2d | SQL | SELECT "booking_ref7_1" as "re | 2017-10-20 12:49:43.237991 UTC | 0.017731s (cpu: 6.5179999999999995e-3) | Affected rows: 1
-- Find by PK on the Customer contacts table (for an existing order/booking)
5d857097-9c34-4970-950e-d8c02c706b2d | SQL | SELECT "id0_1" as "result1_2", | 2017-10-20 12:49:43.256715 UTC | 0.007778s (cpu: 5.362e-3) | Affected rows: 1
-- Find the last payment made for this particular order. Conceptually -- SELECT MAX(created_at)... FROM payments WHERE payment.order_id=?
5d857097-9c34-4970-950e-d8c02c706b2d | SQL | SELECT "result0_2" as "result1 | 2017-10-20 12:49:43.264833 UTC | 0.003222s (cpu: 0.0) | Affected rows: 1
-- Find by PK for product-variant of this order (SKU)
5d857097-9c34-4970-950e-d8c02c706b2d | SQL | SELECT "id0_1" as "result1_2", | 2017-10-20 12:49:43.268322 UTC | 0.003107s (cpu: 1.961e-3) | Affected rows: 1
-- Fetch all the pricing-items (invoice line-items) for this order
5d857097-9c34-4970-950e-d8c02c706b2d | SQL | SELECT "id0_1" as "result1_2", | 2017-10-20 12:49:43.271888 UTC | 0.002005s (cpu: 1.051e-3) | Affected rows: 2
-- Fetch the top-most object required to construct the JSON response. This is needlessly being fetched again -- I just realised while typing this out.
5d857097-9c34-4970-950e-d8c02c706b2d | SQL | SELECT "id0_1" as "result1_2", | 2017-10-20 12:49:43.274129 UTC | 0.001993s (cpu: 1.486e-3) | Affected rows: 1
-- Fetch a few records associated with the top-most object
5d857097-9c34-4970-950e-d8c02c706b2d | SQL | SELECT "id0_1" as "result1_2", | 2017-10-20 12:49:43.276375 UTC | 0.002043s (cpu: 9.49e-4) | Affected rows: 4
Does this query pattern seem reasonable for a typical request/response cycle? If the query times are added up, it comes to 56ms, and the response was sent out in 61ms. So, querying the DB efficiently is critical to the kind of webapps we're writing.
A different endpoint written in Rails, which is making 10 SQL queries, is responding in 47.8ms out of which ActiveRecord is taking only 11.7ms! The DB layer is essentially 5x faster.
> You should carefully benchmark the time that Rails and Opaleye take to generate the query string.

Isn't this a foregone conclusion? Query strings generated by Opaleye are easily 40-50x longer than the ones generated by Rails, due to enumerating all column names across multiple levels of nesting.
> The single thing that will most encourage me to spend time optimizing Opaleye's query generation is a robust benchmark that I can run each iteration against. The benchmark should break down run time into "query preparation" (converting a `Query ...` into a `String`) and "query execution" (sending the `String` to the database) steps.
Is nesting the query a necessity? Can't multiple levels of nesting be collapsed in the internal Opaleye AST itself, or while converting the AST to the query-string? Your thoughts?
> You should carefully benchmark the time that Rails and Opaleye take to generate the query string.

> Isn't this a foregone conclusion? Query strings generated by Opaleye are easily 40-50x longer than the ones generated by Rails, due to enumerating all column names across multiple levels of nesting.
I'm not saying you should benchmark them to see who's faster. I'm saying you should benchmark them to see what proportion of Opaleye's time is spent in query generation as opposed to query execution.
> Is nesting the query a necessity?

No.
> Can't multiple levels of nesting be collapsed in the internal Opaleye AST itself, or while converting the AST to the query-string?

Yes.
> Your thoughts?

My thought is I'm not going to implement that until I have a robust benchmark.
> My thought is I'm not going to implement that until I have a robust benchmark.

If I'm going to spend my time doing something, I want to be able to very carefully assess its impact.
> The benchmark should break down run time into "query preparation" (converting a `Query ...` into a `String`) and "query execution" (sending the `String` to the database) steps.
Am I missing something here? Doesn't the data shared above demonstrate that query execution (on the PG side) is a perf-bottleneck? Do you have another hypothesis about query generation (on the Opaleye side) also being a perf-bottleneck?
> Do you have another hypothesis about query generation (on the Opaleye side) also being a perf-bottleneck?
Yes, I suspect query generation is also a bottleneck. It manipulates Haskell `String`s, which are known to perform poorly!
In order to fix this performance issue I need a robust way of getting performance numbers out for both the generation and execution stages.
> In order to fix this performance issue I need a robust way of getting performance numbers out for both the generation and execution stages.
The execution stage would basically be an instrumented version of `Opaleye.runQuery`, right? And generation would be instrumentation over `prepareQuery`, correct?
Yes, I would say split `runQueryExplicit` into two bits and time them separately:

- the `prepareQuery` bit (be careful to make sure the query string is fully evaluated, maybe with `deepseq`)
- the `queryWith_` bit

Where are we on this?
We are wrapping up all docs/fixes that we came across while building our Haskell module. This will be picked up once all of that is over. A few weeks.
I'm planning to pick this up now. How would you like the benchmarking tooling to work? What would you like your dev workflow to be?
I need some function that I can run that will produce performance statistics for the cases you care about. I would like statistics about the run time of query generation and also the run time of query execution. I don't think I mind much more than that.
> I need some function that I can run that will produce performance statistics for the cases you care about. I would like statistics about the run time of query generation and also the run time of query execution. I don't think I mind much more than that.
So, just a function (or bunch of functions) that can be run via GHCi? Will running it via GHCi cause the perf numbers to vary wildly? Do you use byte-code or object-code in GHCi? Will that have any impact?
You can also have a `main` function that wraps those functions into an executable if you like.
So, let's benchmark the following use-cases:
Anything else you (or anyone else?) can think of?
Seems fine to me. We can always add new use-cases later.
Why has `prepareQuery` been marked as deprecated? Which function supersedes it?
It's been deprecated because it's going to be moved into an internal module. It's fine for you to use it in this benchmark harness.
I think in view of #476 this issue should be re-opened. Probably prepare it for Hacktoberfest so that we can tackle this as a community and get it fixed once and for all.
Yes, it would be great to have a solid benchmark suite so that we can see what difference any improvements make and ensure that we don't get performance regressions.
I will look again at the benchmarks in https://github.com/tomjaguarpaw/haskell-opaleye/pull/347
My idea about this is that instead of generating unique column names each time we use a select we instead generate unique table names. That means that instead of
SELECT
"col10_1" as "result1_3",
"col21_1" as "result2_3",
"col10_2" as "result3_3",
"col21_2" as "result4_3"
FROM (SELECT
*
FROM (SELECT
"col1" as "col10_1",
"col2" as "col21_1"
FROM "foo" as "T1") as "T1",
(SELECT
"col1" as "col10_2",
"col2" as "col21_2"
FROM "foo" as "T1") as "T2") as "T1"
we could generate
SELECT
T1."col1" as "result1_3",
T1."col2" as "result2_3",
T2."col1" as "result3_3",
T2."col2" as "result4_3"
FROM "foo" as "T1", "foo" as "T2"
An orthogonal simplification would be to combine clauses where valid. For example, instead of
SELECT
"col10_1" as "result1_2",
"col21_1" as "result2_2"
FROM (SELECT
*
FROM (SELECT
*
FROM (SELECT
*
FROM (SELECT
*
FROM (SELECT
*
FROM (SELECT
"col1" as "col10_1",
"col2" as "col21_1"
FROM "foo" as "T1") as "T1") as "T1"
OFFSET 10) as "T1") as "T1"
LIMIT 20) as "T1") as "T1"
we could generate
SELECT
"col10_1" as "result1_2",
"col21_1" as "result2_2"
FROM (SELECT
*
FROM (SELECT
*
FROM (SELECT
*
FROM (SELECT
*
FROM (SELECT
"col1" as "col10_1",
"col2" as "col21_1"
FROM "foo" as "T1") as "T1") as "T1"
OFFSET 10 LIMIT 20) as "T1") as "T1") as "T1"
With transformations like this we could collapse a lot of the subquery nesting.
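As a toy illustration of the idea (this is a minimal stand-in, not Opaleye's real AST): adjacent `LIMIT`/`OFFSET` wrappers can be folded into a single pair while walking the tree, because `take`/`drop` compose predictably.

```haskell
-- Minimal stand-in AST: a base query wrapped in LIMIT/OFFSET nodes.
data Sql = Base String | Limit Int Sql | Offset Int Sql
  deriving Show

-- Collapse all LIMIT/OFFSET wrappers into at most one of each.
-- `norm q` returns (limit bound if any, total offset, base query),
-- such that applying "LIMIT l OFFSET o" to the base is equivalent to q.
norm :: Sql -> (Maybe Int, Int, Sql)
norm (Base t)     = (Nothing, 0, Base t)
norm (Limit m q)  =
  let (l, o, b) = norm q
  in (Just (maybe m (min m) l), o, b)      -- take m . take n = take (min m n)
norm (Offset n q) =
  let (l, o, b) = norm q
      -- drop n . take k = take (max 0 (k - n)) . drop n
  in (fmap (\k -> max 0 (k - n)) l, o + n, b)
```

For the example from the branch below, `norm (Limit 30 (Offset 10 (Offset 10 (Limit 100 (Base "mytable")))))` evaluates to `(Just 30, 20, Base "mytable")`, i.e. `LIMIT 30 OFFSET 20`.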
There's a nice improvement on the branch mergeLimitOffset (cae29508).
Before:
*Main Opaleye Data.Profunctor.Product Data.String> let q = let mytable = table "mytable" (p2 (requiredTableField "f1", requiredTableField "f2")) :: Table (Field SqlInt4, Field SqlInt4) (Field SqlInt4, Field SqlInt4) in limit 30 $ offset 10 $ offset 10 $ limit 100 (selectTable mytable)
*Main Opaleye Data.Profunctor.Product Data.String> mapM_ putStrLn $ showSql q
SELECT
"f10_1" as "result1_2",
"f21_1" as "result2_2"
FROM (SELECT
*
FROM (SELECT
*
FROM (SELECT
*
FROM (SELECT
*
FROM (SELECT
"f1" as "f10_1",
"f2" as "f21_1"
FROM "mytable" as "T1") as "T1"
LIMIT 100) as "T1"
OFFSET 10) as "T1"
OFFSET 10) as "T1"
LIMIT 30) as "T1"
After:
*Main Opaleye Data.Profunctor.Product Data.String> let q = let mytable = table "mytable" (p2 (requiredTableField "f1", requiredTableField "f2")) :: Table (Field SqlInt4, Field SqlInt4) (Field SqlInt4, Field SqlInt4) in limit 30 $ offset 10 $ offset 10 $ limit 100 (selectTable mytable)
*Main Opaleye Data.Profunctor.Product Data.String> mapM_ putStrLn $ showSql q
SELECT
"f10_1" as "result1_2",
"f21_1" as "result2_2"
FROM (SELECT
*
FROM (SELECT
"f1" as "f10_1",
"f2" as "f21_1"
FROM "mytable" as "T1") as "T1"
LIMIT 30
OFFSET 20) as "T1"
This certainly looks a lot easier to debug! I'm still sceptical it brings performance improvements, but definitely looks worth having
Actually, I take that back - if the pretty printer is slow, then printing less is almost definitely going to be faster!
@saurabhnanda wrote some benchmarks that do show performance improvements when cruft is removed from small queries. I haven't specifically targeted those benchmarks with these improvements but I'm hopeful that something good for performance will come out of it.
Hi, here is another data point of where Opaleye's nested queries make Postgres slower than needed.
We have a work-queue implemented in postgres, with the following code.
It does the equivalent of:
SELECT *
FROM myproject.jobs
WHERE "status_code" = 'pending' AND "queue" = 'myqueue'
ORDER BY "priority" DESC NULLS FIRST,
"attempts" ASC NULLS LAST,
"creation_time" ASC NULLS LAST
In Opaleye:
proc () -> do
row <- orderBy (desc jrPriority <> asc jrAttempts <> asc jrCreationTime) (jobQuery hdl) -< ()
restrict -< jrQueue row .== pgStrictText queue
codeRestrict -< row
returnA -< row
where
codeRestrict = case mbCode of
Nothing -> proc _row -> returnA -< ()
Just code -> proc row -> restrict -< jrStatusCode row .== pgStrictText code
Opaleye-generated query:
SELECT
"id0_1" as "result1_2"
FROM (SELECT
*
FROM (SELECT
*
FROM (SELECT
*
FROM (SELECT
*
FROM (SELECT
*
FROM (SELECT
"id" as "id0_1",
"queue" as "queue1_1",
"priority" as "priority2_1",
"creation_time" as "creation_time3_1",
"payload_json" as "payload_json4_1",
"status_code" as "status_code5_1",
"attempts" as "attempts6_1",
"worker" as "worker7_1",
"start_time" as "start_time8_1",
"end_time" as "end_time9_1",
"last_heartbeat" as "last_heartbeat10_1",
"result_json" as "result_json11_1"
FROM "myproject"."jobs" as "T1") as "T1") as "T1"
ORDER BY "priority2_1" DESC NULLS FIRST,
"attempts6_1" ASC NULLS LAST,
"creation_time3_1" ASC NULLS LAST) as "T1"
WHERE (("status_code5_1") = (CAST(E'pending' AS text))) AND (("queue1_1") = (CAST(E'myqueue' AS text)))) as "T1"
LIMIT 1) as "T1") as "T1"
Performance:
postgres=# EXPLAIN ANALYZE SELECT "id0_1" as "result1_2" FROM (SELECT * FROM (SELECT * FROM (SELECT * FROM (SELECT * FROM (SELECT * FROM (SELECT "id" as "id0_1", "queue" as "queue1_1", "priority" as "priority2_1", "creation_time" as "creation_time3_1", "payload_json" as "payload_json4_1", "status_code" as "status_code5_1", "attempts" as "attempts6_1", "worker" as "worker7_1", "start_time" as "start_time8_1", "end_time" as "end_time9_1", "last_heartbeat" as "last_heartbeat10_1", "result_json" as "result_json11_1" FROM "myproject"."jobs" as "T1") as "T1") as "T1" ORDER BY "priority2_1" DESC NULLS FIRST, "attempts6_1" ASC NULLS LAST, "creation_time3_1" ASC NULLS LAST) as "T1" WHERE (("status_code5_1") = (CAST(E'pending' AS text))) AND (("queue1_1") = (CAST(E'myqueue' AS text)))) as "T1" LIMIT 1) as "T1") as "T1";
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on "T1" (cost=12574.94..12574.96 rows=1 width=16) (actual time=34.448..34.449 rows=1 loops=1)
-> Limit (cost=12574.94..12574.95 rows=1 width=220) (actual time=34.448..34.448 rows=1 loops=1)
-> Subquery Scan on "T1_1" (cost=12574.94..12800.85 rows=18073 width=220) (actual time=34.446..34.446 rows=1 loops=1)
-> Sort (cost=12574.94..12620.12 rows=18073 width=220) (actual time=34.445..34.445 rows=1 loops=1)
Sort Key: "T1_2".priority DESC, "T1_2".attempts, "T1_2".creation_time
Sort Method: quicksort Memory: 3201kB
-> Index Scan using jobs_queue_status_code_idx on jobs "T1_2" (cost=0.42..10173.63 rows=18073 width=220) (actual time=0.031..15.184 rows=31133 loops=1)
Index Cond: ((queue = 'myqueue'::text) AND (status_code = 'pending'::text))
Planning time: 0.240 ms
Execution time: 34.646 ms
Simplifying the query by removing Opaleye's nesting improves performance 2x (34 ms -> 18 ms), removing the `Subquery Scan`:
SELECT
"id",
"queue",
"priority",
"creation_time",
"payload_json",
"status_code",
"attempts",
"worker",
"start_time",
"end_time",
"last_heartbeat",
"result_json"
FROM "myproject"."jobs"
WHERE ("status_code" = (CAST(E'pending' AS text))) AND (("queue") = (CAST(E'myqueue' AS text)))
ORDER BY "priority" DESC NULLS FIRST,
"attempts" ASC NULLS LAST,
"creation_time" ASC NULLS LAST
LIMIT 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=10264.00..10264.00 rows=1 width=1106) (actual time=18.530..18.531 rows=1 loops=1)
-> Sort (cost=10264.00..10309.18 rows=18073 width=1106) (actual time=18.530..18.530 rows=1 loops=1)
Sort Key: priority DESC, attempts, creation_time
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using jobs_queue_status_code_idx on jobs (cost=0.42..10173.63 rows=18073 width=1106) (actual time=0.030..12.072 rows=31133 loops=1)
Index Cond: ((queue = 'myqueue'::text) AND (status_code = 'pending'::text))
Planning time: 0.099 ms
Execution time: 18.553 ms
Also note how the sort changed:
-Sort Method: quicksort Memory: 3201kB
+Sort Method: top-N heapsort Memory: 25kB
@nh2 Can you also share the PostgreSQL version this was run on?
PostgreSQL 9.6 with recommended settings for SSD (changing to `random_page_cost = '1.1'`, which can affect query planner results).
Thanks. That's a fair few major versions back now, I wonder what PG 14 would make of the above query. https://dbfiddle.uk and friends might be able to help here, though it's a bit of work. (I'm also not saying "oh just upgrade" - obviously if we can fix this for PG 9.6, we should).
Thanks for the report @nh2. Could you please check that the performance degradation is not due to the relative placement of `ORDER BY` and `WHERE`? Specifically, in

orderBy (desc jrPriority <> asc jrAttempts <> asc jrCreationTime) (jobQuery hdl)

could you please put the `restrict` inside the `orderBy`, wrapping the `(jobQuery hdl)`?
If that doesn't recapture the missing performance then I will investigate how to remove these nested queries to see if that does the job. I'm not sure how I can most effectively proceed though, given that I have neither Postgres 9.6 (Debian stable is on 13.7) nor any data to run the query on. Suggestions welcome.
Oh, I see the plan does the `Limit` before the `Sort` in any case, so perhaps the hope of improvements from that is forlorn. In any case I would be grateful if you could try out my suggestion, just to confirm.

EDIT: Hmm, I don't mean `Limit` before `Sort`, I mean `Index Scan` before `Sort`.
@tomjaguarpaw I can try to do that, but it may take some time.
I should also be able to try the same query on a local Postgres 13 or 14.
> neither Postgres 9.6
That one is easy with Nix:
$ nix-shell -p postgresql_9_6 --run 'postgres --version'
postgres (PostgreSQL) 9.6.24
OK, I understand what is happening in the `EXPLAIN ANALYZE`s. In the handwritten version it knows it can get away with a top-N heapsort, because it knows it is subsequently taking `LIMIT 1`. In the Opaleye version it uses quicksort, and (presumably) sorts the whole table, before discarding almost all of it.
I have to say, I think this is a weakness of Postgres, and I wouldn't be surprised if it were fixed in a version later than 9.6 (the first version in the 9.6 series was released nearly 6 years ago). So please check with:

- a version of Postgres later than 9.6
- the `restrict` inside the `orderBy` (wrapping `(jobQuery hdl)`)

In principle I'm willing to help try to fix this regardless, but I shall certainly be more motivated to do so if it's still bad when you try one, or especially both, of the above.
I believe that upgrading to PG 14 won't fix this problem: https://dbfiddle.uk/?rdbms=postgres_14&fiddle=943a8f083302602b734c290242066432
Here's a bit more info: https://dbfiddle.uk/?rdbms=postgres_14&fiddle=4a1eb1d20a03aa3ca0cc8ca1bd17d000
It looks like the important thing is that the `LIMIT` and the `ORDER` are combined.
Hmm, the "optimal" version there uses quicksort though (and they all have the same predicted "cost", but I'm not sure that's meaningful without data in the table).
I was more looking at the overall plan and when a subquery scan happened. I agree populating the table will probably matter too
Does anyone understand what a subquery scan actually is? According to etutorials.org:

> Each of these operators [subquery scan and subplan] are used for internal bookkeeping purposes and really don't affect the overall query plan; you can usually ignore them.

So do they actually have any run time effect?
This thread on the pgsql-performance mailing list seems to refer to a similar thing, and the reply notes some limitations on rewriting queries.
Related to #334 & #284
Some comparative numbers while running the same "fetch by PK" type of query in Rails/ActiveRecord and Opaleye. The query generated by Opaleye is ~3-8x slower (compare the query timing from PG logs for the 3x difference; the 8x difference is observed as soon as a similar query is re-fired -- it seems that bind/execute, which Rails is firing, caches better at the PG level than what Opaleye is firing).
Rails/ActiveRecord
Timing from PG Logs: 14.046ms (total of parse, bind, & execute step)
Opaleye
Timing from a wrapper written on top of Opaleye.runQuery: 53.026ms
Timing from PG Logs: 40.622ms (total of all three queries given below)
Questions to ponder:

- Can a `SELECT - FROM - WHERE - ORDER - LIMIT/OFFSET` pipeline be applied in a single SQL query without nesting? The same question can be extended to JOINs (#284)

Conclusion
For an RDBMS-backed webapp, which is making 4-10 queries in every request/response cycle, this performance degradation adds up really quickly. Similar endpoints written in Rails are responding in under 80ms, while the average response time in Haskell is about 250-300ms, most of which can be attributed to the queries being generated by Opaleye.

If this isn't fixed, Opaleye's killer feature (query composition) ends up being a great PoC, but unusable in production. What can we do to fix this?