kysely-org / kysely

A type-safe TypeScript SQL query builder
https://kysely.dev

Allow users to choose between `json` and `jsonb` in Postgres helpers #395

Closed Β· sds closed 1 year ago

sds commented 1 year ago

Very excited that the JSON-related helpers added in 7215cc3c4df2a96eedd68d409f63b3350d159c58 now come built into Kysely!

One potential enhancement would be to include json variants alongside the jsonb forms of the helpers.

Depending on what you need the result for, it can be significantly more efficient to use json (especially if you are returning a large result set), since Postgres doesn't need to do the extra work of constructing the jsonb tree.

Once you start JSONifying over 100 rows (or wide result sets with many columns), the performance difference becomes noticeable.

Happy to add this if you are supportive. The one point worth clarifying is whether we are willing to change the names of the existing helpers (e.g. jsonObjectFrom) to have jsonb in the name (e.g. jsonbObjectFrom), so that the json versions can be easily differentiated.

koskimas commented 1 year ago

I wasn't aware of any noticeable speed difference between them. In that case, is there actually ever a good reason to use jsonb in those helpers? jsonb makes sense as a column type, but as a temporary type, we might just ditch it in favor of json? I'm not a big fan of adding both versions.

sds commented 1 year ago

TL;DR: Both are worth supporting, but if you had to choose, json is the better choice for performance reasons and typical use cases.


The answer is nuanced, but if the intended purpose is to return the result to a client (and not to extract JSON properties for comparison in some other query), then json is usually all you need.

If you need to extract values to join on or otherwise compare against, then jsonb is the right choice, because any time Postgres actually needs to parse JSON (for example, whenever you use the JSON operators) it has to do the work of constructing the jsonb tree anyway.
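
For example, a quick sketch against a hypothetical orders table with a JSON metadata column (the names are made up, not from the helpers):

-- The ->> operator works on both json and jsonb, but either way Postgres has
-- to parse the document to pull the value out, which is roughly the work of
-- building the jsonb tree:
select id from orders where metadata ->> 'status' = 'shipped';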

Here's a brief summary, if helpful:

| Purpose | json | jsonb | Comments |
| --- | --- | --- | --- |
| Remove duplicate keys | | βœ… | There may be cases where you want to preserve the underlying data regardless of whether or not it is valid, and so json may still be preferred. |
| Prevent invalid Unicode surrogate pairs | | βœ… | See the comments about RFC 7159 here, but if you're taking data from an external source and dropping it straight into a json column, you run the risk of surprising errors when you later attempt to parse the stored value into jsonb. This is a reason why it's usually a good idea to use jsonb as the column's type, unless you know that you're never going to query anything in the JSON document and will always return the JSON string as-is. (See the snippet just below the table.) |
| Extract property for comparison in query | | βœ… | If you need to pull out a value for any reason, you might as well return it as jsonb, since that's what Postgres will do in order to extract the value, e.g. `WHERE json_column ->> 'property' = '...'` |
| You just want a JSON string | βœ… | | If none of the above scenarios apply and you know you just want JSON sent to the client, then json is going to be more performant. |
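
To make the surrogate-pair row concrete, here's a sketch (the literal is only illustrative, and the exact error text varies by Postgres version):

-- json only checks that the \uXXXX escape is syntactically well formed:
select '"\ud83d"'::json;   -- accepted as-is
-- jsonb validates the escape, so a lone high surrogate is rejected:
select '"\ud83d"'::jsonb;  -- expected to raise an error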

Note: if you already know the shape of the JSON you want to output, it's even faster to concatenate strings manually (think fast-json-stringify). Compare:

select json_agg(agg) from (select a, b from table limit 10000) as agg;

vs:

select '{"a":' || a || ',"b":"' || b || '"}' from table limit 10000;

The latter is faster, though it assumes none of your string values contain characters that need escaping (e.g. `"`). In those cases you could wrap the string columns with to_json if you wanted complete safety, but the more you do this, the less performance benefit you'll see.

select '{"a":' || a || ',"b":' || to_json(b) || '}' from table limit 10000;

Let me know if I can clarify anything further.

koskimas commented 1 year ago

Yes, I'm aware of the differences between json and jsonb. It's just surprising to me that building the json object tree would take any meaningful amount of time.

I don't think it'd make much sense to use jsonArrayFrom and jsonObjectFrom as part of an expression that extracts properties from the result (or in any other expression, for that matter). They mostly make sense when selecting their results as columns directly. Therefore it could be OK to just use the json variants in them.

koskimas commented 1 year ago

Actually, I'll write some benchmarks myself to see what kind of difference we are dealing with here. "A is faster than B" means nothing if the difference is one millionth of the overall query execution time.

sds commented 1 year ago

Would be curious to see your results, but the differences are definitely noticeable for us running on Postgres 14. I'm not aware of any major optimizations introduced in 15.

json_agg

indexer_prod> explain analyze select json_agg(agg) from (select a, b from table limit 10000) as agg;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=1713.21..1713.22 rows=1 width=32) (actual time=16.918..16.919 rows=1 loops=1)
  ->  Subquery Scan on agg  (cost=0.00..1688.21 rows=10000 width=99) (actual time=0.012..7.635 rows=10000 loops=1)
        ->  Limit  (cost=0.00..1588.21 rows=10000 width=75) (actual time=0.006..5.441 rows=10000 loops=1)
              ->  Seq Scan on table  (cost=0.00..117911.17 rows=742417 width=75) (actual time=0.005..4.623 rows=10000 loops=1)
Planning Time: 0.766 ms
Execution Time: 17.279 ms

jsonb_agg

indexer_prod> explain analyze select jsonb_agg(agg) from (select a, b from table limit 10000) as agg;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=1713.21..1713.22 rows=1 width=32) (actual time=26.342..26.343 rows=1 loops=1)
  ->  Subquery Scan on agg  (cost=0.00..1688.21 rows=10000 width=99) (actual time=0.010..8.686 rows=10000 loops=1)
        ->  Limit  (cost=0.00..1588.21 rows=10000 width=75) (actual time=0.006..6.446 rows=10000 loops=1)
              ->  Seq Scan on table  (cost=0.00..117911.17 rows=742417 width=75) (actual time=0.006..5.616 rows=10000 loops=1)
Planning Time: 0.100 ms
Execution Time: 27.632 ms

Manual string concatenation

indexer_prod> explain analyze select '{"a":' || a || ',"b":' || to_json(b) || '}' from table limit 10000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..1813.21 rows=10000 width=32) (actual time=0.010..12.224 rows=10000 loops=1)
  ->  Seq Scan on table  (cost=0.00..134615.55 rows=742417 width=32) (actual time=0.009..11.392 rows=10000 loops=1)
Planning Time: 0.154 ms
Execution Time: 12.682 ms

koskimas commented 1 year ago

That's 10 milliseconds per 10000 rows, so 1 microsecond per row πŸ˜„. If those kinds of differences are meaningful in your system, why are you using node and not, for example, go? Isn't everything else you do with those 10k rows going to dominate that 10 milliseconds 10 times over? Like parsing the DB result in node and serializing it back to JSON for the network? Not to mention if you do any kind of operations on those 10k rows.

sds commented 1 year ago

I wasn't trying to suggest that this only matters with thousands of rows.

There are many situations where you are selecting more than two columns (the example above used two for the sake of simplicity). Meaningful (millisecond) differences in execution time can show up with under 100 rows (i.e. the size of a typical page in a response), but it's highly dependent on your data set.

json_agg

explain analyze select json_agg(agg) from (select * from table_a join table_b on table_a.id = table_b.id limit 100) as agg;
...
Execution Time: 2.375 ms

jsonb_agg

explain analyze select jsonb_agg(agg) from (select * from table_a join table_b on table_a.id = table_b.id limit 100) as agg;
...
Execution Time: 5.219 ms

Note: select * is to keep the example simpleβ€”I'm not suggesting anyone do this in their queries.

All else being equal, if performance is relevant, you'd rather use json than jsonb, especially if you are implementing the pattern discussed in the relations recipe (awesome pattern), where you might have multiple relations all combined into a single result, roughly the shape sketched below.
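
Roughly the shape I mean, as a sketch with hypothetical person/pet/toy tables (not the exact SQL the helpers emit): every relation adds another aggregate that builds a JSON document per row, so the json vs jsonb cost compounds.

select
  person.id,
  person.first_name,
  -- one JSON column per relation, each built by a correlated aggregate
  (select json_agg(json_build_object('id', pet.id, 'name', pet.name))
     from pet
    where pet.owner_id = person.id) as pets,
  (select json_agg(json_build_object('id', toy.id, 'name', toy.name))
     from toy
    where toy.owner_id = person.id) as toys
from person
limit 100;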

koskimas commented 1 year ago

My point was more this: with 10k rows (or 100 rows with a huge number of columns), other parts of the request will also be slow. The total time of a request consists of (at least) these parts:

  1. Query execution time (what you are measuring)
  2. Time it takes for node to parse the result from the DB (JSON.parse + rest of the stuff)
  3. Time it takes to serialize the result for the response (JSON.stringify)
  4. Network delays (although this is a separate issue)

I'd guess JSON.parse and JSON.stringify are at least as slow as the query. So by saving 2ms on the query, you're actually only making a small part of the request a tiny bit faster. Overall, the users probably won't notice a difference.

Having said all that, if there is absolutely zero reason to keep jsonb, we can switch to json. But even the tiniest reason to keep jsonb will be more important than this performance difference.

koskimas commented 1 year ago

Switched to using json instead of jsonb in the helpers.

virtuallyunknown commented 3 months ago

This may or may not be worth your attention (it's probably a niche case?), but I've discovered that .distinct() will fail here, because you can't use DISTINCT with json_build_object (json has no equality operator), but you can with jsonb_build_object.

-- βœ… works

SELECT
  json_agg(DISTINCT jsonb_build_object('id', "players"."id")) AS "whatever"
FROM
  "players"

-- ❌ error: could not identify an equality operator for type json

SELECT
  json_agg(DISTINCT json_build_object('id', "players"."id")) AS "whatever"
FROM
  "players"

https://kyse.link/bTzVI

A super easy workaround here is to just grab the helper from the source code and modify it to use jsonb_build_object (you only have to add one extra character).