Closed by sds 1 year ago
I wasn't aware of any noticeable speed difference between them. In that case, is there actually ever a good reason to use `jsonb` in those helpers? `jsonb` makes sense as a column type, but as a temporary type, we might just ditch it in favor of `json`? I'm not a big fan of adding both versions.
TL;DR: Both are worth supporting, but if you had to choose, `json` is the better choice for performance reasons and typical use cases.

The answer is nuanced, but if the intended purpose is to return the result to a client (and not extract JSON properties for comparison in some other query) then `json` is usually all you need. If you need to extract values to join or otherwise compare against, then `jsonb` is necessary, because any time Postgres actually needs to parse JSON (for example whenever using the JSON operators) it needs to do the work to construct the `jsonb` tree anyway.
Here's a brief summary if helpful:

| Purpose | `json` | `jsonb` | Comments |
|---|---|---|---|
| Remove duplicate keys | | ✓ | May be cases where you want to preserve the underlying data regardless of whether or not it is valid, and so `json` may still be preferred. |
| Prevent invalid Unicode surrogate pairs | | ✓ | See the comments about RFC 7159 here, but if you're taking data from an external source and dropping it straight into a `json` column, you run the risk of surprising errors when you later attempt to parse the stored value into `jsonb`. This is a reason why it's usually a good idea to use `jsonb` as the column's type, unless you know that you're never going to query anything in the JSON document and will always return the JSON string as-is. |
| Extract property for comparison in query | | ✓ | If you need to pull out a value for any reason, then you might as well return it as `jsonb` since that's what Postgres will do in order to extract the value, e.g. `WHERE json_column ->> 'property' = '...'` |
| You just want a JSON string | ✓ | | If none of the above scenarios apply, and you know you just want JSON sent to the client, then `json` is going to be more performant. |
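To make the surrogate-pair point above concrete, here's a small Python sketch (Python's `json` module standing in for a lenient parser) of how a lone surrogate escape can be accepted on the way in but blow up later — roughly the failure mode you risk when storing unvalidated input in a `json` column and converting to `jsonb` afterwards:

```python
import json

# A lone (unpaired) surrogate escape. A lenient JSON parser accepts it,
# but the resulting value is not representable as UTF-8, which is roughly
# why Postgres rejects it when converting json -> jsonb.
lone_surrogate = '"\\ud800"'

decoded = json.loads(lone_surrogate)  # CPython's json module lets this through
assert decoded == '\ud800'

# ...but the value cannot round-trip through UTF-8:
try:
    decoded.encode('utf-8')
    encodable = True
except UnicodeEncodeError:
    encodable = False
assert not encodable
```

The asymmetry is the trap: validation that only happens at read/convert time means the error surfaces far from where the bad data entered.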
Note: if you already know the shape of the JSON you want to output, it's even faster to concatenate strings manually (think fast-json-stringify):

```sql
select json_agg(agg) from (select a, b from table limit 10000) as agg;
```

vs:

```sql
select '{"a":' || a || ',"b":"' || b || '"}' from table limit 10000;
```

The latter is faster, though it assumes you don't have any string values with characters that need escaping, e.g. `"`. In those cases, you could wrap string columns with `to_json` if you wanted complete safety, but the more you do this the less performance benefit you'll see.

```sql
select '{"a":' || a || ',"b":' || to_json(b) || '}' from table limit 10000;
```
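For anyone wondering why the naive concatenation is unsafe, here's a minimal Python sketch of the same idea (hypothetical row values; `json.dumps` plays the role `to_json` plays in the SQL above):

```python
import json

# Naive concatenation, mirroring the SQL string-building above.
def naive_row(a, b):
    return '{"a":' + str(a) + ',"b":"' + b + '"}'

# Breaks as soon as a value contains a quote: the result is not valid JSON.
broken = naive_row(1, 'say "hi"')
try:
    json.loads(broken)
    valid = True
except json.JSONDecodeError:
    valid = False
assert not valid

# Escaping the value (as to_json would) keeps the output parseable.
safe = '{"a":' + str(1) + ',"b":' + json.dumps('say "hi"') + '}'
assert json.loads(safe) == {"a": 1, "b": 'say "hi"'}
```

Which is exactly the trade-off stated above: every column you route through the escaping path claws back safety at the cost of some of the speedup.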
Let me know if I can clarify anything further.
Yes, I'm aware of the differences between `json` and `jsonb`. It's just surprising to me that building the JSON object tree would take any meaningful amount of time.
I don't think it'd make much sense to use `jsonArrayFrom` and `jsonObjectFrom` as part of an expression that extracts properties from the result (or any other expression for that matter). They mostly make sense when selecting their results as columns directly. Therefore it could be ok to just use `json` variants in them.
Actually I'll write some benchmarks myself to see what kind of difference we are dealing with here. "A is faster than B" means nothing if the difference is one millionth of the overall query execution time.
Would be curious to see your results, but the differences are definitely noticeable for us running on Postgres 14. I'm not aware of any major optimizations introduced in 15.
`json_agg`:

```
indexer_prod> explain analyze select json_agg(agg) from (select a, b from table limit 10000) as agg;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1713.21..1713.22 rows=1 width=32) (actual time=16.918..16.919 rows=1 loops=1)
   ->  Subquery Scan on agg  (cost=0.00..1688.21 rows=10000 width=99) (actual time=0.012..7.635 rows=10000 loops=1)
         ->  Limit  (cost=0.00..1588.21 rows=10000 width=75) (actual time=0.006..5.441 rows=10000 loops=1)
               ->  Seq Scan on table  (cost=0.00..117911.17 rows=742417 width=75) (actual time=0.005..4.623 rows=10000 loops=1)
 Planning Time: 0.766 ms
 Execution Time: 17.279 ms
```
`jsonb_agg`:

```
indexer_prod> explain analyze select jsonb_agg(agg) from (select a, b from table limit 10000) as agg;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1713.21..1713.22 rows=1 width=32) (actual time=26.342..26.343 rows=1 loops=1)
   ->  Subquery Scan on agg  (cost=0.00..1688.21 rows=10000 width=99) (actual time=0.010..8.686 rows=10000 loops=1)
         ->  Limit  (cost=0.00..1588.21 rows=10000 width=75) (actual time=0.006..6.446 rows=10000 loops=1)
               ->  Seq Scan on table  (cost=0.00..117911.17 rows=742417 width=75) (actual time=0.006..5.616 rows=10000 loops=1)
 Planning Time: 0.100 ms
 Execution Time: 27.632 ms
```
Manual concatenation:

```
indexer_prod> explain analyze select '{"a":' || a || ',"b":' || to_json(b) || '}' from table limit 10000;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..1813.21 rows=10000 width=32) (actual time=0.010..12.224 rows=10000 loops=1)
   ->  Seq Scan on table  (cost=0.00..134615.55 rows=742417 width=32) (actual time=0.009..11.392 rows=10000 loops=1)
 Planning Time: 0.154 ms
 Execution Time: 12.682 ms
```
That's 10 milliseconds per 10000 rows. So 1 microsecond per row. If those kinds of differences are meaningful in your system, why are you using node and not, for example, go? Isn't everything else you do with those 10k rows going to dominate that 10 milliseconds ten times over? Like parsing the DB result in node and serializing it back to JSON for the network? Not to mention if you do any kind of operations on those 10k rows.
I wasn't trying to emphasize that this only occurs with thousands of rows. There are many situations where you are selecting more than two columns (which is what the example above was demonstrating for the sake of simplicity). Meaningful (milliseconds) differences in execution time can be shown in under 100 rows (i.e. the size of a typical page of a response), but it's highly dependent on your data set.
`json_agg`:

```
explain analyze select json_agg(agg) from (select * from table_a join table_b on table_a.id = table_b.id limit 100) as agg;
...
 Execution Time: 2.375 ms
```

`jsonb_agg`:

```
> explain analyze select jsonb_agg(agg) from (select * from table_a join table_b on table_a.id = table_b.id limit 100) as agg;
...
 Execution Time: 5.219 ms
```

Note: `select *` is to keep the example simple; I'm not suggesting anyone do this in their queries.
All else being equal, if performance is relevant, you'd rather use `json` than `jsonb`. Especially if you are implementing the pattern discussed in the relations recipe (awesome pattern), where you might have multiple relations all combined into a single result.
My point was more this: with 10k rows (or 100 rows with a huge amount of columns) other parts of the request will also be slow. The total time of a request consists of (at least) these parts:

1. executing the query
2. parsing the DB result (`JSON.parse` + rest of the stuff)
3. serializing the response back to JSON for the network (`JSON.stringify`)

I'd guess `JSON.parse` and `JSON.stringify` are at least as slow as the query. So by saving 2ms on the query, you're actually only making a small part of the request a tiny bit faster. Overall, the users probably won't notice a difference.
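As a rough illustration of that guess, here's a minimal Python sketch (standing in for node's `JSON.parse`/`JSON.stringify`; the row shape mirrors the two-column EXPLAIN examples above, and the timing is machine-dependent):

```python
import json
import time

# Simulate a 10k-row, two-column result set like the EXPLAIN examples above.
rows = [{"a": i, "b": f"value-{i}"} for i in range(10_000)]

start = time.perf_counter()
payload = json.dumps(rows)   # what the server does before hitting the network
parsed = json.loads(payload)  # what the client (or driver) does on receipt
elapsed_ms = (time.perf_counter() - start) * 1000

assert parsed == rows
print(f"serialize + parse: {elapsed_ms:.1f} ms for {len(rows)} rows")
```

The point isn't the exact number, just that the serialize/parse round-trip is in the same ballpark as the few milliseconds saved on the query itself.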
Having said all that, if there is absolutely zero reason to keep `jsonb`, we can switch to `json`. But even the tiniest reason to keep `jsonb` will be more important than this performance difference.
Switched to using `json` instead of `jsonb` in the helpers.
This may or may not be worth your attention (it's probably a niche case?), but I've discovered that `.distinct()` will fail here because you can't use `DISTINCT` with `json_build_object`, but you can with `jsonb_build_object`.
```sql
-- ✅ works
SELECT
  json_agg(DISTINCT jsonb_build_object('id', "players"."id")) AS "whatever"
FROM
  "players"

-- ❌ error: could not identify an equality operator for type json
SELECT
  json_agg(DISTINCT json_build_object('id', "players"."id")) AS "whatever"
FROM
  "players"
```
A super easy workaround here is to just grab the helper from the source code and modify it to use `jsonb_build_object` (you just have to add 1 extra character).
Very excited for the JSON-related helpers added in 7215cc3c4df2a96eedd68d409f63b3350d159c58 to now come built-in to Kysely!

One potential enhancement would be to include `json` variants alongside the `jsonb` forms of the helpers. Depending on what you need the result for, it can be significantly more efficient to use `json` (especially if you are returning a larger result set) since Postgres doesn't need to perform extra work to construct the tree. Once you start JSONifying over 100 rows (or wide result sets with many columns) the performance difference becomes noticeable.

Happy to add this if you are supportive. The one point worth clarifying is if we are willing to change the names of the existing helpers (e.g. `jsonObjectFrom`) to have `jsonb` in the name (e.g. `jsonbObjectFrom`) so that the `json` versions can be easily differentiated.