Open sean- opened 3 years ago
I'm assuming this PR is correct (this PR was dry-coded). Registering the following data type worked around the performance problem I was experiencing (reading data from one database and writing it to another).
conn.ConnInfo().RegisterDataType(pgtype.DataType{Value: &pgtype.Bytea{}, Name: "jsonb", OID: pgtype.JSONBOID})
Before:
0.04s 0.035% 5.17% 36.30s 31.61% github.com/jackc/pgx/v4.(*Conn).execParamsAndPreparedPrefix
0.02s 0.017% 5.19% 36.21s 31.53% github.com/jackc/pgx/v4.(*extendedQueryBuilder).AppendParam
36.01s 31.36% 36.55% 36.13s 31.46% runtime.heapBitsForAddr (inline)
0.07s 0.061% 36.61% 36.01s 31.36% github.com/jackc/pgx/v4.(*extendedQueryBuilder).encodeExtendedParamValue
0.02s 0.017% 36.62% 35.22s 30.67% github.com/jackc/pgtype.(*JSONB).Set
0.01s 0.0087% 36.63% 35.20s 30.65% github.com/jackc/pgtype.(*JSON).Set
0.25s 0.22% 36.85% 30.09s 26.20% reflect.copyVal
0.11s 0.096% 36.95% 27.85s 24.25% reflect.unsafe_New
0.29s 0.25% 37.20% 25.13s 21.88% reflect.Value.MapKeys
0 0% 37.21% 22.33s 19.44% runtime.systemstack
0.06s 0.052% 37.26% 22.05s 19.20% github.com/jackc/pgx/v4.(*connRows).Values
0 0% 37.26% 21.16s 18.43% encoding/json.Unmarshal
0 0% 37.26% 21.16s 18.43% github.com/jackc/pgtype.JSON.Get
0 0% 37.26% 21.16s 18.43% github.com/jackc/pgtype.JSONB.Get
0.68s 0.59% 37.85% 19.55s 17.02% reflect.Value.MapIndex
0.05s 0.044% 38.01% 15.76s 13.72% runtime.(*mcache).nextFree
0.44s 0.38% 38.39% 15.62s 13.60% runtime.(*mcache).refill
0.21s 0.18% 38.58% 14.31s 12.46% runtime.(*mcentral).cacheSpan
0.12s 0.1% 38.68% 13.88s 12.09% encoding/json.(*decodeState).valueInterface
I don't think this will work in the general case as is. The binary format of jsonb includes a 1 byte prefix that must be stripped.
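For illustration, the stripping described above can be sketched as a small helper. This is a hypothetical function, not part of pgtype: PostgreSQL's binary jsonb format is a single version byte (currently 1) followed by the JSON text, so decoding amounts to validating and removing that first byte.

```go
package main

import "fmt"

// decodeBinaryJSONB is a hypothetical helper (not a pgtype API): it strips
// the leading version byte from a binary-format jsonb payload, returning
// the plain JSON text that follows it.
func decodeBinaryJSONB(src []byte) ([]byte, error) {
	if len(src) == 0 {
		return nil, fmt.Errorf("jsonb payload too short")
	}
	if src[0] != 1 {
		// Only version 1 is defined today; anything else is unexpected.
		return nil, fmt.Errorf("unknown jsonb version %d", src[0])
	}
	return src[1:], nil
}

func main() {
	// Simulated binary jsonb payload: version byte 1, then JSON text.
	wire := append([]byte{1}, []byte(`{"a":1}`)...)
	out, err := decodeBinaryJSONB(wire)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // → {"a":1}
}
```

A Bytea-based workaround like the one in this PR would hand the caller the raw payload including that version byte, which is why it only round-trips safely when both sides are jsonb columns.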
Beyond that, some care needs to be taken when choosing whether binary or text format is preferred. I've found that the text format is significantly faster in some cases. Not sure whether that is the case with jsonb though. See https://www.postgresql.org/message-id/CAMovtNoHFod2jMAKQjjxv209PCTJx5Kc66anwWvX0mEiaXwgmA%40mail.gmail.com for discussion on the PG hackers list.
Oh, one other thing. If your underlying use case is copying data from one database to another at high speed then you should check out the copy protocol support in pgconn. You can use CopyFrom and CopyTo to copy data with much less overhead. Could be far faster than anything that uses a query.
I don't think this will work in the general case as is. The binary format of jsonb includes a 1 byte prefix that must be stripped.
Oh! Interesting, I didn't notice that... but that's probably because I was splatting things into a different jsonb column, so everything checksummed correctly on the source and destination.
Beyond that, some care needs to be taken when choosing whether binary or text format is preferred. I've found that the text format is significantly faster in some cases. Not sure whether that is the case with jsonb though. See https://www.postgresql.org/message-id/CAMovtNoHFod2jMAKQjjxv209PCTJx5Kc66anwWvX0mEiaXwgmA%40mail.gmail.com for discussion on the PG hackers list.
This is actually querying data from CockroachDB, not PostgreSQL. In this case, the team will be querying the data as JSON. I suppose it's possible to cast TEXT to JSONB to extract meaningful data, but that seems pretty expensive, too. I thought, maybe incorrectly, that JSONB would still be sent over the wire as TEXT, and that the only difference was on-disk and the lack of ordering guarantees of the record after it was written, etc. But, if you queried a JSONB column, you'd get a text value back that was the TEXT encoding of the JSONB. ?
https://www.postgresql.org/docs/current/datatype-json.html
Oh, one other thing. If your underlying use case is copying data from one database to another at high speed then you should check out the copy protocol support in pgconn. You can use CopyFrom and CopyTo to copy data with much less overhead. Could be far faster than anything that uses a query.
With the Bytea{} data type, in our workload we were network-bound, not CPU-bound. For the size of data and level of parallelism, CopyFrom() could work just as easily as the query. In nearly all cases, the writes were the limiting factor as long as JSONB was not decoded as JSON.
But, if you queried a JSONB column, you'd get a text value back that was the TEXT encoding of the JSONB. ?
It does seem that way. This is a Postgres SQL shell:
rafiss@127:postgres> select '"\uD83D\uDE80"'::jsonb
+---------+
| jsonb |
|---------|
| "🚀" |
+---------+
SELECT 1
rafiss@127:postgres> select '"\uD83D\uDE80"'::json
+----------------+
| json |
|----------------|
| "\uD83D\uDE80" |
+----------------+
SELECT 1
This results in a >10x improvement in throughput when working with the JSONB data type.