jackc / pgtype

MIT License
308 stars 110 forks source link

Decode JSONBOID as binary bytes, like TextOID and JSONOID #115

Open sean- opened 3 years ago

sean- commented 3 years ago

This results in a >10x improvement in throughput when working with the JSONB data type.

sean- commented 3 years ago

I'm assuming this PR is correct (this PR was dry-coded). Registering the following data type worked around the performance problem I was experiencing (reading data from one database and writing it to another).

    conn.ConnInfo().RegisterDataType(pgtype.DataType{Value: &pgtype.Bytea{}, Name: "jsonb", OID: pgtype.JSONBOID})

Before:

     0.04s 0.035%  5.17%     36.30s 31.61%  github.com/jackc/pgx/v4.(*Conn).execParamsAndPreparedPrefix
     0.02s 0.017%  5.19%     36.21s 31.53%  github.com/jackc/pgx/v4.(*extendedQueryBuilder).AppendParam
    36.01s 31.36% 36.55%     36.13s 31.46%  runtime.heapBitsForAddr (inline)
     0.07s 0.061% 36.61%     36.01s 31.36%  github.com/jackc/pgx/v4.(*extendedQueryBuilder).encodeExtendedParamValue
     0.02s 0.017% 36.62%     35.22s 30.67%  github.com/jackc/pgtype.(*JSONB).Set
     0.01s 0.0087% 36.63%     35.20s 30.65%  github.com/jackc/pgtype.(*JSON).Set
     0.25s  0.22% 36.85%     30.09s 26.20%  reflect.copyVal
     0.11s 0.096% 36.95%     27.85s 24.25%  reflect.unsafe_New
     0.29s  0.25% 37.20%     25.13s 21.88%  reflect.Value.MapKeys
         0     0% 37.21%     22.33s 19.44%  runtime.systemstack
     0.06s 0.052% 37.26%     22.05s 19.20%  github.com/jackc/pgx/v4.(*connRows).Values
         0     0% 37.26%     21.16s 18.43%  encoding/json.Unmarshal
         0     0% 37.26%     21.16s 18.43%  github.com/jackc/pgtype.JSON.Get
         0     0% 37.26%     21.16s 18.43%  github.com/jackc/pgtype.JSONB.Get
     0.68s  0.59% 37.85%     19.55s 17.02%  reflect.Value.MapIndex
     0.05s 0.044% 38.01%     15.76s 13.72%  runtime.(*mcache).nextFree
     0.44s  0.38% 38.39%     15.62s 13.60%  runtime.(*mcache).refill
     0.21s  0.18% 38.58%     14.31s 12.46%  runtime.(*mcentral).cacheSpan
     0.12s   0.1% 38.68%     13.88s 12.09%  encoding/json.(*decodeState).valueInterface
jackc commented 3 years ago

I don't think this will work in the general case as is. The binary format of jsonb includes a 1 byte prefix that must be stripped.

Beyond that, some care needs to be taken when choosing whether binary or text format is preferred. I've found that the text format is significantly faster in some cases. Not sure whether that is the case with jsonb though. See https://www.postgresql.org/message-id/CAMovtNoHFod2jMAKQjjxv209PCTJx5Kc66anwWvX0mEiaXwgmA%40mail.gmail.com for discussion on the PG hackers list.

Oh, one other thing. If your underlying use case is copying data from one database to another at high speed then you should check out the copy protocol support in pgconn. You can use CopyFrom and CopyTo to copy data with much less overhead. Could be far faster than anything that uses a query.

sean- commented 3 years ago

I don't think this will work in the general case as is. The binary format of jsonb includes a 1 byte prefix that must be stripped.

Oh! Interesting, I didn't notice that... but that's probably because I was splatting things into a different jsonb column, so everything checksumed correctly on the source and destination.

Beyond that, some care needs to be taken when choosing whether binary or text format is preferred. I've found that the text format is significantly faster in some cases. Not sure whether that is the case with jsonb though. See https://www.postgresql.org/message-id/CAMovtNoHFod2jMAKQjjxv209PCTJx5Kc66anwWvX0mEiaXwgmA%40mail.gmail.com for discussion on the PG hackers list.

This is actually querying data from CockroachDB, not PostgreSQL. In this case, the team will be querying the data as JSON. I suppose it's possible to cast TEXT to JSONB to extract meaningful data, but that seems pretty expensive, too. I thought, maybe incorrectly, that JSONB would still be sent over the wire as TEXT, and that the only difference was on-disk and the lack of ordering guarantees of the record after it was written, etc. But, if you queried a JSONB column, you'd get a text value back that was the TEXT encoding of the JSONB. ?

https://www.postgresql.org/docs/current/datatype-json.html

Oh, one other thing. If your underlying use case is copying data from one database to another at high speed then you should check out the copy protocol support in pgconn. You can use CopyFrom and CopyTo to copy data with much less overhead. Could be far faster than anything that uses a query.

With the Bytea{} data type, in our workload we were network-bound, not CPU bound. For the size of data and level of parallelism, CopyFrom() could work just as easily as the query. In nearly all cases, the writes were the limiting factor as long as JSONB was not decoded as JSON.

rafiss commented 3 years ago

But, if you queried a JSONB column, you'd get a text value back that was the TEXT encoding of the JSONB. ?

It does seem that way. This is a Postgres SQL shell:

rafiss@127:postgres> select '"\uD83D\uDE80"'::jsonb
+---------+
| jsonb   |
|---------|
| "🚀"    |
+---------+
SELECT 1

rafiss@127:postgres> select '"\uD83D\uDE80"'::json
+----------------+
| json           |
|----------------|
| "\uD83D\uDE80" |
+----------------+
SELECT 1