Open · hwalinga opened this 1 year ago
Oh wow, thx for looking into this :smile_cat: . Will check it out later.
@hwalinga I looked through your changes - looks really good to me.
Still I am hesitating to add this for the following reason - in my early tests about different insert/update tricks I found the new `copy.write_row()` to outperform my manual value formatting done in python by 2-5 times*. And my value formatting only comes close to that speed by a really bad trick - it omits explicit formatting for types whose string repr I found to be understood by postgres' `TEXT` format, so it is def. less reliable than the default formatters provided by psycopg3. Thus I think the whole value formatting as currently done is at stake for psycopg3.
For reference also see https://gist.github.com/jerch/fd0fae0107ce7b153b7540111b2e89ab?permalink_comment_id=4097804#gistcomment-4097804.
TL;DR: With `copy.write_row()` we get safer value formatting and a speedup of 2-5 times*, depending on data types in the records.
[*] insert time on the interim temp table
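For illustration, a minimal sketch of that temp-table flow with psycopg3 (table and column names are made up here, this is not the lib's actual code):

```python
# Sketch: COPY the new values into an interim temp table via write_row(),
# then apply them with a single UPDATE ... FROM.
import psycopg

rows = [(1, "foo", 42), (2, "bar", 43)]  # (id, name, counter)

with psycopg.connect("dbname=test") as conn, conn.cursor() as cur:
    cur.execute(
        "CREATE TEMPORARY TABLE temp_update "
        "(id int, name text, counter int) ON COMMIT DROP"
    )
    # psycopg3 does the value formatting for us here
    with cur.copy("COPY temp_update (id, name, counter) FROM STDIN") as copy:
        for row in rows:
            copy.write_row(row)
    cur.execute(
        "UPDATE exampletable t SET name = s.name, counter = s.counter "
        "FROM temp_update s WHERE t.id = s.id"
    )
```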
Did some preliminary tests with an early psycopg3 rewrite with `write_row()` (values are the time in seconds to update 10k instances):

```
10000 instances
bulk_update: 69.83318078517914
fast_update:  2.205981731414795
copy_update:  0.37508416175842285
```

That's ~186x faster than `bulk_update` and ~6x faster than `fast_update`, while the psycopg2 variant is only ~2.5x faster than `fast_update`. So by using `write_row` we gain another perf bonus of ~3x, plus its implicit type checks. Def. the way to go.
Edit: The binary format gives another 5-10% perf bonus in my first tests. So we should get `BINARY` working, maybe with an option to resort to `TEXT` later on (in case someone has created a complex field without a suitable adapter).
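A rough sketch of what the `BINARY` variant looks like with psycopg3 (assumed table/columns, not the final implementation):

```python
import psycopg

def copy_rows_binary(cur: psycopg.Cursor, rows) -> None:
    # FORMAT BINARY needs exact type information upfront, hence set_types();
    # dropping "(FORMAT BINARY)" would resort to the TEXT protocol instead.
    with cur.copy(
        "COPY temp_update (id, name, counter) FROM STDIN (FORMAT BINARY)"
    ) as copy:
        copy.set_types(["int4", "text", "int4"])
        for row in rows:
            copy.write_row(row)
```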
More numbers on the psycopg3 rewrite, testing different settings like `BINARY`/`TEXT` and using django methods to prepare values. Trying to utilize django's `get_db_prep_save` method on fields for value adaption is quite expensive.

With `get_db_prep_save`:
```
# format: TEXT
10000 instances
fast_update:  2.045956015586853
copy_update:  0.48638129234313965
copy3_update: 0.7911456823348999

# format: BINARY
10000 instances
fast_update:  2.0687363147735596
copy_update:  0.5054255723953247
copy3_update: 0.6903678178787231
```
Without `get_db_prep_save`:
```
# format: TEXT
10000 instances
fast_update:  2.0186638832092285
copy_update:  0.5016790628433228
copy3_update: 0.3683037757873535

# format: BINARY
10000 instances
fast_update:  2.128391742706299
copy_update:  0.5029968023300171
copy3_update: 0.2735722064971924  <-- almost 8x faster than fast_update!
```
(Note that I had to disable some fields on the model for the update for the runtime comparison, as json and inet values must be decorated with their psycopg3 types beforehand, which `get_db_prep_save` does. With these values decorated manually on the source data, the difference between `copy_update` and `copy3_update` gets even bigger.)
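To make the "decorated manually" part concrete - json and inet values have to be wrapped in types psycopg3 knows how to dump (sketch with made-up data):

```python
import ipaddress
from psycopg.types.json import Jsonb

# a plain dict/string would not be adapted as jsonb/inet by write_row()
row = (1, Jsonb({"a": 1}), ipaddress.ip_address("192.168.0.1"))
```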
Perf-wise the best combination is relying on psycopg3's auto adaption (not using `get_db_prep_save`) with the `BINARY` protocol. That's a bit unfortunate, as `get_db_prep_save` is django's official way to mess with field values on db level, which we also should use here. (Note that `copy_update` doesn't use it either, as the encoders there are meant as a faster replacement for value translation into `TEXT` format.)
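For reference, the django-official preparation step discussed here boils down to something like this (a hedged sketch, the field iteration is assumed):

```python
from django.db import connection

def prepare_row(obj, fields):
    # run every value through its field's get_db_prep_save() so custom
    # fields get their official db-level conversion applied
    return tuple(
        f.get_db_prep_save(getattr(obj, f.attname), connection)
        for f in fields
    )
```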
In general it is a good thing to be more in line with django default features, so we prolly should use `get_db_prep_save` by default. Since this lib is about "as fast as possible", we could introduce two more arguments to deal with that:

- `_skip_valuepreparation` - skips the value preparation, the caller then has to provide values psycopg3 can adapt directly (e.g. `Jsonb(value)` for a jsonb field), should default to `False`
Edit: More perf squeezing - with also skipping the sanity checks, `BINARY` + `_skip_valuepreparation` ends up being really fast:

```
copy3_update: 0.22316670417785645  <-- now almost 10x faster than fast_update
```

Maybe `_skip_valuepreparation` should also skip the sanity checks, as the values must be in line anyway...
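A purely hypothetical call shape for such a flag (the name is from the discussion above, the surrounding API is assumed and not final):

```python
# caller guarantees psycopg3-ready values (e.g. Jsonb-wrapped json),
# so value preparation and sanity checks can be skipped
copy_update(MyModel, instances, fields=["name", "counter"],
            _skip_valuepreparation=True)
```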
I made the code compatible with psycopg 3. The code now works on psycopg2 and psycopg 3.

- I made a `compat_copy_from`, as `connection.copy_from` is not available in psycopg 3 anymore, and used that in the `copy_from` function (see the sketch after this list).
- I made a function to get the connection's encoding on both platforms.
- The Range types are now all just `Range`, so I made a compatibility block that changes all the different Range types to just `Range` on psycopg 3.

In the tests:

- The binary fields just return bytes now, so there is a compat function `tobytes` that just returns bytes on both platforms.
- The empty nested array is not reduced anymore on psycopg 3, so those tests are now skipped for psycopg 3.
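The compat entry point could look roughly like this (a sketch following the description above; the actual PR code may differ):

```python
try:
    import psycopg  # psycopg 3
    from psycopg import sql
    PSYCOPG3 = True
except ImportError:
    PSYCOPG3 = False

def compat_copy_from(cursor, file_obj, table, columns):
    if PSYCOPG3:
        # copy_from() is gone in psycopg 3 - stream the payload through copy()
        stmt = sql.SQL("COPY {} ({}) FROM STDIN").format(
            sql.Identifier(table),
            sql.SQL(", ").join(sql.Identifier(c) for c in columns),
        )
        with cursor.copy(stmt) as copy:
            while data := file_obj.read(65536):
                copy.write(data)
    else:
        # psycopg2 still ships the old fast path
        cursor.copy_from(file_obj, table, columns=columns)
```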
Performance
I am unsure if this is as performant as possible on psycopg 3, but at least it is now compatible with both platforms.