netzkolchose / django-fast-update

Faster db updates using UPDATE FROM VALUES sql variants.
MIT License
20 stars 2 forks source link

fast_update dies when a field is provided multiple times in the fields parameter #1

Closed mikicz closed 2 years ago

mikicz commented 2 years ago

Hey, was testing the package (coming from the django ticket, it was reported based on the data I gathered) and this was a thing that took me a while to figure out. In one of my company's projects I replaced all of the usages of django-bulk-update with fast_update and ran tests and some began failing and I figured out that in some cases we sent fields duplicated in the fields parameter and it failed with an ugly error in the SQL, after fixing that on our part all tests passed.

The error looked something like

E               django.db.utils.ProgrammingError: column reference "b" is ambiguous
E               LINE 1: UPDATE "A" SET "b"=CAST("d"."b"...
E                   

Because the query looked something

UPDATE "A"
SET "b"=CAST("d"."b" AS numeric(20, 2)),
    "c"=CAST("d"."c" AS varchar(255)),
    "b"=CAST("d"."b" AS numeric(20, 2))
FROM (VALUES (%s, %s, %s, %s)) AS "d" ("id", "b", "c", "b")
WHERE "A"."id" = "d"."id"

It would be nice to get a more explicit message for this, for example from the sanity_check function? Should be an easy check on whether the length of fields is the same as length of set of fields?

jerch commented 2 years ago

@mikicz Thx for testing it out :+1:

Oh well I did not even think anyone would try to send duplicated fields, so this is not tested for/dealt with. Will have to check, how bulk_update deals with that, easiest is prolly to do a set reduction in sanity_check (if its compatible to bulk_update).

mikicz commented 2 years ago

I'm not sure about the efficiency of how it works in bulk update, but it definitely doesn't throw an error. I am definitely going to deduplicate the fields even for bulk update though, just because it's nicer

jerch commented 2 years ago

bulk_update makes an implicit field reduction in https://github.com/django/django/blob/0b31e024873681e187b574fe1c4afe5e48aeeecf/django/db/models/query.py#L766 by possibly overwriting previously set key/value. In terms of efficiency thats somewhat wasteful, as the constructed CASE value will be the same as before (since field values cannot change in between). Imho that should be reduced prehand there as well.

Imho a simple set reduction as fields = list(set(fields)) will do the trick here, we dont even have to track initial field ordering (for the created SQL it does not matter where the non pk columns occur). We still need a stable ordering down the road (thus the list conversion), so values dont slip into the wrong column.