netzkolchose / django-fast-update

Faster db updates using UPDATE FROM VALUES sql variants.
MIT License

inconsistency in duplicate skipping #11

Closed: jerch closed this issue 2 years ago

jerch commented 2 years ago

The current duplicate skipping is not stable at batch borders: skipped duplicates shift the batch offsets of the following data, which can lead to inconsistent follow-up skipping:

```
pks: [1,2,3,3,2,2,1,1,4], batch_size = 4
fast_update creates: [1,2,3,x,x,x,x,x,4] --> [[1,2,3,4]]
bulk_update creates: [[1,2,3,x],[2,x,1,x],[4]] --> [[1,2,3], [2,1], [4]]
```

This is caused by the prebatching done in bulk_update vs. the aggregated batching in fast_update, so the updates that actually get applied differ in the end.
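For illustration, here is a small standalone sketch (not the library's actual code) that mimics both skipping strategies and reproduces the example above:

```python
def prebatched(pks, batch_size):
    """bulk_update style: slice into batches first, then skip duplicates per batch."""
    result = []
    for i in range(0, len(pks), batch_size):
        seen, kept = set(), []
        for pk in pks[i:i + batch_size]:
            if pk not in seen:
                seen.add(pk)
                kept.append(pk)
        result.append(kept)
    return result

def aggregated(pks, batch_size):
    """fast_update style: skip duplicates while filling a batch up to batch_size."""
    result, current, seen = [], [], set()
    for pk in pks:
        if pk in seen:
            continue
        seen.add(pk)
        current.append(pk)
        if len(current) == batch_size:
            result.append(current)
            current, seen = [], set()
    if current:
        result.append(current)
    return result

pks = [1, 2, 3, 3, 2, 2, 1, 1, 4]
print(prebatched(pks, 4))   # [[1, 2, 3], [2, 1], [4]]
print(aggregated(pks, 4))   # [[1, 2, 3, 4]]
```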

NB: This makes me wonder whether the original behavior is wanted at all - the fact that a second update gets through just because it ended up in a different batch looks like a surprising side effect, especially since batch_size is only meant to give some control over the query load. Wouldn't it be better to treat a single bulk_update call as atomic from the user's perspective, and thus either filter all duplicates from the whole changeset or disallow duplicates altogether? --> https://code.djangoproject.com/ticket/33672

jerch commented 2 years ago

Oh well, the Django ticket got closed as a duplicate, pointing to one that got resolved as "yes, it is a mistake, but not worth fixing, let's just document it". Wth? Idk what's going on there - such handwaving normally does not lead to a better software outcome. Thus I asked for reconsideration, which is very unlikely to happen (the communication is weirdly one-sided anyway).

So this leads to a change of plans on how to deal with duplicates, which gives consistent behavior regarding duplicates that users can rely on, with no side effect from batch_size anymore.
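One way to get such batch_size-independent behavior is to deduplicate over the whole changeset before batching (the first of the two options mentioned above). A minimal sketch, assuming objects expose a pk attribute - not necessarily the fix as implemented, which could just as well disallow duplicates entirely:

```python
from collections import namedtuple

def dedup_whole_changeset(objs, batch_size):
    """Keep only the first occurrence of every pk, then batch the survivors.

    The set of applied updates no longer depends on batch_size - only the slicing does.
    """
    seen, unique = set(), []
    for obj in objs:
        if obj.pk in seen:
            continue   # or raise ValueError(...) here to disallow duplicates entirely
        seen.add(obj.pk)
        unique.append(obj)
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]

# hypothetical usage with plain namedtuples standing in for model instances
Obj = namedtuple("Obj", "pk value")
objs = [Obj(pk, i) for i, pk in enumerate([1, 2, 3, 3, 2, 2, 1, 1, 4])]
print([[o.pk for o in batch] for batch in dedup_whole_changeset(objs, 4)])  # [[1, 2, 3, 4]]
print([[o.pk for o in batch] for batch in dedup_whole_changeset(objs, 2)])  # [[1, 2], [3, 4]]
```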