netzkolchose / django-computedfields

Provides autogenerated autoupdated database fields for model methods.
MIT License
94 stars 14 forks source link

faster cf updates #97

Closed jerch closed 2 years ago

jerch commented 2 years ago

Playground to eval speed differences between certain update tricks.

UPDATE FROM gives a rather big speedup in postgres and sqlite >= 3.33 (testcase is to update 10000 records of SelfRef):

# sqlite with bulk_update
>>> timer(lambda : update_dependent(SelfRef.objects.all()))
18.874271631240845
# sqlite with UPDATE FROM
>>> timer(lambda : update_dependent(SelfRef.objects.all()))
3.453566312789917

# postgres with bulk_update
>>> timer(lambda : update_dependent(SelfRef.objects.all()))
17.17230486869812
# postgres with UPDATE FROM
>>> timer(lambda : update_dependent(SelfRef.objects.all()))
1.7931149005889893

Edit: With mariadb >= 10.3 UPDATE FROM can be simulated without creating a temp table like this:

UPDATE exampleapp_selfref
INNER JOIN (SELECT * FROM (VALUES (0, 1), (1,'yay!')) AS foo) AS data
ON exampleapp_selfref.id = data.0
SET exampleapp_selfref.name = data.1;

Update - runtime for mysql:

# with bulk_update
>>> timer(lambda : update_dependent(SelfRef.objects.all()))
15.962127923965454
# with UPDATE FROM workaround
>>> timer(lambda : update_dependent(SelfRef.objects.all()))
2.10331130027771
jerch commented 2 years ago

Wow, this optimized update query scales much better with a bigger batchsize than bulk_update. Numbers for COMPUTEDFIELDS_BATCHSIZE = 1000 on 10000 records:

# sqlite - big benefit
>>> timer(lambda : update_dependent(SelfRef.objects.all()))
0.6897451877593994
# postgres - small benefit
>>> timer(lambda : update_dependent(SelfRef.objects.all()))
1.6358757019042969
# mariadb - big benefit
>>> timer(lambda : update_dependent(SelfRef.objects.all()))
0.7499890327453613

Edit: postgres actually performs better with a single big statement than with using execute_values. Here the corrected numbers with the same query construction logic as done for the others for batchsize 1000:

# postgres - now a big benefit too
>>> timer(lambda : update_dependent(SelfRef.objects.all()))
0.6264104843139648

(Phew - postgres is ahead again :smile_cat:)

jerch commented 2 years ago

To summarize things for our particular perf test case:

UPDATE FROM VALUES seems to be a neat trick to get multiple columns and rows updated very fast in one go without the need to resort to a temp table in between. There is still a chance, that for very big updates (>100k records at once) a temp table might perform better, but the complexity of this seem not worth it.

Currently the main issue with this optimization is the support in db backends:

coveralls commented 2 years ago

Pull Request Test Coverage Report for Build 1751598454


Changes Missing Coverage Covered Lines Changed/Added Lines %
computedfields/fast_update.py 118 136 86.76%
<!-- Total: 132 150 88.0% -->
Totals Coverage Status
Change from base Build 1715609233: -1.2%
Covered Lines: 1338
Relevant Lines: 1389

💛 - Coveralls
jerch commented 2 years ago

Left to do: