Thx for finding this. Tbh I am not very fond of the current query construction, esp. around the DISTINCT clause (see #101, where I tried to tackle the issue, but did not get far due to ORM limitations).
Whether the UNION construction or the IN filtering is the better approach I am not quite sure yet; at first glance I'd prefer the IN/subquery approach, as UNION is a quite different operation. Does UNION pass all test cases locally? (The test cases should be pretty complete in terms of what is expressible with the ORM.)
If both ways turn out to be equal (not in the relational algebra sense, but in the updated results), then I'd opt for the faster one. But this needs several more tests with different model setups. Can you describe your concrete model schematics, and maybe the cardinality/selectivity of the filtered fields during the inserts? Were those single inserts per transaction, or multiples bulked under one transaction?
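To make the comparison concrete, here is a rough sketch of the three query shapes in ORM terms (`SomeModel`, `dep_a_id`/`dep_b_id` and `pk` are placeholders for illustration, not library code):

```python
from django.db.models import Q

# current approach: one queryset, conditions OR-combined, DISTINCT on top
qs_or = SomeModel.objects.filter(Q(dep_a_id=pk) | Q(dep_b_id=pk)).distinct()

# UNION proposal: one simple SELECT per dependency, combined by the database
qs_union = SomeModel.objects.filter(dep_a_id=pk).union(
    SomeModel.objects.filter(dep_b_id=pk)
)

# IN/subquery variant: push the combination into a subquery and keep the
# outer queryset a plain one (so filter()/distinct() remain available)
qs_in = SomeModel.objects.filter(
    pk__in=SomeModel.objects.filter(dep_a_id=pk).values("pk").union(
        SomeModel.objects.filter(dep_b_id=pk).values("pk")
    )
)
```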
@jerch Thank you for your response.
"Does UNION pass all test cases locally? (The test cases should be pretty complete in terms of what is expressible with the ORM)." - Yes it does, all tests pass locally (for both scenarios: fast update flag - true/false), also in my project, we switched to fork with a union, cus the issue that we faced was critical. We didn't face any problems so far (from yesterday :D), also we do not use the fast updates flag (it is set to false).
In our case the model structure looks like this:
```python
class LedgerEntry(ComputedFieldsModel):
    ...

    def some_computed_field_depends_on_childs(self):
        pass


class Transaction(LedgerEntry):
    ...


class ReturnTransaction(LedgerEntry):
    transaction = models.ForeignKey(Transaction, on_delete=models.CASCADE)
```
We were creating multiple database transactions, and in each transaction we were creating a Transaction record. That record creation forced the computed fields library to do its calculations and to generate the OR-based query that I described in the first message.
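Roughly, with the related accessor names guessed from the stub above (just to illustrate the shape, not the exact generated query), the library ends up with an OR-combined filter like:

```python
from django.db.models import Q

# illustrative only: accessor names are assumptions based on the stub above,
# new_transaction is the freshly created Transaction instance
affected = LedgerEntry.objects.filter(
    Q(returntransaction__transaction=new_transaction)  # dependency via ReturnTransaction.transaction
    | Q(transaction=new_transaction)                    # dependency via the Transaction child itself
).distinct()
```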
Just checked the test cases - multiple deps to a single other model are not really tested besides once for table inheritance :blush:
I did a first test with the setup below (see the end of this comment). Note that I have not used another model indirection, so there are no joins in the query. Most likely Postgres dropped to a seq scan on your joined table, as both filtered fields are indirected through joins.
Could you provide a more complete stub of your model setup? It does not need to contain any revealing names and such, just the cf declarations with model stubs. Also, Postgres' planner output would be helpful. Do you use table inheritance by any chance? Because w/o table inheritance it is quite tricky to even construct your case (thus it is somewhat an edge case, and sadly table inheritance is known to be very costly with computed fields).
The problem with the UNION approach is again the DISTINCT clause: while Postgres supports combining those, the ORM does not. So while you might see a big perf improvement here, it might be much worse for a different model setup.
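A quick illustration of that ORM limitation, using the test models from the end of this comment (behavior as of recent Django versions):

```python
from django.db.utils import NotSupportedError

qs = MulDerived.objects.filter(f_id=1).union(MulDerived.objects.filter(ff_id=1))
try:
    qs.distinct()   # Postgres could express DISTINCT over a UNION, the ORM refuses
except NotSupportedError as exc:
    print(exc)      # e.g. "Calling QuerySet.distinct() after union() is not supported."
```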
Edit:
What issues do you see for `fast_update` here, besides possible duplications (from the missing DISTINCT)?
Simple test w'o joins:
```python
class MulField(models.Model):
    f1 = models.IntegerField()
    f2 = models.IntegerField()


class MulDerived(ComputedFieldsModel):
    f = models.ForeignKey(MulField, on_delete=models.CASCADE)
    ff = models.ForeignKey(MulField, on_delete=models.CASCADE, related_name='f_second')

    @computed(
        models.IntegerField(),
        depends=[('f', ['f1']), ('ff', ['f2'])],
        prefetch_related=('f', 'ff')
    )
    def add(self):
        return self.f.f1 + self.ff.f2

    @computed(
        models.IntegerField(),
        depends=[('f', ['f1']), ('ff', ['f2'])],
        prefetch_related=('f', 'ff')
    )
    def prod(self):
        return self.f.f1 * self.ff.f2
```
I created 1000 `MulField` instances linked randomly to `f` and `ff` on 100k `MulDerived` instances (thus selectivity is ~0.01). The test created 20 new `MulField` instances:
```python
>>> t(lambda: [MulField.objects.create(f1=1,f2=2) for i in range(20)])
```
with these results:
In postgres I get the following select queries after a single insert:
postgres=# EXPLAIN ANALYZE SELECT DISTINCT "exampleapp_mulderived"."id", "exampleapp_mulderived"."f_id", "exampleapp_mulderived"."ff_id", "exampleapp_mulderived"."add", "exampleapp_mulderived"."prod" FROM "exampleapp_mulderived" WHERE ("exampleapp_mulderived"."ff_id" = 1020 OR "exampleapp_mulderived"."f_id" = 1020) ORDER BY "exampleapp_mulderived"."id" ASC LIMIT 10000;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=544.56..547.51 rows=197 width=20) (actual time=0.059..0.063 rows=0 loops=1)
-> Unique (cost=544.56..547.51 rows=197 width=20) (actual time=0.057..0.060 rows=0 loops=1)
-> Sort (cost=544.56..545.05 rows=197 width=20) (actual time=0.056..0.058 rows=0 loops=1)
Sort Key: id, f_id, ff_id, add, prod
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on exampleapp_mulderived (cost=10.17..537.05 rows=197 width=20) (actual time=0.043..0.045 rows=0 loops=1)
Recheck Cond: ((ff_id = 1020) OR (f_id = 1020))
-> BitmapOr (cost=10.17..10.17 rows=197 width=0) (actual time=0.038..0.040 rows=0 loops=1)
-> Bitmap Index Scan on exampleapp_mulderived_ff_id_3536b932 (cost=0.00..5.04 rows=99 width=0) (actual time=0.028..0.028 rows=0 loops=1)
Index Cond: (ff_id = 1020)
-> Bitmap Index Scan on exampleapp_mulderived_f_id_3a01ffa2 (cost=0.00..5.04 rows=99 width=0) (actual time=0.008..0.009 rows=0 loops=1)
Index Cond: (f_id = 1020)
Planning Time: 0.324 ms
Execution Time: 0.175 ms
(14 rows)
postgres=# EXPLAIN ANALYZE SELECT "exampleapp_mulderived"."id", "exampleapp_mulderived"."f_id", "exampleapp_mulderived"."ff_id", "exampleapp_mulderived"."add", "exampleapp_mulderived"."prod" FROM "exampleapp_mulderived" WHERE "exampleapp_mulderived"."id" IN ((SELECT U0."id" FROM "exampleapp_mulderived" U0 WHERE U0."ff_id" = 1040) UNION (SELECT U0."id" FROM "exampleapp_mulderived" U0 WHERE U0."f_id" = 1040)) ORDER BY "exampleapp_mulderived"."id" ASC LIMIT 10000;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2131.59..2132.08 rows=198 width=20) (actual time=0.063..0.069 rows=0 loops=1)
-> Sort (cost=2131.59..2132.08 rows=198 width=20) (actual time=0.061..0.066 rows=0 loops=1)
Sort Key: exampleapp_mulderived.id
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=626.36..2124.03 rows=198 width=20) (actual time=0.051..0.055 rows=0 loops=1)
-> HashAggregate (cost=625.94..627.92 rows=198 width=4) (actual time=0.050..0.053 rows=0 loops=1)
Group Key: u0.id
Batches: 1 Memory Usage: 40kB
-> Append (cost=5.06..625.45 rows=198 width=4) (actual time=0.044..0.046 rows=0 loops=1)
-> Bitmap Heap Scan on exampleapp_mulderived u0 (cost=5.06..311.24 rows=99 width=4) (actual time=0.032..0.032 rows=0 loops=1)
Recheck Cond: (ff_id = 1040)
-> Bitmap Index Scan on exampleapp_mulderived_ff_id_3536b932 (cost=0.00..5.04 rows=99 width=0) (actual time=0.026..0.027 rows=0 loops=1)
Index Cond: (ff_id = 1040)
-> Bitmap Heap Scan on exampleapp_mulderived u0_1 (cost=5.06..311.24 rows=99 width=4) (actual time=0.010..0.010 rows=0 loops=1)
Recheck Cond: (f_id = 1040)
-> Bitmap Index Scan on exampleapp_mulderived_f_id_3a01ffa2 (cost=0.00..5.04 rows=99 width=0) (actual time=0.009..0.009 rows=0 loops=1)
Index Cond: (f_id = 1040)
-> Index Scan using exampleapp_mulderived_pkey on exampleapp_mulderived (cost=0.42..7.55 rows=1 width=20) (never executed)
Index Cond: (id = u0.id)
Planning Time: 0.682 ms
Execution Time: 0.202 ms
(21 rows)
It shows that Postgres is able to use the index for both fields in both cases, but actually has more work to do for the UNION approach, which is therefore slightly slower. Ofc this picture will change if joins are involved, as the DBMS cannot use table indexes anymore on joined-in fields.
@jerch Yes, we use table inheritance, like in the example.
Yes, I agree: unions only give a performance boost in cases where there are indexes and the current approach would generate joins that prevent Postgres from using them. But these cases (joins/ORs across multiple tables) could even be predicted during path iteration at query construction time, by comparing paths, so we would be able to pick a suitable solution.
Subqueries might have better performance than unions in regular cases.
In the case of UNION you do not need an explicit DISTINCT; UNION deduplicates out of the box, since it returns unique records. That's why I added a condition to skip it for the union case. Are there cases where the library does not add a DISTINCT statement to the query?
The query plan for my case:
Unique (cost=17981.40..18001.13 rows=142 width=77) (actual time=1163.356..1166.962 rows=1 loops=1)
-> Gather Merge (cost=17981.40..17997.93 rows=142 width=77) (actual time=1163.353..1166.930 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=16981.37..16981.52 rows=59 width=77) (actual time=1160.720..1160.741 rows=0 loops=3)
" Sort Key: sv_cards_ledgerentry.id"
Sort Method: quicksort Memory: 25kB
Worker 0: Sort Method: quicksort Memory: 25kB
Worker 1: Sort Method: quicksort Memory: 25kB
-> Parallel Hash Left Join (cost=4413.85..16979.64 rows=59 width=77) (actual time=919.387..1160.595 rows=0 loops=3)
Hash Cond: (sv_cards_ledgerentry.id = t4.ledgerentry_ptr_id)
Filter: ((sv_cards_svreturntransaction.transaction_id = 6386) OR (t4.ledgerentry_ptr_id = 6386))
Rows Removed by Filter: 204526
-> Hash Left Join (cost=141.25..12035.93 rows=255658 width=81) (actual time=10.902..760.889 rows=204527 loops=3)
Hash Cond: (sv_cards_ledgerentry.id = sv_cards_svreturntransaction.ledgerentry_ptr_id)
-> Parallel Seq Scan on sv_cards_ledgerentry (cost=0.00..11223.58 rows=255658 width=77) (actual time=0.008..249.329 rows=204527 loops=3)
-> Hash (cost=86.11..86.11 rows=4411 width=8) (actual time=10.855..10.859 rows=4411 loops=3)
Buckets: 8192 Batches: 1 Memory Usage: 237kB
-> Seq Scan on sv_cards_svreturntransaction (cost=0.00..86.11 rows=4411 width=8) (actual time=0.016..5.424 rows=4411 loops=3)
-> Parallel Hash (cost=2938.38..2938.38 rows=106738 width=4) (actual time=150.710..150.714 rows=60452 loops=3)
Buckets: 262144 Batches: 1 Memory Usage: 9216kB
-> Parallel Seq Scan on sv_cards_svtransaction t4 (cost=0.00..2938.38 rows=106738 width=4) (actual time=0.008..74.534 rows=60452 loops=3)
Planning Time: 0.453 ms
Execution Time: 1167.070 ms
Switching to a union allows the database to use indexes and dramatically reduced the execution time compared to this plan.
> In the case of UNION you do not need an explicit DISTINCT; UNION deduplicates out of the box, since it returns unique records.
Yeah, that's clear. I wasn't sure if the ORM treats this case properly:
```python
>>> print(MA.objects.none().union(MA.objects.all()).query)
SELECT "exampleapp_ma"."id", "exampleapp_ma"."f1", "exampleapp_ma"."f2" FROM "exampleapp_ma"
```
(because that's how the field deps get added). Well, it simply reverts to a normal SELECT and thus may still contain duplicates, but that would get "patched" afterwards in the code by your `queryset.query.combinator != "union"` condition.
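For readers following along, that guard boils down to a small helper like this (the function name is mine for illustration, not the PR's exact code):

```python
def maybe_distinct(queryset):
    # only add DISTINCT to plain querysets; a UNION already deduplicates,
    # and the ORM would reject .distinct() on a combined queryset anyway
    if queryset.query.combinator != "union":
        queryset = queryset.distinct()
    return queryset
```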
Since the change with UNION is much less invasive, I'd lean towards that atm.
The question remains: which issues do you see for `fast_update` here? Since that operates only on changed rows collected on the Python side, I don't actually see any issues there.
(On a sidenote: you'll get a really big perf boost using `fast_update` for tables >100k rows, esp. for bigger bulk changes and when calling `update_dependent` yourself.)
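For context, that pattern looks roughly like this (`SomeModel` and the filter are placeholders; `update_dependent` is the helper exposed by `computedfields.models`):

```python
from computedfields.models import update_dependent

# bulk-change rows without per-instance save() calls ...
SomeModel.objects.filter(flagged=True).update(price=0)
# ... then resync the dependent computed fields for those rows in one go
update_dependent(SomeModel.objects.filter(flagged=True))
```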
@jerch
If you remove my `if self.use_fastupdate:` condition, which skips the union in the fast update case, two tests would fail. That happens because fast update delegates to `bulk_update`, where Django performs filtering on the queryset, and in the Django ORM you can't filter a union.
The mentioned tests are in `_test_full.tests.testmultitable.TestMultiTable`. With fast_update disabled, it seems that there are no bulk updates, so there is no filtering over a union queryset, and everything passes.
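That ORM restriction looks like this in isolation (models and pks are placeholders; on the Django version used here this raises immediately):

```python
# bulk_update batches its work via qs.filter(pk__in=...), which a combined
# queryset rejects
combined = LedgerEntry.objects.filter(pk=1).union(LedgerEntry.objects.filter(pk=2))
combined.filter(pk__in=[1, 2])   # raises NotSupportedError
```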
Well, that's easy to fix (although slightly more expensive):
```diff
@@ -553,7 +555,7 @@ class Resolver:
                     self._update(queryset, change, fields)
                     change = []
             if change:
-                self._update(queryset, change, fields)
+                self._update(model._base_manager.filter(pk__in=pks), change, fields)
```
Since `bulk_updater` has to collect the pks anyway, we can escape that issue with a new query.
> With fast_update disabled, it seems that there are no bulk updates, so there is no filtering over a union queryset, and everything passes.
That's actually pretty weird. All that `fast_update` does is delegate nonlocal concrete fields to `bulk_update` (happens with inherited tables), while local concrete fields are updated without any queryset trickery by an `UPDATE ... FROM VALUES ...`. With fast_update disabled it would just use `bulk_update` for all fields and thus should run into the same issue. Hmm, there is something off here, it seems. (Edit: Nope, it's not weird. This is because `bulk_update` is called on the manager, which implicitly refers to `.all()` as the queryset.)
Btw, it should be safe to just use `self._update(model._base_manager.all(), change, fields)` for all `self._update` invocations. Wanna do that change? Then I think the PR can be merged.
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| computedfields/resolver.py | 4 | 5 | 80.0% |

| Totals | |
|---|---|
| Change from base Build 4690356054: | 0.01% |
| Covered Lines: | 1108 |
| Relevant Lines: | 1154 |
Yea, sure, thank you. But what about detection? I think some not-so-complicated code might be able to keep the current ORs for particular multi-conditional statements: when the only difference is in the path ends, it means the querysets perform the same joins and just use different fields from the same table, like in your example. Otherwise: union. That way we would reduce the union impact in simple cases. What do you think? For sure it would be an additional complication in the code, but maybe worth it.
Hmm, yeah, but I am not sure if the additional code complexity is justified by the small difference for the less heavy model setups. In the end the difference above was only 7.5s vs. 8.8s for an `update_dependent` on a fully desynced state of 100k rows (thus <20% difference).
But sure feel free to try an ORs vs. UNION case separation, and benchmark it.
(Sidenote, again nagging about the DISTINCT issue: I think there is much more to win by fixing that at some point than by bringing back the ORs here.)
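If you want to experiment with the case separation, a rough sketch of the idea could look like this (the function and the path representation are made up on my side; the resolver stores its dependency data differently):

```python
def can_keep_or(filter_paths):
    """Heuristic for the OR-vs-UNION case separation discussed above:
    if all filter paths share the same join prefix and only differ in the
    final field, the OR stays on one table and its indexes remain usable;
    otherwise a UNION of per-path querysets is the safer choice."""
    prefixes = {tuple(path.split("__")[:-1]) for path in filter_paths}
    return len(prefixes) <= 1

# the test model above: both conditions hit local columns -> keep the OR
print(can_keep_or(["f_id", "ff_id"]))                                  # True
# the LedgerEntry case: conditions live behind different joins -> use UNION
print(can_keep_or(["returntransaction__transaction", "transaction"]))  # False
```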
@jerch
Ok, tomorrow I will add that `self._update(model._base_manager.all(), change, fields)` for all `self._update` invocations, and I will notify you when it is ready to merge.
I will check how good that optimization (the case separation) might be, and if it looks fine, I will create a separate PR for your consideration.
Yupp, sounds good. I am also gonna check if that weird queryset behavior can be fixed in `fast_update` (https://github.com/netzkolchose/django-fast-update/issues/20).
@jerch Rdy to merge
ISSUE IN GENERAL: the method `_querysets_for_update` uses OR to combine multiple statements to fetch the records that need updating by dependencies. This way of selecting prevents the database from using indexes for multi-conditional WHERE statements, which can dramatically reduce performance on anything but small tables. We should avoid ORs.

MY CASE/CASE STUDY [Django 4.2, Postgres 13+]: For each A creation, the library generated a query like that:
All fields used here have btree indexes, but of course Postgres can't use them in this scenario because of the OR statement. So it did a full table scan for each query, which in my case was a seq scan of around 900k+ records.
The endpoint call took around 20+ seconds on average for creating 10 A records. Most of the time was spent on the 30+ queries generated by computed fields, each of which was fetching a single record.
A simple replacement by the union (which I added in this pull request) reduced that time to ~1/2s. The same benefits would come from the subqueries with the `IN` statement scenario (second proposal).

SIMPLE SOLUTION (what I did): replacing ORs with unions allows the database to use indexes.
Disadvantages: it changes the interface of the queryset; we can't use several methods on that queryset afterwards. For example, you can't use a filter (which fast_update does).
BETTER SOLUTION (requires more changes): replace ORs with subqueries, like:

`A.id IN (SELECT A.id FROM model B, SELECT A.id FROM model C, ....)`

It does not have the union solution's disadvantages and has the same performance benefits.
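One way this could be expressed with the ORM, sketched against the LedgerEntry stub shown earlier in this thread (related accessor names are guesses; this is not what the PR currently does):

```python
# one simple queryset per dependency path, combined into a pk subquery;
# the outer queryset stays a plain one, so filter()/distinct() keep working
pks = LedgerEntry.objects.filter(
    returntransaction__transaction=new_transaction
).values("pk").union(
    LedgerEntry.objects.filter(transaction=new_transaction).values("pk")
)
affected = LedgerEntry.objects.filter(pk__in=pks)
```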
ABOUT THE PR: This PR is a proposal; I can help to implement the subquery scenario. Union also feels fine, but for sure it requires more investigation. The union proposal would require adding parametrization to the tests to run them with and without fast updates, which I didn't add (I only changed that flag locally for testing), since I do not know which solution will be chosen for this issue. If the union proposal is accepted, please let me know and I can add that parametrization.
Thanks for the library and the response :)