netzkolchose / django-computedfields

Provides autogenerated autoupdated database fields for model methods.
MIT License
93 stars 14 forks source link

Performance improvement: _querysets_for_update - generates queries that make database unable to use Indexes in multiconditional where statements. #129

Closed BezBartek closed 1 year ago

BezBartek commented 1 year ago

ISSUE IN GENERAL: method _querysets_for_update uses OR to combine multiple statements to fetch records to update by dependencies. This way of selecting prevents the database from index usage for multiconditional where statements. That can dramatically reduce performance on not small tables. We should avoid ORs.

MY CASE/CASE STUDY [Django 4.2, Postgres 13+]: For each A creation, the library generated a query like that:

SELECT DISTINCT
   ...
FROM "A"
    LEFT OUTER JOIN "B"
        ON ("A"."id" = "B"."a_id")
    LEFT OUTER JOIN "C"
        ON ("A"."id" = C."a_id")
WHERE
     "B"."some_indexed_field_id" = XYZ OR C."a_id" = XYZ

All used fields here have btree indexes, but of course, Postgres can't use them in this scenario, because of the OR statement. So he did a full table scan for each query, which in my case, was a seq scan of around 900k+ records.

The endpoint call took around 20+ seconds on average for 10 A record creation. Most time was spent on 30+queries generated by computed fields, which were fetching a single record each time.

A simple replacement by the union, (that I added in this pull request), reduced that time to ~1/2s. The same benefits would be with subqueries and the in statement scenario (second proposal).

SIMPLE SOLUTION (what I did): Replacing ORs with Unions, allow the database to use indexes

disadvantages: Changes interface of queryset, we can't use multiple methods on that queryset later. for example - you can't use a filter (which fast_update does)

BETTER SOLUTION (require more changes): Replace ORs with Subqueries, like:

A.id in (Select A.id From model B, Select A.id From model C, ....)

It does not have the Union solution disadvantages and has the same performance benefits.

About PR: This PR is a proposal, I can help to implement the subquery scenario. Union also feels fine, but for sure it requires more investigation. The Union proposal would require adding parametrization to tests, to run them with and without fast updates, which I didn't add (I changed that flag for tests only locally), cus I do not know which solution will be chosen for this issue.
If the Union proposal would be accepted, please let me know, and I can add that parametrization.

Thank for the library and response :)

jerch commented 1 year ago

Thx for finding this. Tbh I am not very fond of the current query construction, esp. around the DINSTINCT clause (see #101, where I tried to tackle the issue, but did not get far due to ORM limitations).

Whether the UNION construction or IN filtering is the better approach I am not quite sure yet, on a first glance I'd prefer the IN/subquery approach, as UNION is a quite different operation. Does UNION pass all test cases locally? (The test cases should be pretty complete in terms of what is expressible with the ORM).

If both ways turn out to be equal (not relational algebra sense, but in the updated results), then I'd opt for the faster one. But this needs several more tests with different model setups. Can you describe your concrete model schematics, and maybe the cardinality/selectivity of the filtered fields during the inserts? Were that single inserts per transaction or bulked multiples under one transaction?

BezBartek commented 1 year ago

@jerch Thank you for your response.

"Does UNION pass all test cases locally? (The test cases should be pretty complete in terms of what is expressible with the ORM)." - Yes it does, all tests pass locally (for both scenarios: fast update flag - true/false), also in my project, we switched to fork with a union, cus the issue that we faced was critical. We didn't face any problems so far (from yesterday :D), also we do not use the fast updates flag (it is set to false).

In our case model structure is like this:

class LedgerEntry(ComputedFieldsModel):
  .....

  def some_computed_field_depends_on_childs():
      pass

class Transaction(LedgerEntry):
   .....

class ReturnTransaction(LedgerEntry):
   transaction = models.ForeignKey(B)

We were creating multiple database transactions, and in each transaction, we were creating a Transaction record. That record creation forced the computed fields library, to do calculations, and generate a query that I described in the first message, with OR statements.

jerch commented 1 year ago

Just checked the test cases - multiple deps to a single other model is not really tested beside once for table inheritance :blush:

I did a first test with the setup below (see at the end of this comment). Note that I have not used another model indirection, thus have no joins in the query. Most likely postgres has dropped to seq scan on your joined table, as both filtered fields are indirected from joins.

Could you provide a more complete stub of your model setup? It does not need to contain any revealing names and such, just the cf declarations with model stubs. Also postgres' planner output would be helpful. Do you use table inheritance by any chance? Because w'o table inheritance it is quite tricky to even construct your case (thus it is somewhat an edge case, and sadly table inheritance is known to be very costly with computed fields).

The problem with the UNION approach is again the DISTINCT clause, while postgres supports combining those, the ORM does not. So while you might see a big perf improvement, it might be much worse for a different model setup.

Edit: What issues do you see for fast_update here, beside possible duplications (from missing DISTINCT)?


Simple test w'o joins:

class MulField(models.Model):
    f1 = models.IntegerField()
    f2 = models.IntegerField()

class MulDerived(ComputedFieldsModel):
    f = models.ForeignKey(MulField, on_delete=models.CASCADE)
    ff = models.ForeignKey(MulField, on_delete=models.CASCADE, related_name='f_second')

    @computed(
        models.IntegerField(),
        depends=[('f', ['f1']), ('ff', ['f2'])],
        prefetch_related=('f', 'ff')
    )
    def add(self):
        return self.f.f1 + self.ff.f2

    @computed(
        models.IntegerField(),
        depends=[('f', ['f1']), ('ff', ['f2'])],
        prefetch_related=('f', 'ff')
    )
    def prod(self):
        return self.f.f1 * self.ff.f2

I created 1000 MulField instances linked randomly to f and ff on 100k MulDerived instances (thus selectivity is ~0.01). The test created 20 new MulField instances:

>>> t(lambda: [MulField.objects.create(f1=1,f2=2) for i in range(20)])

with these results:

In postgres I get the following select queries after a single insert:

BezBartek commented 1 year ago

@jerch Yes, we do table inherences, like in the example.

Yes, I agree, unions have performance boosts, only in cases, where there are indexes, and the current approach would generate joints that will prevent Postgres from index usage. But these cases (joins/Or across multiple tables) might be even predicted on path iteration, during query construction. By comparing paths. So we would be able to pick a sufficient solution.

Subqueries might have better performance than unions in regular cases.

In the case of Union, you do not need to explicitly use DISTINCT, union does it out of the box, cus it returns unique records. That's why I added a condition to skip it in case of the union. Are there cases where the library does not add a distinct statement to the query?

The query plan for my case:

Unique  (cost=17981.40..18001.13 rows=142 width=77) (actual time=1163.356..1166.962 rows=1 loops=1)
  ->  Gather Merge  (cost=17981.40..17997.93 rows=142 width=77) (actual time=1163.353..1166.930 rows=1 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Sort  (cost=16981.37..16981.52 rows=59 width=77) (actual time=1160.720..1160.741 rows=0 loops=3)
"              Sort Key: sv_cards_ledgerentry.id"
              Sort Method: quicksort  Memory: 25kB
              Worker 0:  Sort Method: quicksort  Memory: 25kB
              Worker 1:  Sort Method: quicksort  Memory: 25kB
              ->  Parallel Hash Left Join  (cost=4413.85..16979.64 rows=59 width=77) (actual time=919.387..1160.595 rows=0 loops=3)
                    Hash Cond: (sv_cards_ledgerentry.id = t4.ledgerentry_ptr_id)
                    Filter: ((sv_cards_svreturntransaction.transaction_id = 6386) OR (t4.ledgerentry_ptr_id = 6386))
                    Rows Removed by Filter: 204526
                    ->  Hash Left Join  (cost=141.25..12035.93 rows=255658 width=81) (actual time=10.902..760.889 rows=204527 loops=3)
                          Hash Cond: (sv_cards_ledgerentry.id = sv_cards_svreturntransaction.ledgerentry_ptr_id)
                          ->  Parallel Seq Scan on sv_cards_ledgerentry  (cost=0.00..11223.58 rows=255658 width=77) (actual time=0.008..249.329 rows=204527 loops=3)
                          ->  Hash  (cost=86.11..86.11 rows=4411 width=8) (actual time=10.855..10.859 rows=4411 loops=3)
                                Buckets: 8192  Batches: 1  Memory Usage: 237kB
                                ->  Seq Scan on sv_cards_svreturntransaction  (cost=0.00..86.11 rows=4411 width=8) (actual time=0.016..5.424 rows=4411 loops=3)
                    ->  Parallel Hash  (cost=2938.38..2938.38 rows=106738 width=4) (actual time=150.710..150.714 rows=60452 loops=3)
                          Buckets: 262144  Batches: 1  Memory Usage: 9216kB
                          ->  Parallel Seq Scan on sv_cards_svtransaction t4  (cost=0.00..2938.38 rows=106738 width=4) (actual time=0.008..74.534 rows=60452 loops=3)
Planning Time: 0.453 ms
Execution Time: 1167.070 ms

Switch to a union, allow the database to use indexes, and dramatically reduced execution time from this plan

jerch commented 1 year ago

In the case of Union, you do not need to explicitly use DISTINCT, union does it out of the box, cus it returns unique records.

Yeah thats clear, I wasnt sure if the ORM treats this case properly:

>>> print(MA.objects.none().union(MA.objects.all()).query)
SELECT "exampleapp_ma"."id", "exampleapp_ma"."f1", "exampleapp_ma"."f2" FROM "exampleapp_ma"

(because thats how the field deps get added). Well, it simply reverts to a normal select, thus may still contain duplicates - but that would get "patched" afterwards in the code by your queryset.query.combinator != "union" condition.

Since the change with UNION is much less invasive, I'd lean towards that atm.

Remains the question - which issues do you see for fast_update here? Because that operates only on changed rows collected on python side, I dont see there any issues actually?

(On a sidenote: You'll get a really big perf boost using fast_update for tables >100k, esp. for bigger bulk changes and calling update_dependent yourself.)

BezBartek commented 1 year ago

@jerch If you remove my condition if self.use_fastupdate: that does not apply unions in the case of fast update, two tests would fail. It will happen cus a fast update does bulk_update and in the bulk update, django performs filtering on a queryset. And in Django ORM you can't do filtering on unions. Mentioned tests are:

in _test_full.tests.testmultitable.TestMultiTable With fast_update disabled, it seems that there are no bulk updates, so there is no filtering over union queryset, so everything passes

jerch commented 1 year ago

Well thats easy to fix (although slightly more expensive):

@@ -553,7 +555,7 @@ class Resolver:
                     self._update(queryset, change, fields)
                     change = []
             if change:
-                self._update(queryset, change, fields)
+                self._update(model._base_manager.filter(pk__in=pks), change, fields)

since bulk_updater has to collect the pks anyway, we can escape that issue by a new query.

With fast_update disabled, it seems that there are no bulk updates, so there is no filtering over union queryset, so everything passes

Thats actually pretty weird, all that fast_update does, is delegating nonlocal concrete fields to bulk_update (happens with inherited tables), while local concrete fields are updated without any queryset trickery by an UPDATE ... FROM VALUES .... With fast_update being disabled, it would just use bulk_update for all fields, thus should run into the same issue. Hmm there is something off here, it seems. (Edit: Nope, its not weird. This is because bulk_update is called on the manager, which implicitly refers to .all() as queryset.)

jerch commented 1 year ago

Btw it should be safe to just use self._update(model._base_manager.all(), change, fields) for all self._update invocations. Wanna do that change? Then I think the PR can be added.

coveralls commented 1 year ago

Pull Request Test Coverage Report for Build 5421498965


Changes Missing Coverage Covered Lines Changed/Added Lines %
computedfields/resolver.py 4 5 80.0%
<!-- Total: 4 5 80.0% -->
Totals Coverage Status
Change from base Build 4690356054: 0.01%
Covered Lines: 1108
Relevant Lines: 1154

💛 - Coveralls
BezBartek commented 1 year ago

Yea, sure, thank you, but what with detection? I think that not so complicated code might be able to use current OR's in particular multiconditional statements. Like, when only difference is in paths ends, it would mean that they perform same joins, just uses different fields from same table, like in your example. Otherwise: union By that, we would reduce union impact in simple cases. What do you think? For sure, it would be additional complication in code, but maybe worth it.

jerch commented 1 year ago

Hmm, yeah, but I am not sure, if the additional code complexity is justified by the small difference for the less heavy model setups. In the end the difference above was only 7.5s vs. 8.8s for an update_dependent on full desync state on 100k rows (thus <20% difference).

But sure feel free to try an ORs vs. UNION case separation, and benchmark it.

(Sidenote - again nagging about the DISTINCT issue - I think there is much more to win fixing that at some point, than bringing back the ORs here.)

BezBartek commented 1 year ago

@jerch Ok, tomorrow I will add that _self._update(model._base_manager.all(), change, fields) for all self._update_ and I will notify you that it is ready to merge. I will check how good that optimization (cases separation) might be, and if it would look fine, then I will create a separate PR for your consideration.

jerch commented 1 year ago

Yupp, sounds good. I also gonna check, if that weird queryset behavior can be fixed on fast_update (https://github.com/netzkolchose/django-fast-update/issues/20).

BezBartek commented 1 year ago

@jerch Rdy to merge