netzkolchose / django-computedfields

Provides autogenerated autoupdated database fields for model methods.

Many recomputations when setting many to many #145

Open martinlehoux opened 8 months ago

martinlehoux commented 8 months ago

Issue

Resolution

jerch commented 8 months ago

Yeah, in general there is a high chance to miss updates at certain ends with M2M, kinda learned that the hard way when I implemented it initially. M2Ms are esp. tricky if cfs span across them in interdependencies or even link back like this:

A-cf1 :n --> n: B-cf1 :n --> n: A-cf2

which seems awkward at first, but can happen if you have various cf aggregates on model A and B pulling data forth and back across the M2M relation.
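For illustration, a minimal sketch of what such a back-linking setup could look like (hypothetical models A and B mirroring the schematic, not code from this issue):

from django.db import models
from computedfields.models import ComputedFieldsModel, computed

class A(ComputedFieldsModel):
    value = models.IntegerField(default=0)

    # A-cf1: aggregates plain data from B across the M2M
    @computed(field=models.IntegerField(default=0), depends=[("bs", ["value"])])
    def cf1(self) -> int:
        return sum(self.bs.values_list("value", flat=True)) if self.pk else 0

    # A-cf2: aggregates B-cf1, closing the chain A-cf1 --> B-cf1 --> A-cf2
    @computed(field=models.IntegerField(default=0), depends=[("bs", ["cf1"])])
    def cf2(self) -> int:
        return sum(self.bs.values_list("cf1", flat=True)) if self.pk else 0

class B(ComputedFieldsModel):
    value = models.IntegerField(default=0)
    links = models.ManyToManyField(A, related_name="bs")

    # B-cf1: aggregates A-cf1 across the same M2M
    @computed(field=models.IntegerField(default=0), depends=[("links", ["cf1"])])
    def cf1(self) -> int:
        return sum(self.links.values_list("cf1", flat=True)) if self.pk else 0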

~To better grasp what's going on - could you explain what excess db updates you are seeing?~ Sorry, kinda overlooked the test PR...

martinlehoux commented 8 months ago

I understand it is tricky, but I have trouble understanding the exact example you provided.

Do you know if those behaviours that are enforced now are well tested? If so, I could experiment and see if I break anything while trying to fix my test.

jerch commented 8 months ago

I looked into your test PR example, this is what happens under the hood (tested with sqlite):

        with CaptureQueriesContext(connection) as cqc:
            advert_1.tags.set([tag_2])
            for i, q in enumerate(cqc.captured_queries):
                print(i, q)

Let's start with what django does normally w/o any computed fields here:

0 {'sql': 'SELECT "klaus_tag"."id" FROM "klaus_tag" INNER JOIN "klaus_advert_tags" ON ("klaus_tag"."id" = "klaus_advert_tags"."tag_id") WHERE "klaus_advert_tags"."advert_id" = 1', 'time': '0.000'}
1 {'sql': 'SELECT "klaus_advert_tags"."id", "klaus_advert_tags"."advert_id", "klaus_advert_tags"."tag_id" FROM "klaus_advert_tags" WHERE ("klaus_advert_tags"."advert_id" = 1 AND "klaus_advert_tags"."tag_id" IN (1))', 'time': '0.000'}
2 {'sql': 'DELETE FROM "klaus_advert_tags" WHERE "klaus_advert_tags"."id" IN (1)', 'time': '0.000'}
3 {'sql': 'SELECT "klaus_advert_tags"."tag_id" FROM "klaus_advert_tags" WHERE ("klaus_advert_tags"."advert_id" = 1 AND "klaus_advert_tags"."tag_id" IN (2))', 'time': '0.000'}
4 {'sql': 'INSERT OR IGNORE INTO "klaus_advert_tags" ("advert_id", "tag_id") VALUES (1, 2)', 'time': '0.000'}

which translates roughly to this sketch (a reconstruction from the query log, not django's actual source):
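def naive_set(manager, objs):
    # reconstruction sketch - the semantics of manager.set(objs) for a plain M2M
    new_ids = {obj.pk for obj in objs}
    old_ids = set(manager.values_list("pk", flat=True))  # query 0: current tag ids
    manager.remove(*(old_ids - new_ids))                 # queries 1 + 2: lookup + DELETE of stale rows
    manager.add(*(new_ids - old_ids))                    # queries 3 + 4: duplicate check + INSERT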

First takeaway - that's already pretty noisy. Have not thought that through in all details, but to me it seems this could be achieved in 3 queries instead of 5.

Second takeaway, and really important for the computed fields handling below - set is under the hood implemented as an explicit remove, then add on the python side: https://github.com/django/django/blob/d1be05b3e9209fd0787841c71a95819d81061187/django/db/models/fields/related_descriptors.py#L1345-L1346. This means for computed fields that the signals for remove and add come separately. As computed fields tries very hard to keep db state in sync between python calls, set in fact translates to remove --> update_dependent --> add --> update_dependent. Here we go - 2 calls to update_dependent:

# 0 from above
0 {'sql': 'SELECT "klaus_tag"."id" FROM "klaus_tag" INNER JOIN "klaus_advert_tags" ON ("klaus_tag"."id" = "klaus_advert_tags"."tag_id") WHERE "klaus_advert_tags"."advert_id" = 1', 'time': '0.000'}

# cf: pre_update_dependent collects removal ids
1 {'sql': 'SELECT "klaus_advert"."id" FROM "klaus_advert" WHERE "klaus_advert"."id" = 1', 'time': '0.000'}

# 1 & 2 from above
2 {'sql': 'SELECT "klaus_advert_tags"."id", "klaus_advert_tags"."advert_id", "klaus_advert_tags"."tag_id" FROM "klaus_advert_tags" WHERE ("klaus_advert_tags"."advert_id" = 1 AND "klaus_advert_tags"."tag_id" IN (1))', 'time': '0.000'}
3 {'sql': 'DELETE FROM "klaus_advert_tags" WHERE "klaus_advert_tags"."id" IN (1)', 'time': '0.000'}

# cf: resolver kicks in (everything within transaction) - update cfs because of removals
4 {'sql': 'SAVEPOINT "s139785238597632_x5"', 'time': '0.000'}
5 {'sql': 'SELECT DISTINCT "klaus_advert"."id", "klaus_advert"."name", "klaus_advert"."all_tags" FROM "klaus_advert" WHERE "klaus_advert"."id" IN (1)', 'time': '0.000'}
6 {'sql': 'SELECT "klaus_tag"."name" FROM "klaus_tag" INNER JOIN "klaus_advert_tags" ON ("klaus_tag"."id" = "klaus_advert_tags"."tag_id") WHERE "klaus_advert_tags"."advert_id" = 1', 'time': '0.000'}
7 {'sql': 'UPDATE "klaus_advert" SET "all_tags" = CASE WHEN ("klaus_advert"."id" = 1) THEN \'\' ELSE NULL END WHERE "klaus_advert"."id" IN (1)', 'time': '0.000'}
8 {'sql': 'RELEASE SAVEPOINT "s139785238597632_x5"', 'time': '0.000'}

# 3 & 4 from above
9 {'sql': 'SELECT "klaus_advert_tags"."tag_id" FROM "klaus_advert_tags" WHERE ("klaus_advert_tags"."advert_id" = 1 AND "klaus_advert_tags"."tag_id" IN (2))', 'time': '0.000'}
10 {'sql': 'INSERT OR IGNORE INTO "klaus_advert_tags" ("advert_id", "tag_id") VALUES (1, 2)', 'time': '0.000'}

# cf: resolver kicks in (everything within transaction) - update cfs because of additions
11 {'sql': 'SAVEPOINT "s139785238597632_x6"', 'time': '0.000'}
12 {'sql': 'SELECT DISTINCT "klaus_advert"."id", "klaus_advert"."name", "klaus_advert"."all_tags" FROM "klaus_advert" WHERE "klaus_advert"."id" = 1', 'time': '0.000'}
13 {'sql': 'SELECT "klaus_tag"."name" FROM "klaus_tag" INNER JOIN "klaus_advert_tags" ON ("klaus_tag"."id" = "klaus_advert_tags"."tag_id") WHERE "klaus_advert_tags"."advert_id" = 1', 'time': '0.000'}
14 {'sql': 'UPDATE "klaus_advert" SET "all_tags" = CASE WHEN ("klaus_advert"."id" = 1) THEN \'T2\' ELSE NULL END WHERE "klaus_advert"."id" IN (1)', 'time': '0.000'}
15 {'sql': 'RELEASE SAVEPOINT "s139785238597632_x6"', 'time': '0.000'}

Whether there is a smarter way to treat this - idk. At handler/signal level this def. lacks the knowledge about the action being a "forth-and-back" thingy caused by set. Ofc feel free to investigate if this could be spotted somehow.


Edit: Thinking about the superfluous calls during set - this might be an attempt to narrow down work (see your own remark - "I must say that it only occurs when i set to a different list"). Edit2: And well - 5 queries vs. 15 with cf - that's the reason why the docs are full of warnings for M2M with computed fields. This will really explode if you put just one more dependent cf into the setup (that's what my example above was about) :sweat_smile:


But to not leave you out in the rain here - you can speed things up a lot by not using set at all, but reverting to bulk actions instead and triggering the cf update afterwards yourself. If you are interested in such a more explicit but faster way, just drop me a note and we can work that out.

jerch commented 8 months ago

@martinlehoux

Here is a rewrite with bulk actions, where possible:

from django.test import TestCase
from . import models
from django.test.utils import CaptureQueriesContext
from django.db import connection
from computedfields.models import update_dependent

class TestAdvertTags(TestCase):
    def test(self):
        # grab the through model
        Through = models.Advert.tags.through

        models.run_counter = 0
        tag_1 = models.Tag.objects.create(name="T1")
        tag_2 = models.Tag.objects.create(name="T2")
        advert_1 = models.Advert.objects.create(name="A1")
        print('######', advert_1.all_tags)
        advert_2 = models.Advert.objects.create(name="A2")
        assert models.run_counter == 2, f"advert.run_counter={models.run_counter}"

        #advert_1.tags.add(tag_1)
        # add() now done by bulk_action + update_dependent (steps should be put into a transaction)
        Through.objects.bulk_create([Through(advert=advert_1, tag=tag_1)])
        update_dependent(advert_1, update_fields=['all_tags'])

        advert_1.refresh_from_db()
        print('######', advert_1.all_tags)
        assert models.run_counter == 3, f"advert.run_counter={models.run_counter}"

        with CaptureQueriesContext(connection) as cqc:
            #advert_1.tags.set([tag_2])
            # set() now done by bulk actions + update_dependent (steps should be put into a transaction)
            Through.objects.filter(advert=advert_1).delete()
            Through.objects.bulk_create([Through(advert=advert_1, tag=tag_2)])
            update_dependent(advert_1, update_fields=['all_tags'])

            assert models.run_counter == 4, f"advert.run_counter={models.run_counter}"
            for i, q in enumerate(cqc.captured_queries):
                print(i, q)

        advert_1.refresh_from_db()
        print('######', advert_1.all_tags)

This passes your test and does a lot less on the SQL side:

0 {'sql': 'SELECT "klaus_advert_tags"."id", "klaus_advert_tags"."advert_id", "klaus_advert_tags"."tag_id" FROM "klaus_advert_tags" WHERE "klaus_advert_tags"."advert_id" = 1', 'time': '0.000'}
1 {'sql': 'DELETE FROM "klaus_advert_tags" WHERE "klaus_advert_tags"."id" IN (1)', 'time': '0.000'}
2 {'sql': 'INSERT INTO "klaus_advert_tags" ("advert_id", "tag_id") VALUES (1, 2) RETURNING "klaus_advert_tags"."id"', 'time': '0.000'}
3 {'sql': 'SELECT DISTINCT "klaus_advert"."id", "klaus_advert"."name", "klaus_advert"."all_tags" FROM "klaus_advert" WHERE "klaus_advert"."id" IN (1)', 'time': '0.000'}
4 {'sql': 'SELECT "klaus_tag"."name" FROM "klaus_tag" INNER JOIN "klaus_advert_tags" ON ("klaus_tag"."id" = "klaus_advert_tags"."tag_id") WHERE "klaus_advert_tags"."advert_id" = 1', 'time': '0.000'}
5 {'sql': 'UPDATE "klaus_advert" SET "all_tags" = CASE WHEN ("klaus_advert"."id" = 1) THEN \'T2\' ELSE NULL END WHERE "klaus_advert"."id" IN (1)', 'time': '0.000'}

Note that this is highly customized to the actions/data you have provided in the PR test. It currently assumes that the M2M changeset has low intersection with the previous set. If the sets in fact have high intersection, then it should be done differently, by pre-narrowing things on the python side first. Also note that the update_dependent usage here benefits from the additional knowledge that only advert_1.all_tags needs to be updated. That's def. not true for other M2M setups, so it cannot be generalized as it is above. (For completeness - for a proper generalization you'd need to change your M2M to contain an explicit through model, change the depends argument accordingly, and then call update_dependent(queryset_with_new_through_instances) instead; see the sketch below.)
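A rough sketch of that through-model variant, with assumed names (AdvertTag as the explicit through model, imports as in the earlier snippets) - the exact depends path is an assumption here and needs to be checked against the docs:

class AdvertTag(models.Model):
    advert = models.ForeignKey("Advert", on_delete=models.CASCADE)
    tag = models.ForeignKey("Tag", on_delete=models.CASCADE)

class Advert(ComputedFieldsModel):
    name = models.CharField(max_length=32)
    tags = models.ManyToManyField("Tag", through=AdvertTag, related_name="adverts")

    @computed(
        field=models.CharField(max_length=500),
        depends=[("adverttag_set.tag", ["name"])],  # assumed path via the through model
    )
    def all_tags(self) -> str:
        if not self.pk:
            return ""
        return ", ".join(self.tags.values_list("name", flat=True))

# later, instead of advert_1.tags.set([tag_2]):
rows = AdvertTag.objects.bulk_create([AdvertTag(advert=advert_1, tag=tag_2)])
update_dependent(AdvertTag.objects.filter(pk__in=[r.pk for r in rows]))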

Maybe this helps you to get better perf while keeping things mostly unchanged on other ends. And if you still suffer from update pressure even with this change, maybe try COMPUTEDFIELDS_FASTUPDATE = True in settings.py on top.

jerch commented 8 months ago

What would also be possible to mitigate the doubled processing for M2M set() - provide an alternate impl that skips the _m2mchanged signal handler, instead doing the cf updates afterwards, schematically:

def m2m_set(manager, changeset):
    try:
        disable_m2m_handler()                                # flip the passthrough flag
        old = calc_preupdate(pulled_from_manager_changeset)  # ~ the _preremove branch
        manager.set(changeset)                               # plain django set(), handler skipped
        calc_update(pulled_from_manager_changeset, old=old)  # ~ the _postadd branch
    finally:
        enable_m2m_handler()

...
# usage:
# advert_1.tags.set([s1, s2]) now becomes
m2m_set(advert_1.tags, [s1, s2])

disable_m2m_handler and enable_m2m_handler are easy - they could simply flip a thread-local passthrough flag eval'ed by the default handler. calc_preupdate and calc_update are slightly more tricky - they are an amalgamation of the _preremove and _postadd code branches of the default handler. Most innocent looking is the try/finally construct here, but that's actually not that easy to get done right with django's transaction blocks in between - thus it would need some testing.
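A minimal sketch of the flag part, reusing the names from the schematic above (these are not the library's real internals):

import threading

_state = threading.local()

def disable_m2m_handler():
    _state.passthrough = True

def enable_m2m_handler():
    _state.passthrough = False

# the default m2m_changed handler would bail out early while the flag is set
def m2m_handler(sender, instance, action, **kwargs):
    if getattr(_state, "passthrough", False):
        return  # m2m_set drives the cf updates itself
    ...  # normal pre_remove/post_add processing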

jerch commented 8 months ago

@martinlehoux I did some checks for related managers in general. And as I thought, they currently don't work with the default bulk=True argument. Therefore I created #146. If that ticket leads to something usable, M2M might also benefit from it later on.

martinlehoux commented 8 months ago

Ok, I didn't know about bulk=False. I'll try to take a look at all this next week!

martinlehoux commented 8 months ago

After thinking a bit more, I'm not sure the example I set up really shows my issue in a real project. I'll try to upgrade the sample so that it is more accurate.

jerch commented 8 months ago

@martinlehoux Yes, that's likely an issue with my hand-crafted optimization of your shown model/data - those hand-crafted solutions always end up being of limited use for broader adoption.

martinlehoux commented 8 months ago

I don't think it's easy to explain, but I'll give you a more precise example:

# models module of the test app (referenced as `models` in the test below):
from django.db import models
from computedfields.models import ComputedFieldsModel, computed

run_counter = 0

class Tag(ComputedFieldsModel):
    name = models.CharField(max_length=32, unique=True)

class Advert(ComputedFieldsModel):
    name = models.CharField(max_length=32)

    tags = models.ManyToManyField(Tag, related_name="adverts")

    @computed(
        field=models.CharField(max_length=500),
        depends=[("tags", ["name"])],
    )
    def all_tags(self) -> str:
        global run_counter
        run_counter += 1
        if not self.pk:
            return ""
        return ", ".join(self.tags.values_list("name", flat=True))

class Room(ComputedFieldsModel):
    name = models.CharField(max_length=32)
    advert = models.ForeignKey(Advert, related_name="rooms", on_delete=models.CASCADE)

    @computed(
        field=models.BooleanField(),
        depends=[("advert.tags", ["name"])],
    )
    def is_ready(self) -> bool:
        global run_counter
        run_counter += 1
        if not self.pk:
            return False
        return self.advert.tags.filter(name="ready").exists()

# test module, with imports as in the earlier example:
from django.test import TestCase
from . import models

class TestAdvertTags(TestCase):
    def test(self):
        tag_ready = models.Tag.objects.create(name="ready")
        advert_1 = models.Advert.objects.create(name="A1")
        room_11 = models.Room.objects.create(name="R11", advert=advert_1)
        advert_1.tags.add(tag_ready)
        advert_2 = models.Advert.objects.create(name="A2")
        room_21 = models.Room.objects.create(name="R21", advert=advert_2)
        advert_2.tags.add(tag_ready)

In this case, when I call advert_2.tags.add(tag_ready), in m2m_handler, when the action is post_add, I think there are too many things in the other_add map:

Advert: [[Advert 2], {all_tags}]
Room: [[Room 11, Room 21], {is_ready}]

I feel it should be possible to say that only Room 21 is concerned by the .add (and it is correctly computed in data_add).

Is there a use case I am not seeing that would explain this?

martinlehoux commented 7 months ago

Hello @jerch, do you have any guidance to help debug this issue?

jerch commented 7 months ago

@martinlehoux Sorry, I have not yet had time to look further into it. Hopefully I will get back to it next week...

martinlehoux commented 7 months ago

When looking at the graph for my example, I feel something could be optimized.

The link from test_full.tag.adverts to test_full.room.is_ready seems useless: I only added depends=[("advert.tags", ["name"])]. What mandates adding this dependency to test_full.tag.adverts?

(screenshot: dependency graph of the example)

martinlehoux commented 7 months ago

I updated the test case, but don't feel pressured haha

I'd really like to help, but I have trouble understanding where I should look. Maybe I didn't spend enough time

martinlehoux commented 5 months ago

I think I found what bugs me: with this Room model

class Room(ComputedFieldsModel):
    name = models.CharField(max_length=32)
    advert = models.ForeignKey(Advert, related_name="rooms", on_delete=models.CASCADE)

    @computed(
        field=models.BooleanField(),
        depends=[("advert.tags", ["name"])],
    )
    def is_ready(self) -> bool:
        ...  # same body as shown above

the graph for Tag looks like

{
    'name': {
        <class 'test_full.models.Room'>: ({'is_ready'}, {'advert__tags'}),
        <class 'test_full.models.Advert'>: ({'all_tags'}, {'tags'})
    },
    'adverts': {
        <class 'test_full.models.Room'>: ({'is_ready'}, {'advert__tags'}),
        <class 'test_full.models.Advert'>: ({'all_tags'}, {'tags'})
    }
}

I don't understand why is_ready depends on adverts. I think is_ready depends on advert and advert.tags, but not on advert.tags.adverts. I mean, it's technically correct, but it seems redundant, and this seems to be the point where we lose the narrowing capabilities.