netzkolchose / django-computedfields

Provides autogenerated autoupdated database fields for model methods.
MIT License

Many to many dependency triggers too many computations #130

Closed martinlehoux closed 1 year ago

martinlehoux commented 1 year ago

Hi there!

I'm not sure if this is a bug or an expected behaviour, but I needed to report it.

Here is the simple example

class HostingAdvertTypeTag(Model):
    objects: HostingAdvertTypeTagManager = HostingAdvertTypeTagManager()

    name = models.CharField(max_length=100, unique=True)

class HostingAdvert(ComputedFieldsModel):
    hosting_advert_type_tags = models.ManyToManyField(
        HostingAdvertTypeTag, blank=True, related_name="hosting_adverts"
    )

    @computed(
        ArrayField(models.CharField(null=False, blank=True, max_length=100), default=list, blank=True),
        depends=[
            ["hosting_advert_type_tags", ["name"]],
        ],
        prefetch_related=["hosting_advert_type_tags"],
    )
    def all_hosting_advert_type_tag_names(self) -> list[str]:
        ...

The issue is that when I use

hosting_advert.hosting_advert_type_tags.set(HostingAdvertTypeTag.objects.filter(name__contains="POOL"))

there is a computation happening for all hosting adverts in my database, not only this one (as I would expect).

It seems that the dependency graph thinks this is the correct behaviour.

What do you think? Is there something I didn't get about many to many fields?

The issue occurs with version 0.2.3, and also with version 0.2.1.

jerch commented 1 year ago

there is a computation happening for all hosting adverts in my database, not only this one (as I would expect)

Tried to reproduce this with a small example resembling your schema:

class HaTag(models.Model):
    name = models.CharField(max_length=100, unique=True)

class Ha(ComputedFieldsModel):
    tags = models.ManyToManyField(HaTag, blank=True, related_name="ha_s")

    @computed(
        ArrayField(models.CharField(null=False, blank=True, max_length=100), default=list, blank=True),
        depends=[("tags", ["name"])],
    )
    def all_tags(self):
        return [] if not self.pk else list(self.tags.all().values_list('name', flat=True))

and get the query logs for h_instance.tags.set(HaTag.objects.filter(name__contains='a')) (captured with CaptureQueriesContext and printed with json.dumps), which both look correct to me. Note that for a non-empty m2m field the workload is much higher, which is due to the additional pre_remove + post_remove signals, before Django issues the actual post_add to get the right value in place. But I cannot do much about this, it is how m2m fields are abstracted under the hood in Django.
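The remove+add decomposition can be sketched without Django at all. The following is a simplified illustration (not Django's actual implementation) of why a `.set()` on a non-empty m2m field fires more signals than on an empty one:

```python
def m2m_set_plan(current, target):
    """Sketch of the remove+add decomposition that .set() performs.

    Not Django's real code: each batch fires its own pre_/post_
    m2m_changed signal, hence the extra workload jerch describes
    for a non-empty field.
    """
    to_remove = current - target  # -> pre_remove / post_remove
    to_add = target - current     # -> pre_add / post_add
    return to_remove, to_add

# Non-empty field: both a remove and an add batch are issued.
print(m2m_set_plan({1, 2, 3}, {2, 3, 4}))  # ({1}, {4})
# Empty field: only the add batch, so far fewer signals fire.
print(m2m_set_plan(set(), {2, 3}))         # (set(), {2, 3})
```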

I definitely don't see updates for neighboring rows on Ha. This of course can happen if you have additional computed fields defined somewhere that rely on the result of all_tags.

Can you share the query log with the superfluous updates?


Edit: Could it be that your array value calculation is not stable (not re-entrant) for the same tags on the m2m field? That would lead to an update from a [1,2,3] vs. [2,1,3] inequality (this can be fixed by ordering the entries).
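The re-entrancy point is easy to demonstrate outside Django: if the computation returns the same tag names in a different order, the old and new values compare unequal and an update is written even though nothing changed. Sorting makes the result stable (a generic sketch, not this repo's code):

```python
def compute_unordered(tag_names):
    # Order depends on queryset iteration order -> not re-entrant.
    return list(tag_names)

def compute_sorted(tag_names):
    # Sorting yields the same value for the same set of tags.
    return sorted(tag_names)

# Same tag set, delivered in a different order on a second run:
run1, run2 = ["tag2", "tag1"], ["tag1", "tag2"]
print(compute_unordered(run1) == compute_unordered(run2))  # False -> spurious UPDATE
print(compute_sorted(run1) == compute_sorted(run2))        # True  -> no UPDATE
```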

martinlehoux commented 1 year ago

Hello @jerch !

Thanks a lot for the fast feedback. I created an example repo to show you the issue https://github.com/martinlehoux/test_computedfields

After installing deps & migrating, you can run ./manage.py < test.py to see how the script behaves.

Here is what I get:

Advert object (9) setup
self=<Advert: Advert object (9)>, all_tags='tag1, tag2'
Advert object (10) setup
self=<Advert: Advert object (9)>, all_tags='tag1, tag2'
self=<Advert: Advert object (10)>, all_tags='tag1, tag2'
Advert object (9) update
self=<Advert: Advert object (9)>, all_tags='tag1'
self=<Advert: Advert object (10)>, all_tags='tag1, tag2'

And this is the behaviour I don't understand: why would Advert object (9)'s computed field be recomputed when Advert object (10) gets new tags?

Thanks again!

jerch commented 1 year ago

Thanks for the example, I see now what you meant. It is indeed the case that the m2m signal handler loses information when passing things over to resolver._querysets_for_update, which leads to a much broader selection of rows for update than needed.

Will see if I can narrow the selection further down without breaking too much in _querysets_for_update.
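Abstractly, the narrowing amounts to using the primary key the m2m signal actually fired for, rather than every row carrying the dependency. A toy illustration with hypothetical names (not _querysets_for_update's real signature):

```python
def select_for_update(rows_with_dependency, signal_instance_pk):
    """Hypothetical sketch of narrowing the update selection.

    broad:  recompute every row that depends on the m2m relation
            (the reported behaviour).
    narrow: recompute only the instance the m2m_changed signal
            fired for (the intended behaviour).
    """
    broad = set(rows_with_dependency)
    narrow = broad & {signal_instance_pk}
    return broad, narrow

# Adverts 9 and 10 both depend on their tags; only 10 was changed.
print(select_for_update([9, 10], 10))  # ({9, 10}, {10})
```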

martinlehoux commented 1 year ago

Nice to hear! If you don't have time to do this, let me know and I will see if I can help with a fix.

jerch commented 1 year ago

Created #131, if you want to test it.

martinlehoux commented 1 year ago

Yep, it fixes it! Looks good :)