martinlehoux opened 8 months ago
Yeah, in general there is a high chance to miss updates at certain ends with M2M - kinda learned that the hard way when I implemented it initially. M2Ms are especially tricky if cfs span across them in interdependencies, or even link back like this:

```
A-cf1 :n --> n: B-cf1 :n --> n: A-cf2
```

which seems awkward at first, but can happen if you have various cf aggregates on model `A` and `B` pulling data forth and back across the M2M relation.
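For illustration, a minimal sketch of such a back-linking setup could look like the following (all model/field names here are invented for the example, they are not from the PR):

```python
from django.db import models
from computedfields.models import ComputedFieldsModel, computed

class ModelA(ComputedFieldsModel):
    name = models.CharField(max_length=32)

    # A-cf1: local cf, only depends on own concrete fields
    @computed(field=models.IntegerField(default=0), depends=[("self", ["name"])])
    def cf1(self) -> int:
        return len(self.name)

    # A-cf2: pulls B-cf1 back across the M2M - the "link back"
    @computed(field=models.IntegerField(default=0), depends=[("model_bs", ["cf1"])])
    def cf2(self) -> int:
        if not self.pk:
            return 0
        return sum(self.model_bs.values_list("cf1", flat=True))

class ModelB(ComputedFieldsModel):
    model_as = models.ManyToManyField(ModelA, related_name="model_bs")

    # B-cf1: aggregates A-cf1 across the M2M
    @computed(field=models.IntegerField(default=0), depends=[("model_as", ["cf1"])])
    def cf1(self) -> int:
        if not self.pk:
            return 0
        return sum(self.model_as.values_list("cf1", flat=True))
```

Every M2M change now has to ripple through `B.cf1` and back into `A.cf2`, which is exactly where missed or excess updates tend to show up.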
~To better grasp what's going on - could you explain what excess db updates you are seeing?~ Sorry, kinda overlooked the test PR...
I understand it is tricky, but I have trouble understanding the exact example you provided.
Do you know if those behaviours that are enforced now are well tested? If so, I could experiment and see if I break any while trying to fix my test.
I looked into your test PR example; this is what happens under the hood (tested with SQLite):
```python
with CaptureQueriesContext(connection) as cqc:
    advert_1.tags.set([tag_2])
for i, q in enumerate(cqc.captured_queries):
    print(i, q)
```
Let's start with what Django does normally without any computed fields here:
```
0 {'sql': 'SELECT "klaus_tag"."id" FROM "klaus_tag" INNER JOIN "klaus_advert_tags" ON ("klaus_tag"."id" = "klaus_advert_tags"."tag_id") WHERE "klaus_advert_tags"."advert_id" = 1', 'time': '0.000'}
1 {'sql': 'SELECT "klaus_advert_tags"."id", "klaus_advert_tags"."advert_id", "klaus_advert_tags"."tag_id" FROM "klaus_advert_tags" WHERE ("klaus_advert_tags"."advert_id" = 1 AND "klaus_advert_tags"."tag_id" IN (1))', 'time': '0.000'}
2 {'sql': 'DELETE FROM "klaus_advert_tags" WHERE "klaus_advert_tags"."id" IN (1)', 'time': '0.000'}
3 {'sql': 'SELECT "klaus_advert_tags"."tag_id" FROM "klaus_advert_tags" WHERE ("klaus_advert_tags"."advert_id" = 1 AND "klaus_advert_tags"."tag_id" IN (2))', 'time': '0.000'}
4 {'sql': 'INSERT OR IGNORE INTO "klaus_advert_tags" ("advert_id", "tag_id") VALUES (1, 2)', 'time': '0.000'}
```
which translates roughly to this:

0. SELECT the `tag_id`s which are currently attached to `advert_id = 1` via a join on the through table (I am currently not sure why this is needed at all, at first glance it looks superfluous to me)
1. SELECT the through-table rows matching `advert_id` and the set of `tag_id`s to be removed (again, shouldn't this be enough in the first place?)
2. DELETE those through-table rows
3. SELECT the `tag_id`s on `advert_id = 1` matching the new tag ids to be added in the next step (idk - maybe this is a safety measure in case the db does not handle the `INSERT OR IGNORE` well)
4. INSERT the new `tag_id`s

First takeaway - that's already pretty noisy. I have not thought that through in all details, but to me it seems this could be achieved in 3 queries instead of 5.
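For instance, a condensed variant could look roughly like this (my own hedged sketch against the plain through table, not Django's actual implementation):

```python
def three_query_set(Through, advert_id, new_tag_ids):
    # 1) drop links that are not in the new set
    #    (a plain through model without cascades should fast-delete in one query)
    Through.objects.filter(advert_id=advert_id).exclude(tag_id__in=new_tag_ids).delete()
    # 2) find which of the new ids are already linked
    existing = set(
        Through.objects
        .filter(advert_id=advert_id, tag_id__in=new_tag_ids)
        .values_list("tag_id", flat=True)
    )
    # 3) insert only the missing links in one bulk query
    Through.objects.bulk_create(
        [Through(advert_id=advert_id, tag_id=t) for t in new_tag_ids if t not in existing]
    )
```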
Second takeaway, and really important for the computed fields handling below: `set` is under the hood implemented as an explicit `remove`, then `add` on the Python side: https://github.com/django/django/blob/d1be05b3e9209fd0787841c71a95819d81061187/django/db/models/fields/related_descriptors.py#L1345-L1346. This means for computed fields that the signals for remove and add come separately. As computed fields tries very hard to keep db state in sync between Python calls, `set` in fact translates to `remove --> update_dependent --> add --> update_dependent`. Here we go - 2 calls to `update_dependent`:
```
# 0 from above
0 {'sql': 'SELECT "klaus_tag"."id" FROM "klaus_tag" INNER JOIN "klaus_advert_tags" ON ("klaus_tag"."id" = "klaus_advert_tags"."tag_id") WHERE "klaus_advert_tags"."advert_id" = 1', 'time': '0.000'}
# cf: pre_update_dependent collects removal ids
1 {'sql': 'SELECT "klaus_advert"."id" FROM "klaus_advert" WHERE "klaus_advert"."id" = 1', 'time': '0.000'}
# 1 & 2 from above
2 {'sql': 'SELECT "klaus_advert_tags"."id", "klaus_advert_tags"."advert_id", "klaus_advert_tags"."tag_id" FROM "klaus_advert_tags" WHERE ("klaus_advert_tags"."advert_id" = 1 AND "klaus_advert_tags"."tag_id" IN (1))', 'time': '0.000'}
3 {'sql': 'DELETE FROM "klaus_advert_tags" WHERE "klaus_advert_tags"."id" IN (1)', 'time': '0.000'}
# cf: resolver kicks in (everything within transaction) - update cfs because of removals
4 {'sql': 'SAVEPOINT "s139785238597632_x5"', 'time': '0.000'}
5 {'sql': 'SELECT DISTINCT "klaus_advert"."id", "klaus_advert"."name", "klaus_advert"."all_tags" FROM "klaus_advert" WHERE "klaus_advert"."id" IN (1)', 'time': '0.000'}
6 {'sql': 'SELECT "klaus_tag"."name" FROM "klaus_tag" INNER JOIN "klaus_advert_tags" ON ("klaus_tag"."id" = "klaus_advert_tags"."tag_id") WHERE "klaus_advert_tags"."advert_id" = 1', 'time': '0.000'}
7 {'sql': 'UPDATE "klaus_advert" SET "all_tags" = CASE WHEN ("klaus_advert"."id" = 1) THEN \'\' ELSE NULL END WHERE "klaus_advert"."id" IN (1)', 'time': '0.000'}
8 {'sql': 'RELEASE SAVEPOINT "s139785238597632_x5"', 'time': '0.000'}
# 3 & 4 from above
9 {'sql': 'SELECT "klaus_advert_tags"."tag_id" FROM "klaus_advert_tags" WHERE ("klaus_advert_tags"."advert_id" = 1 AND "klaus_advert_tags"."tag_id" IN (2))', 'time': '0.000'}
10 {'sql': 'INSERT OR IGNORE INTO "klaus_advert_tags" ("advert_id", "tag_id") VALUES (1, 2)', 'time': '0.000'}
# cf: resolver kicks in (everything within transaction) - update cfs because of additions
11 {'sql': 'SAVEPOINT "s139785238597632_x6"', 'time': '0.000'}
12 {'sql': 'SELECT DISTINCT "klaus_advert"."id", "klaus_advert"."name", "klaus_advert"."all_tags" FROM "klaus_advert" WHERE "klaus_advert"."id" = 1', 'time': '0.000'}
13 {'sql': 'SELECT "klaus_tag"."name" FROM "klaus_tag" INNER JOIN "klaus_advert_tags" ON ("klaus_tag"."id" = "klaus_advert_tags"."tag_id") WHERE "klaus_advert_tags"."advert_id" = 1', 'time': '0.000'}
14 {'sql': 'UPDATE "klaus_advert" SET "all_tags" = CASE WHEN ("klaus_advert"."id" = 1) THEN \'T2\' ELSE NULL END WHERE "klaus_advert"."id" IN (1)', 'time': '0.000'}
15 {'sql': 'RELEASE SAVEPOINT "s139785238597632_x6"', 'time': '0.000'}
```
Whether there is a smarter way to treat this - idk. At handler/signal level this def. lacks the knowledge about the action being a "forth-and-back" thingy caused by `set`. Ofc feel free to investigate whether this could be spotted somehow.
Edit: Thinking about the superfluous calls during `set` - this might be an attempt to narrow down work (see your own remark - "I must say that it only occurs when i set to a different list").
Edit2: And well - 5 queries vs. 15 with cf - that's the reason why the docs are full of warnings for M2M with computed fields. This will really explode if you put just one more dependent cf into the setup (that's what my example above was about) :sweat_smile:
But not to leave you standing in the rain here - you can speed things up a lot by not using `set` at all, and reverting to bulk actions instead, triggering the cf update afterwards yourself. If you are interested in such a more explicit but faster way, just drop me a note and we'll work that out.
@martinlehoux
Here is a rewrite with bulk actions, where possible:
```python
from django.test import TestCase
from django.test.utils import CaptureQueriesContext
from django.db import connection
from computedfields.models import update_dependent
from . import models

class TestAdvertTags(TestCase):
    def test(self):
        # grab the through model
        Through = models.Advert.tags.through

        models.run_counter = 0
        tag_1 = models.Tag.objects.create(name="T1")
        tag_2 = models.Tag.objects.create(name="T2")
        advert_1 = models.Advert.objects.create(name="A1")
        print('######', advert_1.all_tags)
        advert_2 = models.Advert.objects.create(name="A2")
        assert models.run_counter == 2, f"advert.run_counter={models.run_counter}"

        #advert_1.tags.add(tag_1)
        # add() now done by bulk_create + update_dependent (steps should be put into a transaction)
        Through.objects.bulk_create([Through(advert=advert_1, tag=tag_1)])
        update_dependent(advert_1, update_fields=['all_tags'])

        advert_1.refresh_from_db()
        print('######', advert_1.all_tags)
        assert models.run_counter == 3, f"advert.run_counter={models.run_counter}"

        with CaptureQueriesContext(connection) as cqc:
            #advert_1.tags.set([tag_2])
            # set() now done by bulk actions + update_dependent (steps should be put into a transaction)
            Through.objects.filter(advert=advert_1).delete()
            Through.objects.bulk_create([Through(advert=advert_1, tag=tag_2)])
            update_dependent(advert_1, update_fields=['all_tags'])
        assert models.run_counter == 4, f"advert.run_counter={models.run_counter}"
        for i, q in enumerate(cqc.captured_queries):
            print(i, q)

        advert_1.refresh_from_db()
        print('######', advert_1.all_tags)
```
This passes your test and does a lot less on SQL:
```
0 {'sql': 'SELECT "klaus_advert_tags"."id", "klaus_advert_tags"."advert_id", "klaus_advert_tags"."tag_id" FROM "klaus_advert_tags" WHERE "klaus_advert_tags"."advert_id" = 1', 'time': '0.000'}
1 {'sql': 'DELETE FROM "klaus_advert_tags" WHERE "klaus_advert_tags"."id" IN (1)', 'time': '0.000'}
2 {'sql': 'INSERT INTO "klaus_advert_tags" ("advert_id", "tag_id") VALUES (1, 2) RETURNING "klaus_advert_tags"."id"', 'time': '0.000'}
3 {'sql': 'SELECT DISTINCT "klaus_advert"."id", "klaus_advert"."name", "klaus_advert"."all_tags" FROM "klaus_advert" WHERE "klaus_advert"."id" IN (1)', 'time': '0.000'}
4 {'sql': 'SELECT "klaus_tag"."name" FROM "klaus_tag" INNER JOIN "klaus_advert_tags" ON ("klaus_tag"."id" = "klaus_advert_tags"."tag_id") WHERE "klaus_advert_tags"."advert_id" = 1', 'time': '0.000'}
5 {'sql': 'UPDATE "klaus_advert" SET "all_tags" = CASE WHEN ("klaus_advert"."id" = 1) THEN \'T2\' ELSE NULL END WHERE "klaus_advert"."id" IN (1)', 'time': '0.000'}
```
Note that this is highly customized to the actions/data you have provided in the PR test. It currently assumes that the M2M changeset has low intersection with the previous set. If the sets in fact have high intersection, then it should be done differently by pre-narrowing things in Python first. Also note that the `update_dependent` usage here benefits from the additional knowledge that only `advert_1.all_tags` needs to be updated. That's definitely not true for other M2M setups, so it cannot be generalized yet as written above. (For completeness - for a proper generalization you'd need to change your M2M to contain an explicit through model, change the depends argument accordingly, and then call `update_dependent(queryset_with_new_through_instances)` instead; see the sketch below.)
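Schematically, that generalization might look like this (a hedged sketch; the exact `depends` path hinges on the related names, so double-check it against the docs):

```python
from django.db import models
from computedfields.models import ComputedFieldsModel, computed, update_dependent

class AdvertTag(models.Model):
    # explicit through model, so through instances become addressable
    advert = models.ForeignKey("Advert", on_delete=models.CASCADE)
    tag = models.ForeignKey("Tag", on_delete=models.CASCADE)

class Advert(ComputedFieldsModel):
    name = models.CharField(max_length=32)
    tags = models.ManyToManyField("Tag", through=AdvertTag, related_name="adverts")

    @computed(
        field=models.CharField(max_length=500),
        # dependency now routed through the explicit through model
        depends=[("adverttag_set.tag", ["name"])],
    )
    def all_tags(self) -> str:
        if not self.pk:
            return ""
        return ", ".join(self.tags.values_list("name", flat=True))

# later, instead of tags.set()/add() - create through instances in bulk
# and hand the affected queryset to update_dependent:
AdvertTag.objects.bulk_create([AdvertTag(advert=advert_1, tag=tag_2)])
update_dependent(AdvertTag.objects.filter(advert=advert_1))
```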
Maybe this helps you to get better perf while keeping things mostly unchanged at other ends. And if you still suffer from update pressure even with this change, maybe try `COMPUTEDFIELDS_FASTUPDATE = True` in settings.py on top.
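And as the inline comments in the rewrite above already hint, the delete/bulk_create/update_dependent steps belong in one transaction, e.g. (same names as in the test above):

```python
from django.db import transaction

# keep the through-table changes and the cf recompute atomic
with transaction.atomic():
    Through.objects.filter(advert=advert_1).delete()
    Through.objects.bulk_create([Through(advert=advert_1, tag=tag_2)])
    update_dependent(advert_1, update_fields=['all_tags'])
```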
What also would be possible to mitigate the doubled processing for M2M `set()` - provide an alternate impl that skips the `m2m_changed` signal handler, instead doing the cf updates afterwards, schematically:
```python
def m2m_set(manager, changeset):
    try:
        disable_m2m_handler()
        old = calc_preupdate(pulled_from_manager_changeset)
        manager.set(changeset)
        calc_update(pulled_manager_changeset, old=old)
    finally:
        enable_m2m_handler()

...

# usage:
# advert_1.tags.set([s1, s2]) now becomes
m2m_set(advert_1.tags, [s1, s2])
```
`disable_m2m_handler` and `enable_m2m_handler` are easy - they could simply flip a thread-local passthrough flag eval'ed by the default handler.
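A minimal sketch of that flag, with hypothetical names (this is not the library's actual internals):

```python
import threading

_state = threading.local()

def disable_m2m_handler():
    _state.passthrough = True

def enable_m2m_handler():
    _state.passthrough = False

def m2m_handler(sender, instance, action, **kwargs):
    # bail out early while m2m_set drives the cf updates itself
    if getattr(_state, "passthrough", False):
        return
    ...  # normal pre_remove / post_add processing
```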
`calc_preupdate` and `calc_update` are slightly more tricky - they are an amalgamation of the `_preremove` and `_postadd` code branches of the default handler. Most innocent looking is the `try/finally` construct here, but that's actually not that easy to get right with django's transaction blocks in between - thus it would need some testing.
@martinlehoux I did some checks for related managers in general. And as I thought, they currently don't work with the default `bulk=True` argument. Therefore I created #146. If that ticket leads to something usable, M2M also might benefit from it later on.
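For context (a hedged note): `bulk` is a stock Django argument on reverse FK related managers, not on M2M managers. With `bulk=False` Django assigns the FK and calls `save()` per object, so `post_save` fires for each instance and signal-based handlers get a chance to react:

```python
# hypothetical reverse-FK setup: parent has related_name="children".
# bulk=True (default) issues one queryset UPDATE and fires no save signals;
# bulk=False saves each object individually, so post_save fires per child
parent.children.add(child_a, child_b, bulk=False)
```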
Ok, I didn't know about `bulk=False`. I'll try to take a look at all this next week!
After thinking a bit more, I'm not sure the example I set up really shows my issue in a real project. I'll try to upgrade the sample so that it is more accurate.
@martinlehoux Yes, that's likely an issue with my hand-crafted optimization of your shown model/data - those hand-crafted solutions always end up being of limited use for broader adoption.
I don't think it's easy to explain, but I'll give you a more precise example:
```python
class Tag(ComputedFieldsModel):
    name = models.CharField(max_length=32, unique=True)


class Advert(ComputedFieldsModel):
    name = models.CharField(max_length=32)
    tags = models.ManyToManyField(Tag, related_name="adverts")

    @computed(
        field=models.CharField(max_length=500),
        depends=[("tags", ["name"])],
    )
    def all_tags(self) -> str:
        global run_counter
        run_counter += 1
        if not self.pk:
            return ""
        return ", ".join(self.tags.values_list("name", flat=True))


class Room(ComputedFieldsModel):
    name = models.CharField(max_length=32)
    advert = models.ForeignKey(Advert, related_name="rooms", on_delete=models.CASCADE)

    @computed(
        field=models.BooleanField(),
        depends=[("advert.tags", ["name"])],
    )
    def is_ready(self) -> bool:
        global run_counter
        run_counter += 1
        if not self.pk:
            return False
        return self.advert.tags.filter(name="ready").exists()
```
```python
class TestAdvertTags(TestCase):
    def test(self):
        tag_ready = models.Tag.objects.create(name="ready")
        advert_1 = models.Advert.objects.create(name="A1")
        room_11 = models.Room.objects.create(name="R11", advert=advert_1)
        advert_1.tags.add(tag_ready)

        advert_2 = models.Advert.objects.create(name="A2")
        room_21 = models.Room.objects.create(name="R21", advert=advert_2)
        advert_2.tags.add(tag_ready)
```
In this case, when I `advert_2.tags.add(tag_ready)`, in `m2m_handler`, when the action is `post_add`, I think there are too many things in the `other_add` map:

```
Advert: [[Advert 2], {all_tags}]
Room: [[Room 11, Room 21], {is_ready}]
```

I feel it should be possible to say that only Room 21 is concerned by the `.add` (and it is correctly computed in `data_add`). Is there a use case I am not seeing that would explain this?
Hello @jerch, do you have any guidance to help debug this issue?
@martinlehoux Sorry, I have not yet had time to look further into it. Hopefully I will get back to it next week...
When looking at the graph for my example, I feel something could be optimized. The link from `test_full.tag.adverts` to `test_full.room.is_ready` seems useless: I only added `depends=[("advert.tags", ["name"])]`. What mandates adding this dependency to `test_full.tag.adverts`?
I updated the test case, but don't feel pressured haha
I'd really like to help, but I have trouble understanding where I should look. Maybe I didn't spend enough time.
I think I found what bugs me: with this `Room` model:
```python
class Room(ComputedFieldsModel):
    name = models.CharField(max_length=32)
    advert = models.ForeignKey(Advert, related_name="rooms", on_delete=models.CASCADE)

    @computed(
        field=models.BooleanField(),
        depends=[("advert.tags", ["name"])],
    )
    def is_ready(self) -> bool:
        ...
```
the graph for `Tag` looks like:
```
{
    'name': {
        <class 'test_full.models.Room'>: ({'is_ready'}, {'advert__tags'}),
        <class 'test_full.models.Advert'>: ({'all_tags'}, {'tags'})
    },
    'adverts': {
        <class 'test_full.models.Room'>: ({'is_ready'}, {'advert__tags'}),
        <class 'test_full.models.Advert'>: ({'all_tags'}, {'tags'})
    }
}
```
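For anyone wanting to reproduce such a dump: I assume it was taken from the resolver's internal dependency map; a debugging-only sketch (these are undocumented internals, so the exact import and attribute layout may differ between versions):

```python
from pprint import pprint
from computedfields.resolver import active_resolver  # internal, may change

# the shown dict would then be the entry for the Tag model, i.e.
# {field/relation: {dependent_model: (computed_fields, filter_paths)}}
pprint(active_resolver._map)
```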
I don't understand why `is_ready` depends on `adverts`. I think `is_ready` depends on `advert` and `advert.tags`, but not on `advert.tags.adverts`. I mean, it's technically correct, but it seems redundant, and this seems to be the point where we lose the narrowing capabilities.
Issue

- When I `tags.set(...)` (actually, it's a `.set` triggered by an admin form), I see my computed field recomputed for all my adverts
- The `post_remove` handler is the one triggering all these updates

Resolution