jazzband / django-taggit

Simple tagging for django
https://django-taggit.readthedocs.io
BSD 3-Clause "New" or "Revised" License
3.27k stars 623 forks source link

Have tags default to being in creation order #871

Closed steverecio closed 1 week ago

steverecio commented 8 months ago

Fixes #870 to ensure that tag order is preserved by default

steverecio commented 8 months ago

So as of now the primary tests that are failing are the assertNumQueries and here which seems to occur because the ordering query causes duplicates but I tried adding distinct() everywhere I could but that doesn't seem to do it. I think its somewhere in the caching logic but can't quite figure it out.

@rtpg Are we able to modify the tests to change the num_queries from 32 to 34 and 10 to 11?

We might be able to get the num queries back down by first iterating through the tags arg in _to_tag_model_instances to parse out the str tags and doing one retrieval call and then creating any that are missing from that result and reassembling but honestly that will make this code even more unreadable and wouldn't really affect the worst case O(n) runtime of the function (list of all new tags). Feel like over-engineering to me at the cost of readability (already an issue in this codebase) but I'm new here so.... 😎

steverecio commented 8 months ago

Continuing to dig into this failure...

The issue appears to come from get_queryset. If I print the query getting executed, there is already a distinct clause but because of the through table entry on the multi-inheritance model the distinct clause is ineffective in deduplicating the query. However, we can't pass in the tag pk for deduplicating because we would have to pass in the tag pk to the order_by query for distinct() to accept that param.

Here is the query:

SELECT DISTINCT "taggit_tag"."id", "taggit_tag"."name", "taggit_tag"."slug", "taggit_taggeditem"."id" FROM "taggit_tag" INNER JOIN "taggit_taggeditem" ON ("taggit_tag"."id" = "taggit_taggeditem"."tag_id") INNER JOIN "django_content_type" ON ("taggit_taggeditem"."content_type_id" = "django_content_type"."id") WHERE ("django_content_type"."app_label" = tests AND "django_content_type"."model" = food AND "taggit_taggeditem"."object_id" = 1) ORDER BY "taggit_taggeditem"."id" ASC

UPDATE: I was able to get this passing by adding this deduplication snippet. The only remaining tests that fail are related to the asserting the number of queries executed. Are we able to loosen those test restrictions? @rtpg

rtpg commented 8 months ago

SELECT DISTINCT "taggit_tag"."id", "tests_taggedcustompk"."id" FROM "taggit_tag" INNER JOIN "tests_taggedcustompk" ON ("taggit_tag"."id" = "tests_taggedcustompk"."tag_id") INNER JOIN "django_content_type" ON ("tests_taggedcustompk"."content_type_id" = "django_content_type"."id") WHERE ("django_content_type"."app_label" = 'tests' AND "django_content_type"."model" = 'custompkfood') ORDER BY "tests_taggedcustompk"."id" ASC

On test_most_common_lazy, this query is getting generated. This looks like something pretty hefty (my first read is that this is going to read all of the tags of a certain type in from the database). I think we will want to figure out a way to avoid this (and somehow make this extra query lazy again, as I bet in that case the ORM will do the right thing).

I scrolled through the diff on the web just now and can't figure out exactly what line introduced this query though.

steverecio commented 8 months ago

SELECT DISTINCT "taggit_tag"."id", "tests_taggedcustompk"."id" FROM "taggit_tag" INNER JOIN "tests_taggedcustompk" ON ("taggit_tag"."id" = "tests_taggedcustompk"."tag_id") INNER JOIN "django_content_type" ON ("tests_taggedcustompk"."content_type_id" = "django_content_type"."id") WHERE ("django_content_type"."app_label" = 'tests' AND "django_content_type"."model" = 'custompkfood') ORDER BY "tests_taggedcustompk"."id" ASC

On test_most_common_lazy, this query is getting generated. This looks like something pretty hefty (my first read is that this is going to read all of the tags of a certain type in from the database). I think we will want to figure out a way to avoid this (and somehow make this extra query lazy again, as I bet in that case the ORM will do the right thing).

I scrolled through the diff on the web just now and can't figure out exactly what line introduced this query though.

This is related to the new get_queryset function: https://github.com/jazzband/django-taggit/pull/871/files#diff-2f4b79363da4c4d9048dbf680784ffbdee2400138f9672c19e2e7e73e59093d0R89

I have to execute the queryset (via values_list) in order to dedupe it because the distinct() function doesn't work with the ordering query attached when you query the tags on the model itself.

We are ensuring tags are a unique set when we tag an individual entry in the table (via _to_tag_model_instances). So if we order by the through table pk, we won't get any duplicates when querying a model entry. However, if we retrieve the tags for the entire table, the distinct clause sees the through table values from the order by clause and doesn't effectively deduplicate. (The issue lies in using the same get_queryset function for both obj.tags.all() and MyModel.tags.all())

For example, without the deduplication logic, the following will be true:

apple = Food.objects.create(name='apple')
apple.tags.set(['green', 'green'])
assertTagsEqual(apple.tags.all(), ['green'])
assertTagsEqual(Food.tags.all(), ['green'])

grape = Food.objects.create(name='grape')
grape.tags.add('green')
assertTagsEqual(Food.tags.all(), ['green', 'green'])

Does that make sense? One possible solution is that when the manager is attached to a model entry, we enforce tag ordering by the through table pk and when its attached to the model itself we don't enforce any ordering. Honestly, I'm not sure what the best solution for this is but I think I understand the underlying issue at the very least.

UPDATE: I was able to fix most of the tests by only adding the ordering if the instance is not None. Only remaining issue is the 34 vs 32 queries.

steverecio commented 7 months ago

@rtpg friendly ping :)

I was able to fix most of the tests by only applying the ordering if the instance is not None. This way, the ordering is not applied automatically when querying for tags attached to a model, only for tags attached to a model instance. Only remaining issue is the 34 vs 32 queries.

stefanivic commented 1 month ago

@steverecio @rtpg any updates on this, we ran into the same issue so was wondering if this feature was abandoned or will not be merged due to some other reason ? I can take it over if you moved on.

rtpg commented 1 month ago

@stefanivic could you specifically describe the issue you had, so we can design some test cases and properly say what issue we're trying to fix?

Basically, would like to fix any issues, but do not want to add any extra queries, so I think we might need to start a new branch. But would like to have a specific ask here

stefanivic commented 1 month ago

@stefanivic could you specifically describe the issue you had, so we can design some test cases and properly say what issue we're trying to fix?

Basically, would like to fix any issues, but do not want to add any extra queries, so I think we might need to start a new branch. But would like to have a specific ask here

Pretty much what the referenced issue and PR talk about. When adding tags, their order is not respected, instead the order seems to change during the save. For example if I add the tags 1, 3, 5, 2, 4, after the save these will be listed as 1, 2, 3, 4, 5. Based on what I have seen it seems to order them by name ( alphabetical asc ). If it's required I can make a small demo to show.

Our clients tend to define tags in the order of importance to them, so the ones they want people to click the most go on the first two positions. But after the reorder they can end up last.

I have tried adding a sort_order (int) field to the ItemBase and then returning them in the specific order, but the Set() seems to reorder them regardless.

steverecio commented 1 month ago

@stefanivic could you specifically describe the issue you had, so we can design some test cases and properly say what issue we're trying to fix?

Basically, would like to fix any issues, but do not want to add any extra queries, so I think we might need to start a new branch. But would like to have a specific ask here

@rtpg There are some new test cases in this PR that highlight the issue.

rtpg commented 1 week ago

This one was implemented in #892