nystudio107 / craft-similar

Similar for Craft lets you find elements, Entries, Categories, Commerce Products, etc, that are similar, based on... other related elements.
https://nystudio107.com/
MIT License
26 stars 5 forks source link

Incorrect results (Incorrect `count` of shared relations, incorrect order) #6

Closed michaelrog closed 6 years ago

michaelrog commented 6 years ago

The find method does not product correct results.

Here's my test setup:

--- Source Product ---

{% set testProduct = craft.products({id: 19641}).one() %}

{{ testProduct.title }}

{{ testProduct.productKeywords.count }} tags

--- Similar Products ---

{% set similarProducts = craft.similar.find({ element: testProduct, context: testProduct.productKeywords, criteria: []}) %}

{% for similarProduct in similarProducts %}

    {{ similarProduct.title }}

    {{ similarProduct.productKeywords.count }} tags
    {{ similarProduct.count }} tags supposedly in common
    {{ testProduct.productKeywords.ids | intersect(similarProduct.productKeywords.ids) | length }} tags actually in common lol

{% endfor %}

The results of this test, in my case, are something like:

--- Source Product ---

Black Board with Smaller Black Boards
6 tags

--- Similar Products ---

Kids Walking Out of Decorated Building
39 tags 
42 tags supposedly in common 
2 tags actually in common lol

Wood Storage Box in a Kids Playroom
36 tags 
39 tags supposedly in common 
3 tags actually in common lol

Two Girls Sledding Down a Hill
35 tags 
38 tags supposedly in common 
2 tags actually in common lol

The count values are not correct.

(In addition to being incorrect in their own faces, they also produce an incorrect order of the results, which is especially problematic if a limit is applied in the criteria.)

khalwat commented 6 years ago

PRs welcome!

Anubarak commented 6 years ago

@michaelrog do you have a multisite setup by chance or do you use multiple different fields to relate your elements instead of just a single one?

michaelrog commented 6 years ago

Results are consistently incorrect regardless of single/multiple sites or single/multiple fields. (I tested variants of all the above.)

I even reproduced the bug using a DB dump of nystudio107.com from @khalwat. 😕

khalwat commented 6 years ago

Where's the PR @michaelrog ? 🥇

Anubarak commented 6 years ago

@michaelrog would you mind sending me this dump? Because I was not able to reproduce it. I created 100 entries and 8 tags and linked those randomly but every test was successful

It would be much easier to have a good test case

michaelrog commented 6 years ago

@Anubarak — It's @khalwat's site, so I'll defer to him to share the db dump... But the results when I ran my test template with my own site were the same, and I suspect it would be likewise with your site: Notably, for each result, the number of purported "tags in common" is greater than the total number of tags assigned to either the original source element or the related one. (Actually comparing the IDs between the source element and each related element reveals the true count of shared tags.)

@khalwat — when I get back from vacation, and finish porting my Craft 3 plugins, I'll try to rewrite your SQL. For now I'll stick to just complaining about it. <3

khalwat commented 6 years ago

I think he's right @Anubarak -- for instance, check out this article here:

https://nystudio107.com/blog/using-vuejs-2-0-with-craft-cms

The tags are:

#craftcms #vuejs #frontend

...yet the most "similar" articles it finds are:

...which really aren't similar at all. They are just the 3 most recent articles. It's using this code:

            {% set limitCriteria = craft.entries.limit(3) %}
            {% set similarEntriesByTags = craft.similar.find({ element: entry, context: entry.blogTags, criteria: limitCriteria }) %}

Basically, it's asking for the 3 most recent entries that are related to this blog's tags. The resulting SQL is:

'SELECT `elements`.`id`, `elements_sites`.`siteId`, COUNT(*) as count
FROM (SELECT `elements`.`id` AS `elementsId`, `elements_sites`.`id` AS `elementsSitesId`, `content`.`id` AS `contentId`
FROM `craft_elements` `elements`
INNER JOIN `craft_entries` `entries` ON `entries`.`id` = `elements`.`id`
INNER JOIN `craft_elements_sites` `elements_sites` ON `elements_sites`.`elementId` = `elements`.`id`
INNER JOIN `craft_content` `content` ON `content`.`elementId` = `elements`.`id`
LEFT JOIN `craft_structureelements` `structureelements` ON `structureelements`.`elementId` = `elements`.`id`
LEFT JOIN `craft_relations` ON elements.id=`craft_relations`.sourceId
WHERE (`elements_sites`.`siteId`=\'1\') AND ((elements.id != \'315\') AND (`craft_relations`.`targetId` IN (\'79\', \'313\', \'95\'))) AND (`content`.`siteId`=\'1\') AND (`elements`.`archived`=FALSE) AND (((`elements`.`enabled`=TRUE) AND (`elements_sites`.`enabled`=TRUE)) AND (`entries`.`postDate` <= \'2018-08-04 13:22:03\') AND ((`entries`.`expiryDate` IS NULL) OR (`entries`.`expiryDate` > \'2018-08-04 13:22:03\'))) AND (`elements_sites`.`enabled`=TRUE)
GROUP BY `elements`.`id`
ORDER BY `structureelements`.`lft`, `entries`.`postDate` DESC
LIMIT 3) `subquery`
INNER JOIN `craft_entries` `entries` ON `entries`.`id` = `subquery`.`elementsId`
INNER JOIN `craft_elements` `elements` ON `elements`.`id` = `subquery`.`elementsId`
INNER JOIN `craft_elements_sites` `elements_sites` ON `elements_sites`.`id` = `subquery`.`elementsSitesId`
INNER JOIN `craft_content` `content` ON `content`.`id` = `subquery`.`contentId`
LEFT JOIN `craft_structureelements` `structureelements` ON `structureelements`.`elementId` = `subquery`.`elementsId`
LEFT JOIN `craft_relations` ON elements.id=`craft_relations`.sourceId
GROUP BY `craft_relations`.`sourceId`
ORDER BY `count` DESC'
khalwat commented 6 years ago

So interestingly, the count that is returned is what I think is wrong here. If I change the code to:

                {% set limitCriteria = craft.entries.limit(100) %}
                {% set similarEntriesByTags = craft.similar.find({ element: entry, context: entry.blogTags, criteria: limitCriteria }) %}

I end up with an array of entries, all of which have a count of 5 (??), except for the last item in the array, which has a count of 4.

khalwat commented 6 years ago

The part I don't understand is why we're ending up with all of the entries in the found set to begin with. It's clearly doing:

WHERE (`elements_sites`.`siteId`=\'1\') AND ((elements.id != \'315\') AND (`craft_relations`.`targetId` IN (\'79\', \'313\', \'95\'))) AND (`content`.`siteId`=\'1\') AND (`elements`.`archived`=FALSE) AND (((`elements`.`enabled`=TRUE) AND (`elements_sites`.`enabled`=TRUE)) AND (`entries`.`postDate` <= \'2018-08-04 13:22:03\') AND ((`entries`.`expiryDate` IS NULL) OR (`entries`.`expiryDate` > \'2018-08-04 13:22:03\'))) AND (`elements_sites`.`enabled`=TRUE)

...to limit what it finds to elements that have relations to just: (craft_relations.targetIdIN (\'79\', \'313\', \'95\')))

Which are:

...yet it's returning elements that do not have craft_relations to these targetIds at all....

khalwat commented 6 years ago

So here's a thought... @Anubarak added in:

    $query->subQuery->groupBy('elements.id');

To group all of the results that have the same id together, but it almost feels like the COUNT is then becoming cumulative

It's unclear to me why it'd be capped at 5 but I think this might be what is going on...

khalwat commented 6 years ago

So I think it's actually the opposite that is happening. What seems to be going on is that for whatever reason, elements that have 1 thing in common end up with a count of 5. Elements that have 2 things in common end up with a count of 10, elements with 3 things in common end up with a count of 15, and so on.

The subquery:

        $query->subQuery->groupBy('elements.id');

...causes them to be grouped such that the count is always the lowest (first?), so they are all the same in terms of ranking.

khalwat commented 6 years ago

So it looks to me like the WHERE part of the query is actually correct, and it is returning only elements that have the appropriate relations.

The incorrect part is currently how we're grouping the results together, which is not using the correct (highest) count.

This is related to: https://github.com/nystudio107/craft-similar/issues/1

...in that if we could move the limit criteria to the end of the query, this would solve the issue.

khalwat commented 6 years ago

Fixed in https://github.com/nystudio107/craft-similar/commit/066948c78090a49280f3b658c538fa1716f20bc1 thanks to @Anubarak