Closed Keltena closed 1 month ago
oops, similar but different
I spent time investigating this tonight, out of sheer curiosity.
On Slack, @Keltena asked what the deal is with the self-join to pollvotes v2
in this code… we were all kinda stumped. I am now unstumped.
MariaDB [ifdb]> select gameid, userid, votedate from pollvotes where pollid='1h1zv1nreilkxmw3';
+------------------+------------------+---------------------+
| gameid | userid | votedate |
+------------------+------------------+---------------------+
| aearuuxv83plclpl | 0000000000000000 | 2024-08-20 00:39:42 |
| op0uw1gn1tjqmjt7 | 0000000000000000 | 2024-08-20 00:39:48 |
| aearuuxv83plclpl | 0000000000000001 | 2024-08-20 00:40:11 |
+------------------+------------------+---------------------+
3 rows in set (0.001 sec)
We start with a poll with three votes. The pollvotes table doesn't have a primary key (whyyyy??) so we'll tell them apart by date.
...000
), for Counterfeit Monkey (aearuuxv83plclpl), at 00:39:42test@ifdb.org
, for Counterfeit Monkey again, at 00:40:11MariaDB [ifdb]> select v.gameid, v.userid, v.votedate,
v2.gameid as gameid2, v2.userid as userid2, v2.votedate as votedate2
from pollvotes v
join pollvotes v2 on v2.pollid = v.pollid and v2.gameid = v.gameid
where v.pollid = '1h1zv1nreilkxmw3';
+------------------+------------------+---------------------+------------------+------------------+---------------------+
| gameid | userid | votedate | gameid2 | userid2 | votedate2 |
+------------------+------------------+---------------------+------------------+------------------+---------------------+
| aearuuxv83plclpl | 0000000000000000 | 2024-08-20 00:39:42 | aearuuxv83plclpl | 0000000000000000 | 2024-08-20 00:39:42 |
| aearuuxv83plclpl | 0000000000000000 | 2024-08-20 00:39:42 | aearuuxv83plclpl | 0000000000000001 | 2024-08-20 00:40:11 |
| op0uw1gn1tjqmjt7 | 0000000000000000 | 2024-08-20 00:39:48 | op0uw1gn1tjqmjt7 | 0000000000000000 | 2024-08-20 00:39:48 |
| aearuuxv83plclpl | 0000000000000001 | 2024-08-20 00:40:11 | aearuuxv83plclpl | 0000000000000000 | 2024-08-20 00:39:42 |
| aearuuxv83plclpl | 0000000000000001 | 2024-08-20 00:40:11 | aearuuxv83plclpl | 0000000000000001 | 2024-08-20 00:40:11 |
+------------------+------------------+---------------------+------------------+------------------+---------------------+
5 rows in set (0.000 sec)
In the second query, we query for all of the votes in a "self-join" on v2.gameid = v.gameid
, without grouping. There are five results.
The first vote, for Counterfeit Monkey, has two votes with the same gameid
: the first and the third vote. The second vote, for Anchorhead, only matches with itself. The third vote, has two votes with the same gameid
: the first and the third vote.
So, for each gameid
, some number N people have voted for that gameid
, and we'll expect N^2 rows with that gameid
in the result. There are five rows because two people voted for Counterfeit Monkey (2^2 = 4) and one person voted for Anchorhead (1^2 = 1), 4 + 1 = 5.
MariaDB [ifdb]> select count(v2.gameid) as votecount,
v.gameid, v.userid, v.votedate,
v2.gameid as gameid2, v2.userid as userid2, v2.votedate as votedate2
from pollvotes v
join pollvotes v2 on v2.pollid = v.pollid and v2.gameid = v.gameid
where v.pollid = '1h1zv1nreilkxmw3'
group by v.gameid, v.userid;
+-----------+------------------+------------------+---------------------+------------------+------------------+---------------------+
| votecount | gameid | userid | votedate | gameid2 | userid2 | votedate2 |
+-----------+------------------+------------------+---------------------+------------------+------------------+---------------------+
| 2 | aearuuxv83plclpl | 0000000000000000 | 2024-08-20 00:39:42 | aearuuxv83plclpl | 0000000000000000 | 2024-08-20 00:39:42 |
| 1 | op0uw1gn1tjqmjt7 | 0000000000000000 | 2024-08-20 00:39:48 | op0uw1gn1tjqmjt7 | 0000000000000000 | 2024-08-20 00:39:48 |
| 2 | aearuuxv83plclpl | 0000000000000001 | 2024-08-20 00:40:11 | aearuuxv83plclpl | 0000000000000000 | 2024-08-20 00:39:42 |
+-----------+------------------+------------------+---------------------+------------------+------------------+---------------------+
3 rows in set (0.002 sec)
In the third query, we group by v.gameid
and v.userid
, bringing us back to three rows with three unique v.votedate
values. We can see why this is if we manually group the five rows by eye.
v.gameid
and v.userid
.v.gameid
and v.userid
.But in the grouped query, we can add a count(v2.gameid) as votecount
. It's a count of how many votes were grouped for that v.gameid
.
But votecount
is a confusing idea here. What's being counted, exactly, and why? It's a little easier to see when you add in the order by
clause and see how the code iterates through the results.
MariaDB [ifdb]> select count(v2.gameid) as votecount,
v.gameid, v.userid, v.votedate,
v2.gameid as gameid2, v2.userid as userid2, v2.votedate as votedate2
from pollvotes v
join pollvotes v2 on v2.pollid = v.pollid and v2.gameid = v.gameid
where v.pollid = '1h1zv1nreilkxmw3'
group by v.gameid, v.userid
order by votecount desc, v.gameid, v.votedate;
+-----------+------------------+------------------+---------------------+------------------+------------------+---------------------+
| votecount | gameid | userid | votedate | gameid2 | userid2 | votedate2 |
+-----------+------------------+------------------+---------------------+------------------+------------------+---------------------+
| 2 | aearuuxv83plclpl | 0000000000000000 | 2024-08-20 00:39:42 | aearuuxv83plclpl | 0000000000000000 | 2024-08-20 00:39:42 |
| 2 | aearuuxv83plclpl | 0000000000000001 | 2024-08-20 00:40:11 | aearuuxv83plclpl | 0000000000000000 | 2024-08-20 00:39:42 |
| 1 | op0uw1gn1tjqmjt7 | 0000000000000000 | 2024-08-20 00:39:48 | op0uw1gn1tjqmjt7 | 0000000000000000 | 2024-08-20 00:39:48 |
+-----------+------------------+------------------+---------------------+------------------+------------------+---------------------+
3 rows in set (0.002 sec)
The code starts with $curGameID = '[start]';
and then says:
// if we're on to a new game, start the new game listing
if ($gameID != $curGameID) {
// ... blah blah blah ...
// show the basic game listing
echo "<td><a href=\"viewgame?id=$gameID\">"
. "<b>$gameTitle</b></a>, by $gameAuthor<br>"
. "<i>$voteCount vote" . ($voteCount != 1 ? "s" : "")
. "</i>";
// ... blah blah blah ...
// this is the current game now
$curGameID = $gameID;
}
So, as we walk through the three rows of results:
Now, finally, we can see why the sandbox exclusion isn't working quite right.
In this example, I marked test@ifdb.org
as sandboxed. The code query includes this line: and u.Sandbox in (0, $sandbox)
, so we can run this query to match it:
MariaDB [ifdb]> select v.gameid, v.userid, v.votedate,
v2.gameid as gameid2, v2.userid as userid2, v2.votedate as votedate2
from pollvotes v
join pollvotes v2 on v2.pollid = v.pollid and v2.gameid = v.gameid
join users u on u.id = v.userid
where v.pollid = '1h1zv1nreilkxmw3' and u.Sandbox in (0,0);
+------------------+------------------+---------------------+------------------+------------------+---------------------+
| gameid | userid | votedate | gameid2 | userid2 | votedate2 |
+------------------+------------------+---------------------+------------------+------------------+---------------------+
| aearuuxv83plclpl | 0000000000000000 | 2024-08-20 00:39:42 | aearuuxv83plclpl | 0000000000000000 | 2024-08-20 00:39:42 |
| aearuuxv83plclpl | 0000000000000000 | 2024-08-20 00:39:42 | aearuuxv83plclpl | 0000000000000001 | 2024-08-20 00:40:11 |
| op0uw1gn1tjqmjt7 | 0000000000000000 | 2024-08-20 00:39:48 | op0uw1gn1tjqmjt7 | 0000000000000000 | 2024-08-20 00:39:48 |
+------------------+------------------+---------------------+------------------+------------------+---------------------+
3 rows in set (0.003 sec)
Instead of five rows, we have three rows, dropping the last two rows of the previous ungrouped query, where the Test users appeared on the left-hand side. But we're not excluding the row where the Test user appears on the right-hand side.
So, when it comes time to group these three rows:
MariaDB [ifdb]> select count(v2.gameid) as votecount,
v.gameid, v.userid, v.votedate,
v2.gameid as gameid2, v2.userid as userid2, v2.votedate as votedate2
from pollvotes v
join pollvotes v2 on v2.pollid = v.pollid and v2.gameid = v.gameid
join users u on u.id = v.userid
where v.pollid = '1h1zv1nreilkxmw3' and u.Sandbox in (0,0)
group by v.gameid, v.userid;
+-----------+------------------+------------------+---------------------+------------------+------------------+---------------------+
| votecount | gameid | userid | votedate | gameid2 | userid2 | votedate2 |
+-----------+------------------+------------------+---------------------+------------------+------------------+---------------------+
| 2 | aearuuxv83plclpl | 0000000000000000 | 2024-08-20 00:39:42 | aearuuxv83plclpl | 0000000000000000 | 2024-08-20 00:39:42 |
| 1 | op0uw1gn1tjqmjt7 | 0000000000000000 | 2024-08-20 00:39:48 | op0uw1gn1tjqmjt7 | 0000000000000000 | 2024-08-20 00:39:48 |
+-----------+------------------+------------------+---------------------+------------------+------------------+---------------------+
We group the second row into the first row, yielding votecount
2. No sandboxed votes appear on the left-hand side, but they got counted in the votecount
.
So, what's the fix? We have to exclude sandboxed votes from the v2
query.
MariaDB [ifdb] > select count(v2.gameid) as votecount,
v.gameid, v.userid, v.votedate,
v2.gameid as gameid2, v2.userid as userid2, v2.votedate as votedate2
from pollvotes v
join pollvotes v2 on v2.pollid = v.pollid and v2.gameid = v.gameid
join users u on u.id = v.userid
join users u2 on u2.id = v2.userid
where v.pollid = '1h1zv1nreilkxmw3'
and u.Sandbox in (0,0)
and u2.Sandbox in (0, 0)
group by v.gameid, v.userid;
+-----------+------------------+------------------+---------------------+------------------+------------------+---------------------+
| votecount | gameid | userid | votedate | gameid2 | userid2 | votedate2 |
+-----------+------------------+------------------+---------------------+------------------+------------------+---------------------+
| 1 | aearuuxv83plclpl | 0000000000000000 | 2024-08-20 00:39:42 | aearuuxv83plclpl | 0000000000000000 | 2024-08-20 00:39:42 |
| 1 | op0uw1gn1tjqmjt7 | 0000000000000000 | 2024-08-20 00:39:48 | op0uw1gn1tjqmjt7 | 0000000000000000 | 2024-08-20 00:39:48 |
+-----------+------------------+------------------+---------------------+------------------+------------------+---------------------+
2 rows in set (0.001 sec)
That's better!
While poll pages hide votes from sandboxed users, they still count hidden votes towards each game's total. (For an example, compare listed vote count/rank vs. votes actually shown for this poll's top three games.)
Not sure why, but it looks like the count function that produces votecount (
count(v2.gameid)
) is including the sandboxed votes that the WHERE clause excludes. (Those rows are excluded from the final result; the count just seems to be executing before that somehow.)