openzim / wp1

Wikipedia 1.0 engine & selection tools
https://wp1.openzim.org
GNU General Public License v2.0

Re-evaluate logic for deleting rows from `ratings` table #738

Open audiodude opened 4 months ago

audiodude commented 4 months ago

In #737, a WikiProject selection with multiple projects winds up with an article list that contains dozens of deleted articles.

This seems to happen because articles that have been deleted from English Wikipedia are never deleted from the `ratings` db table. The algorithm goes like this:

  1. Find all the articles in the "... by quality" category for the project
  2. Compare to all of the articles in the ratings table for that project
  3. For any articles that are in the db (ratings table) but not the category:
    1. Check if their quality/importance is already set to NotAClass. If so, skip
    2. Check, in 3 different ways, whether they have been moved.
    3. If so, set the move data for that log
    4. Regardless, set their quality or importance rating (or both) to NotAClass.
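
The steps above can be sketched roughly as follows. This is a simplified illustration, not the actual wp1 code: `reconcile`, `find_move`, the dict layout, and the `"NotA-Class"` spelling are all assumptions made for the example, and quality and importance are handled together here even though the real logic may treat them separately.

```python
NOT_A_CLASS = "NotA-Class"  # assumed spelling of the NotAClass marker


def reconcile(category_articles, db_ratings, find_move=lambda article: None):
    """Apply step 3 to every article that is in the db but not in the category.

    category_articles: set of titles found in the "... by quality" category.
    db_ratings: dict mapping title -> {"quality": ..., "importance": ...}.
    find_move: hypothetical stand-in for the move checks; returns the new
        title if the article was moved, else None.
    Returns a list of (old_title, new_title) pairs for detected moves.
    """
    moves = []
    for article, rating in db_ratings.items():
        if article in category_articles:
            continue  # still in the category, nothing to do
        # Step 3.1: already marked, so skip.
        if rating["quality"] == NOT_A_CLASS and rating["importance"] == NOT_A_CLASS:
            continue
        # Steps 3.2 and 3.3: record the move if one is detected.
        target = find_move(article)
        if target is not None:
            moves.append((article, target))
        # Step 3.4: regardless, mark the ratings as NotAClass.
        rating["quality"] = NOT_A_CLASS
        rating["importance"] = NOT_A_CLASS
    return moves


db = {
    "Kept": {"quality": "B-Class", "importance": "Low-Class"},
    "Gone": {"quality": "Stub-Class", "importance": "Low-Class"},
}
moves = reconcile({"Kept"}, db)
```

Note that nothing in this pass removes a row. It only marks rows so that the separate deletion logic can pick them up later.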

There is additional separate logic for deleting articles with this WHERE clause:

        WHERE r_project=%(r_project)s AND
              (r_quality IS NULL OR r_quality=%(not_a_class)s) AND
              (r_importance IS NULL OR r_importance=%(not_a_class)s)

So the bug is that pages in other namespaces, like these Category pages:

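+-----------+-------------+------------+----------------+----------------------+--------------+------------------------+---------+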
| r_project | r_namespace | r_article  | r_quality      | r_quality_timestamp  | r_importance | r_importance_timestamp | r_score |
+-----------+-------------+------------+----------------+----------------------+--------------+------------------------+---------+
| Theatre   |          14 | 1725_plays | Category-Class | 2011-07-07T04:40:07Z | NA-Class     | 2011-07-07T04:40:07Z   |       0 |
| Years     |          14 | 1725_plays | Category-Class | 2016-12-31T09:22:04Z | NA-Class     | 2016-12-31T09:22:04Z   |       0 |
+-----------+-------------+------------+----------------+----------------------+--------------+------------------------+---------+

end up never being deleted!

My guess is that we could change the WHERE clause to include an `OR r_namespace > 0` condition.

audiodude commented 4 months ago

My evaluation above is incorrect, because of course NA-Class is not the same as NotA-Class. Of course!
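
A quick way to see this is to run the WHERE clause against a row like the one above. This is only a sketch using an in-memory SQLite table, not the real database: the `DELETE FROM ratings` prefix, the reduced column set, the two `Hypothetical_*` rows, and the `"NotA-Class"` value are assumptions, and SQLite's `:name` placeholders stand in for the `%(name)s` style of the original query.

```python
import sqlite3

NOT_A_CLASS = "NotA-Class"  # assumed value bound to not_a_class

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings ("
    "r_project TEXT, r_namespace INTEGER, r_article TEXT, "
    "r_quality TEXT, r_importance TEXT)"
)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?, ?, ?)",
    [
        # The Category page from the table above.
        ("Theatre", 14, "1725_plays", "Category-Class", "NA-Class"),
        # Hypothetical rows that the clause is meant to catch.
        ("Theatre", 0, "Hypothetical_removed_article", NOT_A_CLASS, NOT_A_CLASS),
        ("Theatre", 0, "Hypothetical_unrated_article", None, None),
    ],
)

# Same predicate as the WHERE clause quoted earlier.
conn.execute(
    """
    DELETE FROM ratings
    WHERE r_project = :r_project AND
          (r_quality IS NULL OR r_quality = :not_a_class) AND
          (r_importance IS NULL OR r_importance = :not_a_class)
    """,
    {"r_project": "Theatre", "not_a_class": NOT_A_CLASS},
)

remaining = [
    row[0] for row in conn.execute("SELECT r_article FROM ratings ORDER BY r_article")
]
print(remaining)  # ['1725_plays']
```

The Category row survives because neither `Category-Class` nor `NA-Class` equals the NotA-Class marker, so the clause is behaving as written and the stale row must come from somewhere else.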

So there's some other reason why 1725_plays, which is deleted, still appears in the ratings table.

The bot is running right now so I don't want to mess with debugging this just yet.