openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Category problems on the wiki. job queue #165

Closed harry-wood closed 5 years ago

harry-wood commented 7 years ago

Some problems with wiki categories described here: https://wiki.openstreetmap.org/wiki/User_talk:Harry_Wood#Wiki_categories_seem_to_be_broken

The suggestion that this is a problem with the job queue sounds plausible. MediaWiki has a job queue for background tasks. Maybe it stopped running them recently?

mmd-osm commented 7 years ago

For further discussion on this topic, see http://wiki.openstreetmap.org/wiki/User_talk:Reneman#Comments_with_your_recent_deletions.2C_and_MediaWiki_1.28_problems

Can someone with server admin rights please have a look and follow up on that discussion?

Thanks!

//ping @Firefishy (as mentioned by Wiki admin Reneman).

Firefishy commented 7 years ago

The number of jobs is very low: https://wiki.openstreetmap.org/w/api.php?action=query&meta=siteinfo&siprop=statistics
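
For anyone who wants to watch this number over time, here is a minimal Python sketch that reads the `jobs` field from the statistics endpoint linked above. It assumes `format=json` is appended to the URL; the function names and the User-Agent string are made up for illustration. MediaWiki documents this count as an estimate, so treat it as a trend indicator rather than an exact figure.

```python
import json
import urllib.request

# Endpoint from the comment above, with format=json added so the reply is JSON.
STATS_URL = (
    "https://wiki.openstreetmap.org/w/api.php"
    "?action=query&meta=siteinfo&siprop=statistics&format=json"
)


def extract_job_count(payload: dict) -> int:
    """Return the job queue length from a decoded siteinfo statistics reply."""
    return int(payload["query"]["statistics"]["jobs"])


def fetch_job_count(url: str = STATS_URL, timeout: float = 10.0) -> int:
    """Fetch the statistics from the live API and return the job count."""
    request = urllib.request.Request(
        url,
        # Hypothetical User-Agent; use something that identifies your own tool.
        headers={"User-Agent": "job-queue-check/0.1 (example)"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_job_count(json.load(response))
```

Calling `fetch_job_count()` from a scheduled task and logging the result would show whether the queue is draining or growing.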

Firefishy commented 7 years ago

I have started a full refreshLinks.php run.

verdy-p commented 7 years ago

Note that the number of jobs is back to very high. These are all the pending changes from deferred tasks that failed and were converted to background jobs, which also failed and are retried very slowly, up to 3 times, before being discarded. All these failed jobs should produce a lot of error messages in the MediaWiki server log. Once they have failed 3 times, they are abandoned for good.

With MediaWiki 1.28, the refreshLinks.php script will need to be run again and again to do what the background jobs are not doing, because of the ongoing unsolved bug in this version. This is not a one-time fix: you'll need to schedule refreshLinks.php to run on an ongoing basis. You should have seen that it causes a lot of server activity: it forces all pages to be loaded and parsed again, then performs many SQL requests to patch the contents of the backlink tables. The script refreshes multiple tables and their indexes for various features, not just the table of category member lists, and all of them are affected by the issue in the MediaWiki background job executor.

Note also this link to a summary of some known issues in MediaWiki related to the background jobs (those created by converting deferred tasks that have failed). It is a backup subpage of my talk page, with discussion from other people affected by the problem who contacted me, and for whom I made some investigations:

https://wiki.openstreetmap.org/wiki/User_talk:Verdy_p/Archive_2017_Jun

You'll see a lot of links to issues reported to MediaWiki's own issue tracker (Phabricator), where many people are discussing the problem on various wikis (not just those of Wikimedia). The issue is quite complex and has multiple internal causes, some of them still not completely understood, and several attempted patches have failed to resolve it (almost all patches created further issues elsewhere). The issues started being reported in November 2016, in MediaWiki 1.27, but there were easy workarounds for them that no longer work in MediaWiki 1.28. Versions 1.29 and now 1.30 have been created since June to try to patch the problem (but they added new ones).

I add this page because of the links it includes, which could help find solutions together with other MediaWiki server admins and MediaWiki developers. Wikimedia itself has its own list of issues for the deployment on its hosted projects, but those are much more complex because of the advanced configuration of its server farm and its own MediaWiki extensions. They have many more extensions than we have here, as well as their own sets of specific admin tools, developed separately from MediaWiki, and many of the solutions they find cannot be used on most other wikis. They also use another PHP engine and another SQL engine.

HolgerJeromin commented 7 years ago

We still have problems. This new category seems empty, but we have some items in it.

Other parts of the wiki are not fully working either: the mentioned file appears to be unused, but it is used here.

verdy-p commented 7 years ago

I had already warned that running "refreshLinks.php" only once would not be enough, because MediaWiki still does not run the background jobs with the currently deployed version.

I have asked the admins to reschedule the job regularly (as Wikimedia already does on its own servers because of the same defect, but much more aggressively than we can on the OSM wiki server). My opinion is that it should rerun weekly, with a server admin looking at the job logs to detect further problems. It could run, for example, in the night between Sunday and Monday, with an admin looking at the logs on Monday evening or the next morning to detect possible performance problems and help tune the parameters (notably the batch size) to avoid blocking too many resources on the server for too long. Ideally, each batch should not take longer than a couple of hours on average, allowing the job to be rescheduled every 3-4 hours to move on to the next batch of page numbers.

A basic shell script can detect whether a job is still running (to avoid launching another one in parallel). This shell script can then be scheduled to run every hour to see whether there is something to do and to determine whether a new range of page numbers can be processed.
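
The "is a job still running?" check described above can also be sketched in Python instead of shell. This is only an illustration under assumptions: the lock path is hypothetical, `single_instance` is a made-up helper name, and it relies on POSIX `flock`, so it works on Linux hosts but not on Windows.

```python
import fcntl
import os
from contextlib import contextmanager

# Hypothetical lock path, chosen for illustration only.
LOCK_PATH = "/tmp/refreshlinks.lock"


@contextmanager
def single_instance(lock_path: str = LOCK_PATH):
    """Yield True if this process holds the lock, False if another run is active."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # Non-blocking exclusive lock: fails at once if another run holds it.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

An hourly task would wrap its work in `with single_instance() as acquired:` and exit immediately when `acquired` is false. The kernel releases the lock if the process dies, so a crashed run does not leave a stale lock behind.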

On Wikimedia wikis, the refreshLinks.php script uses small batch sizes, but new batches are submitted very frequently (every 1 or 2 minutes, or less often if other processes have priority or need more resources, notably during full backups or during deployment of a new software version). Wikimedia also has fewer problems because its SQL database is much more powerful on its highly redundant server farm, and it also has site-specific scripts to tune and distribute the work across this farm.

But apparently on the OSM wiki, the designated server admins are too busy to take this on: OSM should recruit some other admins (possibly even granting admin privileges to volunteer Wikimedia admins who could help, but would work here under OSM Foundation policy and possibly a contractual agreement).

gravitystorm commented 7 years ago

@verdy-p Thanks for your input. Yes, all the server admins are very busy, and we're always looking for more help. The best way to solve this problem is to make whatever changes are necessary (including creating cron entries to run the script you mention) to our chef repository.

If nobody steps up and creates a fix, then it won't get fixed. You don't need to be a member of the Sysadmin team to fix the problem! Anyone can submit the fix and we'll help anyone who wants to work on it.

If anyone would like to join the Sysadmin group, then please have a look at the Sysadmin Membership Policy. We'd certainly appreciate the help.

verdy-p commented 7 years ago

It's too difficult to create and tune such a script without a direct view of what is currently running on the host and basic knowledge of the deployment options and connectivity, given that it is almost undocumented and has changed repeatedly without the current sysadmins reporting what was done. Even running a simple server-side script can cause unexpected load that would prevent other tools from running concurrently. Finally, this may just be a VM within a larger server that has other VMs competing for performance, and the attached storage may also be a shared resource.

However, the script itself mostly puts load on the SQL server to perform updates, even if it also uses the MediaWiki and PHP software to reparse the pages. And as there is no hint at all about which pages should be reparsed, all wiki pages will be read and parsed. The more the wiki grows, the longer it will take to complete. Such a script also requires some monitoring: at least some logs, which need their own storage, and some status files to control what it does and when, or to tune the number of pages to parse in a single run before rescheduling. The script should not be run concurrently in another instance before the first one has completed one pass. As is, the "refreshLinks.php" script does not contain anything to control that.

As well, you may already have your own scheduling rules, and "cron" may not be the only scheduling agent (on Linux there's also "systemd", and there are agents in the SQL installation itself, plus some agents for the PHP engine itself). Such a script may also depend on the version of Linux you use.

It's not complex to write, but it first requires intimate knowledge of the system and elevated privileges on it, in order to open a remote SSH session on it (establishing the session itself may have additional requirements and may not work with all SSH clients).

So I don't see how to help you here: only the existing admins who installed the wiki know what they can do, and they need to contact MediaWiki support if needed.
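
The status-file idea mentioned above could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the helper names, the JSON state format, and the relative script path are invented, and the `-e <end>` option with a positional start page ID follows Manual:RefreshLinks.php as I understand it, so it should be checked against the installed MediaWiki version before use.

```python
import json
from pathlib import Path
from typing import List, Tuple


def next_batch(last_end: int, batch_size: int, max_page_id: int) -> Tuple[int, int]:
    """Return the inclusive (start, end) page ID range that follows last_end.

    Wraps around to page ID 1 once the previous batch reached max_page_id.
    """
    start = 1 if last_end >= max_page_id else last_end + 1
    end = min(start + batch_size - 1, max_page_id)
    return start, end


def load_last_end(state_file: Path) -> int:
    """Read the last processed page ID, or 0 if no state has been saved yet."""
    if not state_file.exists():
        return 0
    return int(json.loads(state_file.read_text())["last_end"])


def save_last_end(state_file: Path, last_end: int) -> None:
    """Persist the last processed page ID for the next scheduled run."""
    state_file.write_text(json.dumps({"last_end": last_end}))


def build_command(start: int, end: int) -> List[str]:
    """Build one batch invocation (script path and flags are assumptions)."""
    return ["php", "maintenance/refreshLinks.php", "-e", str(end), str(start)]
```

A scheduled run would call `load_last_end`, compute `next_batch`, execute the command from `build_command` (for example with `subprocess.run`), and call `save_last_end` only after the command succeeds, so a failed batch is retried on the next run.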

grinapo commented 7 years ago

If I understand correctly, a patch for chef that runs the script weekly would help the situation somewhat. (The admins could be asked to share the log so that it can be analysed from outside.) I don't use chef, so I'm not eager to do it myself.

matkoniecz commented 7 years ago

Is setting https://www.mediawiki.org/wiki/Manual:RefreshLinks.php to run regularly in https://github.com/openstreetmap/chef/tree/4670fb28130f0c5f6371ddc4932bf856725ee589/cookbooks/wiki a proper solution?

verdy-p commented 7 years ago

For now, scheduling this script to run regularly is the only option. This was never actually mentioned when MediaWiki 1.28 was released as a "stable" version. This is really a severe bug that breaks many wikis, whose categories (notably) and "what links here" feature no longer work at all: deferred jobs are simply not running, they fail, and they fail again when they are transformed into background jobs that are supposed to run asynchronously. Only running the "RefreshLinks.php" script regularly (by a system admin) currently solves the problem temporarily, but this requires permanent active work and the constant presence of a system admin. You cannot run it correctly from the admin web interface (it takes too long to execute, the session is lost before the end, it can fail in the middle and has to be restarted, and you need to read logs to see what remains to do), and otherwise it requires access to the underlying shell console on the host (difficult when this access is very restricted).

MediaWiki 1.28 is a disaster for the maintenance of many wikis that have few sysadmins, or whose sysadmins should not be required to constantly monitor such things. It should NEVER have been called a "stable" release, even if it works on Wikimedia wikis with their hundreds of sysadmins and a lot of local tweaking! Further versions 1.29 and 1.30+ have still not solved the problem at all.

Only MediaWiki 1.27 should still be considered "stable". Consider telling people that they should revert to MW 1.27 if they managed to upgrade to this non-working version 1.28, which severely complicates the maintenance of wikis, and keep version 1.27 maintained.

Alternative: make a separate branch in more recent versions that is able to run without the deferred tasks that were added in 1.28 (notably the update of category members and "what links here"). These should still run synchronously by default; it does not actually take much time to perform the necessary SQL requests on a few indexing tables!

Tigerfell commented 5 years ago

@Firefishy Can you please trigger refreshLinks.php, or check whether the internal logs tell you anything more than what I know (queue length > 630 k, categories not updating anymore)?

Firefishy commented 5 years ago

Should be fixed now by https://github.com/openstreetmap/chef/commit/4aa20a01bd0c0c1e2e1814cccfd4549cba89eee0

Tigerfell commented 5 years ago

Thanks for fixing. This works pretty well now and surely better than before. :+1: Unfortunately, I encountered the category "Soft-redirected categories still containing pages", which seems to have stopped updating. Maybe this job was lost when the queue was very long (and there are potentially similar categories)? I could not find an error in the template which assigns this category and the pages themselves show the correct categories, just the list is wrong.

Firefishy commented 5 years ago

Now running refreshLinks.php weekly in https://github.com/openstreetmap/chef/commit/337b09717d393293f42859c0623ca588c3692ea5

Tigerfell commented 5 years ago

Great! Now it is fixed. Thanks!

Tigerfell commented 2 years ago

There was a huge spike in the queue length on 13 February 2022. Since then, the categorisation is essentially broken. Some pages are missing in their categories, others are in wrong ones. There is a discussion at Talk:Wiki about potential causes and fixes. We hoped that the weekly cron job would fix the issue. This was not the case. I guess we need some manual intervention. Maybe run the refreshLinks script and review the output?