openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

standard UTF8 encoding for MediaWiki databases #373

Open Tigerfell opened 4 years ago

Tigerfell commented 4 years ago

I would like to ask you to change the character encoding of the MySQL databases for MediaWiki installations to utf8mb4. It looks like they currently use utf8, which in MySQL stores at most three bytes per character. It is not a real UTF-8 encoding: many characters cannot be stored. This results in wiki pages being trimmed when someone enters an unsupported character [1]. There are currently two use cases which require "standard" UTF8 encoding.
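To make the three-byte limit concrete, here is a minimal Python sketch. The helper names `needs_four_bytes` and `fits_mysql_utf8` are made up for illustration. It checks whether a string contains any character whose UTF-8 encoding needs four bytes, i.e. any code point above U+FFFF. Such characters fit in utf8mb4 but not in MySQL's three-byte utf8.

```python
def needs_four_bytes(ch: str) -> bool:
    """Return True if this single character takes four bytes in UTF-8."""
    return len(ch.encode("utf-8")) == 4


def fits_mysql_utf8(text: str) -> bool:
    """Return True if every character fits in MySQL's three-byte utf8.

    Hypothetical helper: it only looks at encoded byte lengths and does
    not talk to a database.
    """
    return not any(needs_four_bytes(ch) for ch in text)


if __name__ == "__main__":
    samples = [
        "caf\u00e9",      # Latin text with an accent: two bytes at most
        "\u6771\u4eac",   # common CJK characters: three bytes each
        "\U0001F600",     # an emoji: four bytes
        "\U00020000",     # a CJK Extension B character: four bytes
    ]
    for sample in samples:
        print(repr(sample), fits_mysql_utf8(sample))
```

A check like this could, for example, be run over page text before saving to see whether an edit would be affected.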

Additionally, the current encoding is deprecated according to MySQL 8 documentation [3] and will be removed. The wiki currently uses MySQL 5.7.29.

[1] https://wiki.openstreetmap.org/w/index.php?title=Bot&diff=prev&oldid=1784135
[2] https://wiki.openstreetmap.org/wiki/MediaWiki:Gadget-HotCat.js
[3] https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html

tomhughes commented 4 years ago

As I said on email a week ago, doing the conversion is easy enough though it will need about half an hour of downtime for the main wiki.

My big concern is how to make sure that any new schema changes made by future mediawiki updates use the new encoding - the current tables do not appear to be using the default database encoding which suggests that mediawiki is setting the encoding explicitly when it creates tables and columns.

matthewdarwin commented 4 years ago

Set the default in the server settings config file?

default-character-set = utf8mb4

tomhughes commented 4 years ago

Which will achieve what exactly?

I've already explained that the existing tables have a character set that does not match the current default character set, so I have to assume that mediawiki specified a character set explicitly when creating them, and would do so again for any new columns or tables.

Changing the default would do nothing at all to change that.
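As a rough sketch of how the mismatch described above could be listed: MySQL exposes per-column character sets in `information_schema.COLUMNS` (columns `TABLE_NAME`, `COLUMN_NAME`, `CHARACTER_SET_NAME`). The Python below assumes rows have already been fetched from that view; the function name, the query constant, and the sample rows are hypothetical and are not data from the wiki.

```python
# Query one might run to fetch the rows; the schema name is a parameter.
# Shown for context only, nothing in this sketch executes it.
COLUMN_CHARSET_QUERY = (
    "SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME "
    "FROM information_schema.COLUMNS "
    "WHERE TABLE_SCHEMA = %s AND CHARACTER_SET_NAME IS NOT NULL"
)


def columns_with_other_charset(rows, expected):
    """Return (table, column, charset) rows whose charset differs from expected.

    rows: iterable of (table_name, column_name, character_set_name) tuples.
    expected: the character set the database default would give, e.g. "utf8mb4".
    """
    return [row for row in rows if row[2] != expected]


if __name__ == "__main__":
    # Hypothetical sample rows, for illustration only.
    sample_rows = [
        ("example_a", "title", "utf8"),
        ("example_a", "comment", "utf8"),
        ("example_b", "name", "utf8mb4"),
    ]
    for table, column, charset in columns_with_other_charset(sample_rows, "utf8mb4"):
        print(f"{table}.{column} uses {charset}")
```

Columns that show up in such a list were given an explicit character set when they were created, which is why changing the server default would leave them untouched.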

tomhughes commented 4 years ago

Looks like mediawiki actually rejected using utf8mb4 (https://phabricator.wikimedia.org/T50767) and recommends using the binary encoding so that mysql just preserves whatever bytes mediawiki throws at it.

That is of course an insane solution, so probably what you'd expect from mediawiki :-(

tomhughes commented 4 years ago

If I understand https://phabricator.wikimedia.org/T196092 correctly then traditionally mediawiki had an option to enable UTF-8 during installation, which would cause it to explicitly use the utf8 encoding for tables, but they have now removed that option.

I just need to figure out how that impacts upgrades to existing installations and how to change it if necessary...

tomhughes commented 4 years ago

See also https://phabricator.wikimedia.org/T194125 where they seem to suggest that utf8mb4 won't actually work. That was closed with a reference to https://phabricator.wikimedia.org/T191231 as the solution and that is still open :-(

1ec5 commented 3 years ago

> This results in wiki pages being trimmed when someone enters a non-supported character

As a workaround, the user can invoke this Scribunto module with Unicode codepoints.

(So that others can find this issue more easily: this issue tracks the inability to insert literal emoji and certain CJK characters, among other things.)

pnorman commented 1 year ago

This will likely require dumping the wiki and reloading it and needs a priority set.

Firefishy commented 1 year ago

> This will likely require dumping the wiki and reloading it and needs a priority set.

Mediawiki does not have an internal backup method which backs up all users and content. The backup method documented on mediawiki.org is to dump a copy of the database, which would not fix any encoding issues. [DumpBackup.php]

DumpBackup.php quote: "XML dumps contain the content of the wiki (wiki pages with all their revisions), without the site-related data. DumpBackup.php does not create a full backup of the wiki database, the dump does not contain user accounts, images, edit logs, deleted revisions, etc."

Firefishy commented 1 year ago

This ticket has no concrete description of what action is needed. Please describe the steps required or point to documentation which shows what needs fixing.

The relevant LocalSettings.php settings are described here. We have it set to binary.

Firefishy commented 1 year ago

The closest thing I can find to "document" the change required is this throwaway comment, quote: "To migrate to binary run: ALTER TABLE table CONVERT TO CHARACTER SET binary; on all your tables"

I need proper documentation.
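For illustration only, the quoted comment amounts to one `ALTER TABLE ... CONVERT TO CHARACTER SET` statement per table. The Python sketch below just builds those statements as strings so they can be reviewed; it does not connect to or modify any database, and the function name and table names are hypothetical. It is not a tested migration procedure, and whether `binary` is the right target is exactly what this thread leaves open.

```python
import re


def convert_statements(tables, charset="binary"):
    """Build one ALTER TABLE ... CONVERT TO CHARACTER SET statement per table.

    tables: iterable of table names (hypothetical input, e.g. from SHOW TABLES).
    charset: target character set name; restricted to letters, digits and
             underscores because it is placed into the SQL text directly.
    """
    if not re.fullmatch(r"[A-Za-z0-9_]+", charset):
        raise ValueError(f"unexpected character set name: {charset!r}")
    statements = []
    for name in tables:
        # Double any backticks so the quoted identifier stays well-formed.
        quoted = "`" + name.replace("`", "``") + "`"
        statements.append(
            f"ALTER TABLE {quoted} CONVERT TO CHARACTER SET {charset};"
        )
    return statements


if __name__ == "__main__":
    # Placeholder table names, for illustration only.
    for statement in convert_statements(["example_a", "example_b"]):
        print(statement)
```

Printing the statements rather than running them keeps the step reviewable, which matters here given the lack of proper documentation.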

tomhughes commented 1 year ago

That sounds very wrong - isn't this about moving from mysql's weird utf8 to the more standard (but oddly named) utf8mb4?

Firefishy commented 1 year ago

> That sounds very wrong - isn't this about moving from mysql's weird utf8 to the more standard (but oddly named) utf8mb4?

My understanding is they gave up on getting extended Unicode working with MySQL's utf8mb4 etc and went with an internal implementation in PHP instead, while using binary in MySQL. There appears to be no proper documentation on "fixing" an existing database and I am not going to haphazardly proceed without it.

1ec5 commented 1 year ago

Correct, here’s a MediaWiki developer explaining some of the history behind this unusually complex implementation.

Firefishy commented 1 year ago

I would absolutely love to fix this issue, but I need clearer guidance on what the issue is and what the fix might be. If any data is needed I can extract it.