matomo-org / matomo

Empowering People Ethically with the leading open source alternative to Google Analytics that gives you full control over your data. Matomo lets you easily collect data from websites & apps and visualise this data and extract insights. Privacy is built-in. Liberating Web Analytics. Star us on Github? +1. And we love Pull Requests!
https://matomo.org/
GNU General Public License v3.0
19.89k stars 2.65k forks source link

Add configuration for database connection collation #22564

Closed mneudert closed 1 month ago

mneudert commented 2 months ago

Description:

The MariaDB 11.5 release introduced a change in the default Unicode collation.

Due to the way Matomo connects to the database (i.e. sending SET NAMES without a COLLATE parameter), this can lead to a change in the collation used in some queries.

If the database was created with a previous MariaDB version, like 11.3, and the charset setting was configured as utf8mb4, the effective collation after SET NAMES should be utf8mb4_general_ci. This collation will then also be used to create the archive tables like archive_blob_2024_08.

With MariaDB 11.5, the collation will, and this may depend on the individual server configuration, change to utf8mb4_uca1400_ai_ci. And this collection will then be used to create a new archive table.

The problems arise when queries are using variables, for example during archiving:

SELECT name
FROM archive_blob_2024_08
WHERE idarchive IN (1,2,3) AND (name = 'Actions_sitesearch' OR (name LIKE 'Actions_sitesearch%' AND (SUBSTRING(name, 19, 7) = '_chunk_' OR (SUBSTRING(name, 20, 1) >= '0' AND SUBSTRING(name, 20, 1) <= '9'))))
ORDER BY CAST(IF((@idsubtable := SUBSTRING(SUBSTRING(name, IF(SUBSTRING(name, 19, 7) = '_chunk_', 26, 20)), 1, IF((@secondunderscore := LOCATE('_', SUBSTRING(name, IF(SUBSTRING(name, 19, 7) = '_chunk_', 26, 20))) - 1) < 0, LENGTH(name), @secondunderscore))) = '', -1, @idsubtable ) AS SIGNED) ASC;

If the table archive_blob_2024_08 was created using utf8mb4_general_ci, and the connection collation is set to utf8mb4_uca1400_ai_ci, the variable assignments in the WHERE clause will create a forbidden collation mix. And this breaks archiving.

While one way to work around that issue is to reconfigure the character_set_collations server configuration from utf8mb4=utf8mb4_uca1400_ai_ci to utf8mb4=utf8mb4_general_ci, this may not be possible in many environments, like shared hosting.

This PR introduces a new, optional, database setting connection_collation. If it is set alongside a database charset, this value will be passed to a SET NAMES ... COLLATE ... statement, setting the connection collation back to the value required for an uninterrupted service.

Config update

During installation and the next update, the config will be checked if a collation has been set.

If that is not the case, an automatic update is tried, based on the comparison of the collation of the user table and the one returned from SELECT @@collation_connection. If both are the same the configuration will be updated to this value.

If not, the update will check the most recent archive table (by name). If the collation of that table matches the users table, it will be chosen for the config update.

Otherwise no update will take place, so we don't accidentally break any setups. For this case the diagnostics have been updated to show the used connection collation and suggest updating the configuration manually to a suitable value.

Note: It can happen that, after an upgrade, a mix of database tables has been created. For example 2024_08 and 2024_09 could have been created with different collations. In this case the one of the tables has to be manually altered (or deleted and recreated by invalidation and rearchiving), so all tables have the same collation.

Fixes #22536 Refs DEV-18459

Review

sgiehl commented 2 months ago

@mneudert I'm wondering if we also should introduce a way to ensure that all tables are created with the same collation (like we do for TiDb). At the moment a tables collation is also based on the default collation of the database, not sure if that could cause problems as well at some point, if the default collation changes 🤔

mneudert commented 2 months ago

Good point @sgiehl.

I just checked how MariaDB behaves, and found that the current proposed fix is not enough :(

As we create table using DEFAULT CHARSET (without a COLLATE value), until now I would have expected the database default to be used. And that is a value that should not be changed by any database upgrade. But, as I found out, the creation in our case is still using the "unexpected" utf8mb4_uca1400_ai_ci collation for any new table.

So we indeed need to allow setting the collation in all create statements.

Should the config setting be renamed to the more generic collation, and be added to getTableCreateOptions?

sgiehl commented 2 months ago

Should the config setting be renamed to the more generic collation, and be added to getTableCreateOptions?

Guess that makes most sense. And during installation we should detect the current default collation in order to set it to the config. Not sure if we should also do that during the update, to ensure every installation has a collation set in the end.

michalkleiner commented 2 months ago

The changes related to tables creation look good. Are there still any changes pending related to the installation and automatic config setup?

DaAwesomeP commented 2 months ago

Would appreciate a patch release for this soon! My archive tasks are consistently failing for now.