Set table collation to match core tables

colemanw commented 7 months ago

Overview

Fixes inconsistent collation between extension tables (including core extensions like SearchKit) and core tables. See discussion at https://chat.civicrm.org/civicrm/pl/g3jt55akb3rk8dzgtqcr3bjjqy

Before

All core tables are ENGINE=InnoDB DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC;
All extension tables are ENGINE=InnoDB

After

All core and extension tables are ENGINE=InnoDB DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC;

colemanw commented 7 months ago

@totten it seems that this humble patch has sparked a much bigger discussion full of utopian dreams and aspirations... but meanwhile this is still causing problems in the real world. For example, this latest report on the symptoms.

So while I really do like the direction of the utopian discussions, how about we merge this in the meantime.

totten commented 7 months ago

@colemanw OK, I want this to makes sense. On general principle, the thing to consider is whether this will be bouncy (fix one environment, break another). My best rationalization in favor it is:

The current rules for character-set/collation in civicrm-core originated in 5.33 (https://github.com/civicrm/civicrm-core/commit/828c69155ec55d6abf50db8a8fc0f92e838faacf#diff-f73a612034e9487c9dcca2569aeb7f42ce8e03a8932fbf51d4a5ad90e14c388eL5).
Since civix moved to <upgrader> tag, the generated entities have required civicrm-core@5.38+.
So if you're generating ext-SQL, then you're running on 5.38+, and the collations match... right?

But looking at the change from 5.33... and looking at current upgrader/status-checks... it doesn't look like we ever fixed the character-sets/collations of databases on historical data. If you originally installed CiviCRM 4.6 or 5.0 or 5.10 (or anything up to 5.33), then all your tables were initialized with utf8/utf8_unicode_ci, and nothing ever prompted you change. So if you install an extension that forces utf8mb4/utf8mb4_unicode_ci, then the schema will be mismatched. Right?

Doesn't this just shift the bad-schema-situation to different users?

(cc @seamuslee001 b/c I think you all followed the history of utf8(mb4) more)

Aside: I think I have a better way to maintain/hook-into the CRM_*_Upgrader and provide auto-generation. Pushing on that more today...

colemanw commented 7 months ago

Doesn't this just shift the bad-schema-situation to different users?

Well, I think that ship might have already sailed. Because if we wanted to avoid forcing the collation when we add new tables, then why did we do this?

https://github.com/civicrm/civicrm-core/blob/b11418429c627738ffaa9f2face1d6c68af3bd61/CRM/Upgrade/Incremental/sql/5.47.alpha1.mysql.tpl#L11

And this?

https://github.com/civicrm/civicrm-core/blob/3a3d7c89e1f4d9d45fb21a6a9bf769c4d4a361da/CRM/Upgrade/Incremental/sql/5.50.alpha1.mysql.tpl#L20

mattwire commented 7 months ago

@colemanw @totten I remember a time when we had specified UTF8 in lots of extensions and it caused major problems when switching systems to utf8mb4 (because the SQL forced the extension to UTF8 instead of using the database default which was set to utf8mb4). So the decision at the time was to specifically exclude the database collation from the SQL and rely on you setting the default collation on the database once only. That allows for an easy future change of collation and also makes it easier if someone decides they need to run with another collation. I think there may have been some issues with forcing ROW_FORMAT as well. @artfulrobot I seem to remember you having some thoughts around this.

colemanw commented 7 months ago

@mattwire you're correct about the history, but that has left us in a place of sometimes relying on db default and sometimes not. After a long discussion about it on ~dev this is what we've come to:

Civi core always forces collation when creating new tables, even when adding them to existing sites. and civix-generated extensions do not. For some sites this inconsistency causes serious problems. So 3 steps toward the ideal fix of both consistency and configurability would be:

Stop the bleeding (this patch).
Consistently generate install sql from a centralized place in both core and extensions.
Stop generating sql files on disk. Generate & run sql in-memory during install/upgrade/uninstall so we never hard-code collations. This opens the door for a configurable setting for db encoding/collation (which could be a specific value like core does, or "none" to respect db default like extensions do). See this project which is making good progress toward that goal.

mattwire commented 7 months ago

I'm all in favour of 2 and 3. Just not sure that 1. is a good idea. I have over 100 extensions on my local dev environment and none of them specify database charset/collation in their SQL except some really old ones that specify utf8. I'd rather not have to go through a round of updating sql files only to be replaced with 2 and 3 and potentially require another round of updates. I agree this is sometimes a problem and it's one of the checks I always do when taking on an existing CiviCRM site. I check and update all tables to utf8mb4_unicode and make sure that is the database default. Note I've had problems before with a mixture of utf8mb4_general and utf8mb4_unicode as drupal seems to prefer the first one.

mattwire commented 7 months ago

An aside to this, and probably more important as it's certainly tripped me up a few times is when extensions have tables that don't start with civicrm_ -eg. cividiscount_ and civirule_ as these tend to get ignored by things like System.utf8conversion and detailed logging.

colemanw commented 7 months ago

Cool.

I guess I didn't see this PR so much as "require a bunch of extra work from ext maintainers" as "prevent extra work or regressions" because at this point we've patched all core extensions to be consistent with core but if we don't merge this update and someone reruns civix generate:entity-boilerplate then those fixes will be automatically undone.

I think it's otherwise fine for you to sit on your 100 extensions and not update them until the full fix is ready.

artfulrobot commented 7 months ago

I have this note to self:

Note never use utf8mb4_general_ci, prefer utf8mb4_unicode_ci. If there's one utf8mb4_0900_ai_ci then even better because it is (a) a UNICODE standard algorithm, and (b) it includes accent insensitivity, which is useful in the same way that case insensitivity is. (But I say this as an English speaker who rarely uses accents, can't find them on the keyboard sometimes, so this might be a narrow viewpoint).

As for ROW_FORMAT I'm OK with that as a default, as long as we're allowed to change it. I save megagigapetaexa bytes by choosing COMPRESS, for example, on tables like Activities which has rare updates and lots of additions with similar data. It makes searching Activities table much faster too as the db can load more rows/page. I discovered/developed this change when a particularly busy site crashed the db through an extension that was creating excessive activities - a disk-full crash on a database is not a nice thing to recover from.

colemanw commented 7 months ago

when extensions have tables that don't start with civicrm -eg. cividiscount and civirule_ as these tend to get ignored

Gawd why do we bother maintaining a complete list of all dao tables when stuff like that just ignores it in favor of string-fu?

colemanw commented 7 months ago

@artfulrobot I save megagigapetaexa bytes

Wow that's a lot! By my calculations that's 10^6 * 10^9 * 10^15 * 10^18! 🤓

But yea we should take into account that some super-smart admins might want to change row format and not be too restrictive about that. Or if we were super-smart maybe we could come up with some better defaults for that setting per-table.

mattwire commented 7 months ago

when extensions have tables that don't start with civicrm -eg. cividiscount and civirule_ as these tend to get ignored

Gawd why do we bother maintaining a complete list of all dao tables when stuff like that just ignores it in favor of string-fu?

Yeah.. then we have things like CRM_Core_DAO::getTableNames() that just returns every table that starts with civicrm_ and I think that AllCoreTables only returns tables linked to an entity - so possibly including non standard names?

artfulrobot commented 7 months ago

Wow that's a lot! By my calculations that's 10^6 * 10^9 * 10^15 * 10^18! 🤓

And yet it's still a tiny number compared to the possible moves in a game of Go - although to be fair, that number is calculated to exceed the number of atoms in the universe, so... :laughing:

totten / civix