vexim / vexim2

Virtual Exim 2
Other
71 stars 47 forks source link

Migrate MySQL database to `utf8mb4` (or ASCII where appropriate) #242

Open runout-at opened 7 years ago

runout-at commented 7 years ago

With following subject line i get R=ditch_hdrsubject defer (-1): condition check lookup defer

Subject: =?utf-8?Q?Fwd:_=E2=9C=94_Viel_Spa=C3=9F_beim_Erleben!?=

As far as i can interpret this issue, there is a problem with the lookup in the database with the UTF8 char =E2=9C=94 in the file /etc/exim4/conf.d/router/249_vexim_ditch_routers

For now i put condition = no in this router but this is not a solution.

Maybe this problem can also arise on other DB-lookups. I'm not sure if this can be related to the Mysql/Mariadb-UTF8 bug and a solution could be to move the database to UTF8mb4 encoding. As i have a lot of DBs on my installation it would be hard to change the charset.

rimas-kudelis commented 7 years ago

Hi! Could you try running exim4 -d -v your@email.address as root, and then submitting a simple message to the stdin, something like this:

From: nobody@example.org
To: your@email.address
Subject: =?utf-8?Q?Fwd:_=E2=9C=94_Viel_Spa=C3=9F_beim_Erleben!?=

Test email

And press Ctrl+D to signal end of input.

I wonder what Exim would tell you on the standard output, in the --------> ditch_hdrsubject router <-------- section. For me, I don't see anything suspicious, but I think I remember MySQL becoming more picky about character sets lately, so it might be that. Either way, it would be nice to see the actual MySQL error code or message to debug this properly.

For the reference, my database (and particularly the tables affected) is utf8, not utf8mb4.

runout-at commented 7 years ago

Sorry for my delay, have almost no internet until mid of september and i don't want to mess around with a production machine in this (remote) situation.

runout-at commented 6 years ago

This seems to happen in ditch_hdrto too: To: =?UTF-8?Q?Gy=c5=91XXX_XXXXX?= XXXXX.XXXXX@XXXXX.XX

i hope to find time in the next days to look deeper into this.

Udera commented 6 years ago

I think I tested on my setup with utf8_general_ci and didn't have a problem. Can you give more details about your system in order to reproduce this problem?

runout-at commented 6 years ago

latin1_swedish_ci ... it seems that has never changed since the first install of vexim - a long time ago. Most databases i have converted years ago and i don't know why not the vexim-db. I should do this anyways... thx for the hint - i will try.

Udera commented 6 years ago

There was a migration step for the database: https://github.com/vexim/vexim2/blob/master/setup/migrations/vexim_2.2_to_2.3_mysql.sql

runout-at commented 6 years ago

It seems converting the database solved the issue. There were no such problems for 2 weeks now.

Thanks a lot for the hint. I did not do the whole upgrade. I close this issue for now.

gldickens3 commented 6 years ago

Hello runout-at, Rimas and Udera,

I just upgraded one of my Debian servers to Debian 9 Stretch and and which upgraded the MySQL installation to MariaDB 10.1. I am running Exim4U which has the same ditch_hdrsubject router as Vexim and I am seeing similar behavior to what runout-at reported in the first post to this thread. That is, my exim logs show the error: "R=ditch_hdrsubject defer (-1): condition check lookup defer" with UTF8 encoding in email Subjects such as:

Subject: =?utf-8?B?8J+Ru/CfkrAgTW9uc3Ryb3VzIGhvdGVsIGRpcw==?= =?utf-8?B?Y291bnRzOiB1cCB0byA4MCUgb2ZmIQ==?=

As an FYI, my database was set to utf8 with utf8_general_ci collation. The default MySQL/MariaDB setup installed by the OS upgrade has the following in /etc/mysql/mariadb.conf.d/50-server.cnf:

character-set-server = utf8mb4 collation-server = utf8mb4_general_ci

I ran a test with the following test email using "exim4 -d -v recipient@domain.com":

From: sender@senderdomain.com To: recipient@domain.com Subject: =?utf-8?B?8J+Ru/CfkrAgTW9uc3Ryb3VzIGhvdGVsIGRpcw==?= =?utf-8?B?Y291bnRzOiB1cCB0byA4MCUgb2ZmIQ==?=

This is a test!

The standard output for my test from from the ditch_hdrsubject router included the following:

MYSQL: query failed: Illegal mix of collations (utf8_general_ci,IMPLICIT) and (utf8mb4_general_ci,COERCIBLE) for operation '=' lookup deferred: MYSQL: query failed: Illegal mix of collations (utf8_general_ci,IMPLICIT) and (utf8mb4_general_ci,COERCIBLE) for operation '='

In any event, my first attempt to solve this problem was to convert the entire database to utf8mb4, however this failed because of the domain field length (255) to be indexed was too large since the MariaDB limit is 1000 bytes and.each character requires 4 bytes to index in utf8mb4. In other words, 4 * 255 = 1020 which is greater than 1000. So, I solved the problem by simply setting the blocklists table to utf8mb4 with utf8mb4_general_ci collation using:

ALTER TABLE blocklists CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

All other tables remain set to utf with utf_general_ci collation and this appears to completely solve the problem.

So, my question is: Do you guys think that its ok to have the data base set up with mixed character encoding where the blocklist is set to utf8mb4 and everything else is set to utf8? If not, then we will need to reduce the size of the domain field to 250 characters in order for the entire database to be set to utf8mb4 unless any of you guys have a better way to deal with this.

So far, I have only seen this on Debian 9 Stretch installations, however, I suspect that we will see this on other Unix/Linux distros whenever MariaDB is used.

I will be interested to hear what y'all think.

Thanks!

Gordon Dickens

runout-at commented 6 years ago

it would be possible to use just a part of the column for the index.

https://stackoverflow.com/questions/1814532/1071-specified-key-was-too-long-max-key-length-is-767-bytes

gldickens3 commented 6 years ago

Quoting from: https://lists.exim.org/lurker/message/20171109.141550.405b25b8.en.html

"Everyone who wants to use UTF-8 datatypes SHOULD use this character set (utf8mb4) to avoid nasty surprises."

rimas-kudelis commented 6 years ago

Huh, what a peculiar problem... Perhaps a solution to that would be to keep some of our columns in plain old ASCII? We don't have any magic for that at the moment, but I would expect host names to be always stored in ASCII form. I don't expect Exim, or any of its competitors to implement support for Unicode DNS names – it's just much more efficient from all points of view to store it in its expanded form once than to expand it every time it is needed. We could even convert these names between punycode and unicode where UI is involved, thus hiding this detail from the user. IMO, storing DNS name in ASCII would solve at least this particular issue.

runout-at commented 6 years ago

Changing all MySQL/MariaDBs to utf8mb4 is recommended anyways. The original issue of this thread has nothing todo what the user entered into the WebUI and which value is stored in the DB. The issue is that if a mail arrives and has some strange chars in the header fields (subject, from,...) then this will go into the DB-lookup-query in exim which causes the problem and we have no influence on what other people send to us.

Which chars we allow to be used in different input-fields is another story. I'm not happy with the actual RFC which allows all this strange characters and even some stupid syntax in localparts. Right now it is almost impossible to validate a localpart by regex.

rimas-kudelis commented 6 years ago

I might be wrong here, but my understanding is that we only get this error when the text in question (the decoded header text) may not be coerced to the collation used by the database (utf8 in this case).

For example, Gordon's quoted message has the following subject: "👻💰 Monstrous hotel discounts: up to 80% off!"

As you can see, there are a few emoji in the header. And I strongly suspect that emoji don't fit into the utf8. They do fit in utf8mb4 though, because that is the collation that can actually fit UTF-8 (at least as far as I know).

Your initial case was very similar: the e-mail you received had the subject "Fwd: ✔ Viel Spaß beim Erleben!", and the HEAVY CHECK MARK character in it could not be coerced to latin1 collation that the appropriate column of your database was using, so you got the error. After converting to UTF-8, your DB was able to handle that character properly, because it could then be successfully coerced to whatever collation you are using now.

Anyway, my point is that we should definitely use utf8mb4 where appropriate (such as the header value column), but for columns like DNS names, we can just as well use plain ASCII, because utf8mb4 gives us no advantage there (we have to store the punycode version of DNS name), whereas it does have the disadvantage of being too large to index. But for WEb UI, we should consider outputting and accepting DNS names in full Unicode, and converting them to/from punycode as necessary.

rimas-kudelis commented 6 years ago

In fact, I'm going to reopen this. Our current MySQL setup scripts use utf8 instead of utf8mb4, and that may cause issues, as Gordon demonstrated.

runout-at commented 6 years ago

I just migrated another DB from utf8 to utf8mb4 and did run into the specified-key-was-too-long-max-key-length-is-767-bytes problem. But i found an interesting solution to dump-modify-reimport a whole database.

The new server has following in the config:

default_storage_engine  = InnoDB
innodb_file_format      = Barracuda
innodb_file_format_max  = Barracuda
innodb_file_per_table   = 1
innodb_large_prefix     = ON

There is a new RAW_FORMAT=DYNAMIC. This will be default from Mariadb 10.2.2 on. But it is possible in earlier versions to use this. Tested with 10.1.26 (Debian Stretch).

Howto: Maybe the sed command could be shorter. It's just a proof of concept (created by try&error). I don't care if the string ' utf8 ' appears in a dumped row. This should be improofed.

First of all: BACKUP, BACKUP, BACKUP!

mysqldump -u <user> -p -h <oldmysql.server> -c -e --default-character-set=utf8mb4 --single-transaction --skip-set-charset --add-drop-database -B <databasename> > dump-utf8.sql
sed -e 's/DEFAULT CHARACTER SET utf8/DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci/' \
    -e 's/CHARACTER SET utf8 COLLATE utf8_unicode_ci/CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci/' \
    -e 's/COLLATE=utf8_unicode_ci/COLLATE=utf8mb4_unicode_ci/' \
    -e 's/DEFAULT CHARSET=utf8;/DEFAULT CHARSET=utf8mb4 ROW_FORMAT=DYNAMIC COLLATE=utf8mb4_unicode_ci;/' \
    -e 's/DEFAULT CHARSET=utf8 COLLATE/DEFAULT CHARSET=utf8mb4 ROW_FORMAT=DYNAMIC COLLATE/' \
    -e 's/DEFAULT CHARSET=utf8 COMMENT/DEFAULT CHARSET=utf8mb4 ROW_FORMAT=DYNAMIC COMMENT/' \
    -e 's/ DEFAULT CHARSET=utf8 / DEFAULT CHARSET=utf8mb4 /' \
    -e 's/utf8_/utf8mb4_/' \
    -e 's/utf8 /utf8mb4 /' \
    -e 's/ENGINE=MyISAM/ENGINE=InnoDB/' \
    dump-utf8.sql > dump-utf8mb4.sql
mysql -u <user> -p -v -h <newmysql.server> <dump-utf8mb4.sql
runout-at commented 6 years ago

Now i converted a vexim2 DB. It does not complain and seem to work :)

Udera commented 6 years ago

What exactly needs to be done?

I suppose for username, localpart, ... we do not allow emojis (if the username is not part of an email address we could technically allow it)?

runout-at commented 6 years ago

i would like to convert everything to UTF8mb4. It's not to much work and we do not to have to take care later on.

Converting all tables to InnoDB is very important. Maybe this is enough for the Index-length problem.

runout-at commented 6 years ago

It is important to set the ROW_FORMAT to DYNAMIC As i stated above this could be done by changing the CREATE TABLE statements or by altering existing tables.

MariaDB 10.2.2, MariaDB 10.1.32 and MySQL 5.7 are introducing a new config variable to set this system wide:

https://mariadb.com/kb/en/library/xtradbinnodb-server-system-variables/#innodb_default_row_format

innodb_default_row_format

    Description: Specifies the default row format to be used for InnoDB tables. The compressed row format cannot be set as the default.
    Commandline: --innodb-default-row-format=value
    Scope: Global
    Dynamic: Yes
    Data Type: enum
    Default Value: dynamic
    Valid Values: redundant, compact or dynamic
    Introduced: MariaDB 10.2.2, MariadB 10.1.32
runout-at commented 6 years ago

Documentation/Howto: https://github.com/vexim/vexim2/wiki/Mysql-ROW_FORMAT,-UTF8MB4

After merging the PR this issue could be closed.