github / gh-ost

GitHub's Online Schema-migration Tool for MySQL
MIT License
12.43k stars 1.26k forks source link

UCS2 to UTF8MB4 Encoding error #1438

Open dwash96 opened 4 months ago

dwash96 commented 4 months ago

Not sure if this is expected for fixed-length to variable length charset conversions but for rows read from the binlog into the gho table during the migration, the utility does not do an equivalent character conversion, it just uses the bytes given with the new charset. I could see why this behavior might make sense (charsets may not necessarily have truly equivalent representations of the same displayed character?) and is a skill issue in my part but in case this is not expected behavior, the example I saw during my migration was as follows:

"M" in UCS2: 0x004D "M" in UTF8MB4: 0x4D

I am in the process of fixing it in my DB by running UPDATE statements in the format:

{column} = IF(CAST({column} AS BINARY) LIKE CONCAT(0x00, '%'), CONVERT(CONVERT(CAST({column} AS BINARY) USING UCS2), USING UTF8MB4), {column})

With some appropriate filters to not have to attempt to update every row of the new table of course. Is this something that gh-ost should handle itself? I think the format above allows specifically UCS2 to UTF8, if it were ran against the gho table rows upon insertion/update from the binlog but it's probably not portable outside this specific case.

Either way, this is a fantastic tool and I really appreciate that you all made this. Thanks