Synchronize Import turns accent text into gibberish characters

tikyon commented 1 year ago

I've notice during the synchronization, some text like Ta'an Kwäch'än Council would turn gibberish Ta'an KwÃ¤ch'Ã¤n Council.

jegelstaff commented 1 year ago

Ah yes! The horrible issue of character sets. The latest code has fixes for this. In particular, commit 1429c6852 on Nov 30. If you're already using that code, then perhaps there's more to the story.

jegelstaff commented 1 year ago

Note that the export and import have to be done with the new code. The key thing is that everything be encoded in utf8mb4 which supports everything, all accents, characters, even emojis.

jegelstaff commented 1 year ago

A good summary of why this is so horrible: https://kunststube.net/encoding/

tikyon commented 1 year ago

I am using same code as https://github.com/jegelstaff/formulize/commit/1429c6852d911c67b93739a5d94d7a983087d718

jegelstaff commented 1 year ago

Hmmmm. I hate character encoding. Maybe we can make some time next week to examine this together?

jegelstaff commented 1 year ago

Ohhhh, could also require the database tables to be encoded with utf8mb4. Or the collation of the tables. Or something. If the table is not setup to store the info, you can encode it all you want in the query, but the table with garble it.

jegelstaff commented 1 year ago

Even the fields, you could have a utf8mb4 table. But a field in the table that is encoded differently. It's all a conspiracy to drive us mad.

tikyon commented 1 year ago

For function cleanEncoding() in file modules/formulize/include/synccompare.php, I think you have a typo where $value is used instead of $text on line 169?

    if(mb_detect_encoding($value,'Windows-1252,ISO-8859-1')) {

I tried changing line 169 to if (mb_detect_encoding($text,'UTF-8,Windows-1252,ISO-8859-1') != 'UTF-8') and it stopped converting the example text above.

jegelstaff commented 1 year ago

Wow. Yup, that's a typo.

Curious though, because I could have sworn this did work as it was, on a site with accented characters. I will verify this next time I am doing a migration there.

However, I would not include UTF-8 in the list of encodings you're trying to detect. The function mb_detect_encoding will return boolean false if none of the declared encodings are detected, and you only want to carry on with the conversion if the encoding is one of the other two. So the original line, just with $text instead of $value should be correct.

Now, if you find that it doesn't work unless you have it written the way you have it above, then that suggests that you have some other strange encoding going on, and not Windows-1252 or ISO-8859-1, and we should get to the bottom of what is really going on and what coding is in effect in that case.

I have made a commit with the typo corrected: 5196227db

And the most recent code in the Master branch has a couple other small fixes too, related to conditional "text for display" elements, and also for when radio buttons are used to trigger conditional elements.

tikyon commented 1 year ago

For the example above, mb_detect_encoding("Ta'an Kwäch'än Council",'Windows-1252,ISO-8859-1') would evaluate to 'Windows-1252'. So then it will do the conversion in the next line giving the "Ta'an KwÃ¤ch'Ã¤n Council".

The current collation used for the tables are utf8_general_ci.

jegelstaff commented 1 year ago

Oh interesting. So if UTF-8 is not given as an option, then you get a false positive of the closest encoding it can figure out? Huh.

Okay, so yes, then looking for UTF-8 makes sense, but I would modify the condition slightly to:

$encoding = mb_detect_encoding($text,'UTF-8,Windows-1252,ISO-8859-1'); if($encoding AND $encoding != 'UTF-8') {

What do you think?

tikyon commented 1 year ago

Yes, your suggestion works for me.

jegelstaff commented 1 year ago

Done. It's committed now. And I checked the docs on the function more, and your original line is fine as is, because when not in "strict" mode (which requires a third parameter set as TRUE), one of the declared encodings will always be returned, whichever matches most closely. So no need to worry about a FALSE return value.

This raises the interesting question of whether we should do a strict match or not. If the text sort of matches Windows-1252 but not really, do we still want to do a conversion to UTF-8? And possibly garble the text?

I think the chances of a mangled, undetectable encoding are low. And if there is a non-standard encoding, then almost certainly the text will not be what we want it to be anyway because it was screwed up already? So I'm content to leave it as is and not do a strict comparison.

jegelstaff / formulize

Synchronize Import turns accent text into gibberish characters #344