Open ghost opened 6 years ago
The most compatible solution seems to be replacing invalid or overlong byte sequences (non-UTF-8). The following code does this:
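public function sanitizeXML($string)
{
    // Replace with "?": control characters not allowed in XML 1.0 (tab, LF and
    // CR are kept), an ASCII byte followed by stray continuation bytes (the ASCII
    // byte is replaced as well), overlong 2-byte sequences, lead bytes 0xF0 and
    // above (code points from U+10000 upwards), and 2- or 3-byte sequences with
    // the wrong number of continuation bytes.
    // The control range ends at \x1F (not \x19) so that 0x1A-0x1F are caught too.
    $string = preg_replace(
        '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]' .
        '|[\x00-\x7F][\x80-\xBF]+' .
        '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*' .
        '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})' .
        '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
        '?',
        $string
    );
    // Reject overlong 3-byte sequences and UTF-16 surrogates and replace with "?".
    $string = preg_replace(
        '/\xE0[\x80-\x9F][\x80-\xBF]' .
        '|\xED[\xA0-\xBF][\x80-\xBF]/S',
        '?',
        $string
    );
    return $string;
}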
(source: https://magp.ie/2011/01/06/remove-non-utf8-characters-from-string-with-php/)
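For comparison, here is a minimal sketch of the opposite approach: instead of listing what to reject, it keeps only byte sequences that are well-formed UTF-8 and allowed in XML 1.0. The function name keepXmlSafeUtf8 is hypothetical and not part of the module. Unlike the code above, it drops the offending bytes rather than replacing them with "?", and it keeps 4-byte characters. DEL (0x7F) is dropped as in the code above, although XML 1.0 technically allows it.

```php
<?php
// Hypothetical helper (not from the original code): returns only the parts of
// $bytes that are well-formed UTF-8 and permitted as XML 1.0 characters.
function keepXmlSafeUtf8(string $bytes): string
{
    // Byte-level pattern (no u flag), one alternative per valid sequence shape.
    $validSequence = '/
          [\x09\x0A\x0D\x20-\x7E]          # tab, LF, CR and printable ASCII
        | [\xC2-\xDF][\x80-\xBF]           # 2-byte sequences, no overlongs
        | \xE0[\xA0-\xBF][\x80-\xBF]       # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE][\x80-\xBF]{2}    # 3-byte, straight ranges
        | \xED[\x80-\x9F][\x80-\xBF]       # 3-byte, excluding UTF-16 surrogates
        | \xEF[\x80-\xBE][\x80-\xBF]       # 3-byte, up to U+FFBF
        | \xEF\xBF[\x80-\xBD]              # 3-byte, U+FFC0 to U+FFFD
        | \xF0[\x90-\xBF][\x80-\xBF]{2}    # 4-byte, planes 1 to 3
        | [\xF1-\xF3][\x80-\xBF]{3}        # 4-byte, planes 4 to 15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}    # 4-byte, plane 16
    /x';

    // Collect every valid sequence in order; anything in between is discarded.
    preg_match_all($validSequence, $bytes, $matches);

    return implode('', $matches[0]);
}
```

Because continuation bytes (0x80-0xBF) never start a match, leftovers of a broken sequence cannot be mistaken for a valid character.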
Another solution removes the 4-byte sequences by stripping every non-ASCII byte, but it also strips all control characters, including line breaks (EOL):
preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);
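The effect on line breaks can be checked with a short sketch. The sample string and the second pattern are illustrative assumptions, not part of the module. The variant only excludes tab (0x09), LF (0x0A) and CR (0x0D) from the stripped range.

```php
<?php
// Sample input: a CRLF line break, the note symbol (E2 99 AB) and an accented e (C3 A9).
$input = "Line one\r\nLine two \xE2\x99\xAB caf\xC3\xA9";

// Pattern quoted above: strips all control characters and all non-ASCII bytes.
$blunt = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $input);

// Hypothetical variant: same idea, but tab, LF and CR survive.
$keepEol = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x80-\xFF]/', '', $input);

var_dump($blunt);   // the CRLF is gone and both lines are joined
var_dump($keepEol); // the CRLF is preserved
```

Both patterns still remove every multi-byte character, so accented letters are lost along with the 4-byte sequences.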
The issue may affect the short description, long description and meta description fields.
We also had a case where an invalid UTF-8 sequence error was triggered for the following sequence: 0x1E 0x74 0x69 0x73. Our investigation concluded that the sequence is valid UTF-8 (not a 4-byte sequence) but that 0x1E, a control character, isn't escaped properly.
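The distinction can be sketched as follows: the reported bytes pass a UTF-8 validity check, yet 0x1E falls outside the characters XML 1.0 permits. XML 1.0 does not allow a numeric character reference to it either, so this sketch assumes stripping is acceptable. The variable names are illustrative only.

```php
<?php
// The reported sequence: 0x1E followed by "tis".
$bytes = "\x1E\x74\x69\x73";

// Valid UTF-8: an empty pattern with the u flag only matches if the subject
// is well-formed UTF-8. Every byte here is below 0x80, so it is.
$isValidUtf8 = preg_match('//u', $bytes) === 1;

// Not valid in XML 1.0: 0x1E is a C0 control character other than tab, LF or CR.
$xmlInvalidControls = '/[\x00-\x08\x0B\x0C\x0E-\x1F]/';
$hasInvalidControl = preg_match($xmlInvalidControls, $bytes) === 1;

// Assumed remedy: remove the control characters before writing the feed.
$clean = preg_replace($xmlInvalidControls, '', $bytes);

var_dump($isValidUtf8, $hasInvalidControl, $clean);
```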
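FYI, ♫ (U+266B) is only 3 bytes in UTF-8. Note that ┻ (U+253B) is also 3 bytes; a 4-byte sequence needs a code point above U+FFFF, for example 𝄞 (U+1D11E).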
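Byte lengths are easy to verify with strlen(), which counts bytes rather than characters. In this sketch, 𝄞 (U+1D11E) is an illustrative code point above U+FFFF and is not taken from the original report.

```php
<?php
// Code points written with PHP 7+ Unicode escapes, so the file encoding does not matter.
$samples = [
    'U+266B'  => "\u{266B}",   // ♫
    'U+253B'  => "\u{253B}",   // ┻
    'U+1D11E' => "\u{1D11E}",  // 𝄞, above U+FFFF
];

$lengths = [];
foreach ($samples as $codePoint => $char) {
    // strlen() returns the number of bytes in the UTF-8 encoding.
    $lengths[$codePoint] = strlen($char);
    echo $codePoint, ': ', $lengths[$codePoint], " bytes\n";
}
// Prints 3 bytes for U+266B and U+253B, and 4 bytes for U+1D11E.
```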
Expected behaviour
Non UTF8 characters shouldn't be exported in the XML feed.
Actual behaviour
Non UTF8 characters are exported, creating errors when interpreting the XML feed.
Steps to reproduce the behaviour
Add a non-UTF-8 character in short_description, such as ♫.