shoppingflux / shoppingfluxexport

Remove non-UTF8 characters #366

Open ghost opened 6 years ago

ghost commented 6 years ago

Expected behaviour

Non-UTF-8 characters shouldn't be exported in the XML feed.

Actual behaviour

Non-UTF-8 characters are exported, causing errors when the XML feed is parsed.

Steps to reproduce the behaviour

Add a non-UTF-8 character in short_description, such as

ghost commented 5 years ago

The most compatible solution seems to be replacing invalid (non-UTF-8) byte sequences. The following code does this:

    public function sanitizeXML($string)
    {
        // Replace with "?": control characters (except tab, LF and CR), an ASCII
        // character followed by stray continuation bytes, overlong 2-byte sequences,
        // 4-byte sequences (characters at or above U+10000), and malformed
        // 2- and 3-byte sequences.
        $string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]'.
         '|[\x00-\x7F][\x80-\xBF]+'.
         '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
         '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
         '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
         '?', $string);

        // Replace overlong 3-byte sequences and UTF-16 surrogates with "?".
        $string = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
         '|\xED[\xA0-\xBF][\x80-\xBF]/S', '?', $string);

        return $string;
    }

(source: https://magp.ie/2011/01/06/remove-non-utf8-characters-from-string-with-php/)
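
As a possible alternative, here is a minimal sketch that keeps valid 4-byte characters instead of replacing them. It assumes the feed is XML 1.0. The function name `stripInvalidXmlChars` is hypothetical and not part of the module. It first keeps only well-formed UTF-8 sequences, then drops the code points that XML 1.0 does not allow:

```php
<?php
// Hypothetical helper for illustration; not part of shoppingfluxexport.
function stripInvalidXmlChars(string $input): string
{
    // Step 1: keep only well-formed UTF-8 sequences (byte mode, no /u flag).
    // Overlong forms, UTF-16 surrogates and values above U+10FFFF never match,
    // so their bytes are skipped.
    $wellFormed = '/[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}/';
    preg_match_all($wellFormed, $input, $matches);
    $utf8 = implode('', $matches[0]);

    // Step 2: drop code points outside the XML 1.0 Char production
    // (tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF).
    $xmlSafe = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $utf8
    );

    return $xmlSafe ?? '';
}
```

Unlike `sanitizeXML`, this sketch removes offending bytes rather than substituting `?`, and it collects one match per character, so it may need tuning for very large descriptions.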

There is another solution that also removes 4-byte sequences, but it strips every control character, including line breaks, and every non-ASCII byte: preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);
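
To illustrate the side effect, here is a small sketch applying that pattern to a made-up sample string (the input is an assumption chosen for the demonstration). Both the accented character and the line break disappear:

```php
<?php
// "Café" encoded as UTF-8, followed by a line break and more text.
$input = "Caf\xC3\xA9\nLine 2";

// Removes every control byte (0x00-0x1F) and every byte above 0x7F.
$output = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $input);

var_dump($output); // string(9) "CafLine 2"
```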

ghost commented 5 years ago

The issue might be in the short description, long description, and meta description. We also had a case where an invalid UTF-8 sequence error was triggered for the following sequence: 0x1E 0x74 0x69 0x73. Our investigation concluded that the sequence is valid UTF-8 (not a 4-byte sequence) but that 0x1E isn't escaped properly.
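
A quick way to reproduce that diagnosis is sketched below. It checks the reported bytes for UTF-8 validity and, separately, for control characters that the XML 1.0 character set excludes. Note that XML 1.0 does not accept U+001E even as a numeric character reference, so removing the character is likely the only option for such a feed:

```php
<?php
$bytes = "\x1E\x74\x69\x73"; // 0x1E followed by "tis"

// An empty pattern with the /u flag matches only if the subject is valid UTF-8.
$isValidUtf8 = preg_match('//u', $bytes) === 1;

// Control characters that the XML 1.0 Char production excludes.
$hasForbiddenControl = preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $bytes) === 1;

var_dump($isValidUtf8);         // bool(true)
var_dump($hasForbiddenControl); // bool(true)
```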

BarbUk commented 5 years ago

FYI ♫ is only 3 bytes. Use ┻ for a 4 bytes sequence.
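
Byte lengths can be checked with `strlen`, which counts bytes rather than characters. In the sketch below, U+1F3B5 is an assumed stand-in for a 4-byte character (any code point at or above U+10000 would do). The box-drawing character as it appears in this copy of the thread, U+253B, also measures 3 bytes:

```php
<?php
$note = "\u{266B}";    // ♫, encoded as E2 99 AB
$box = "\u{253B}";     // ┻, encoded as E2 94 BB
$astral = "\u{1F3B5}"; // assumed example, encoded as F0 9F 8E B5

var_dump(strlen($note));   // int(3)
var_dump(strlen($box));    // int(3)
var_dump(strlen($astral)); // int(4)
```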