neitanod / forceutf8

PHP Class Encoding featuring popular Encoding::toUTF8() function --formerly known as forceUTF8()-- that fixes mixed encoded strings.

Doesn't work with some languages, for example Czech #91

Closed lianglee closed 3 years ago

lianglee commented 3 years ago

Example text:

Dle prohlášení Novavax bude tato továrna schopna produkovat až 1 miliardu vakcín ročně již od října. V srpnu spolu s velvyslancem USA navštívíl Andrej Babiš tuto továrnu na výrobu vakcín. Otázkou zatím zůstává, proč se toho Andrej Babiš zúčastnil, pravděpodobně (spekulace) má jeho Agrofert nějakou spojitost buď s dodávkami nebo subdodávkami.

...::fixUTF8($text)

The output contains question marks in place of some characters. For example, ž is returned as a question mark (?).

garrettw commented 3 years ago

What encoding is the text? This library only works with Latin1 (ISO 8859-1), Windows-1252, and UTF8, as the readme says.
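To make that limitation concrete, here is a minimal sketch that lists which characters of a UTF-8 string have no Windows-1252 equivalent. It relies on PHP's mbstring extension (PHP 7.4+ for `mb_str_split`) rather than on this library, and `charsOutsideWindows1252` is a hypothetical helper name, not part of forceutf8. If a repair step round-trips text through a single-byte encoding (an assumption about internals, not something confirmed here), characters like these are the ones at risk.

```php
<?php
// Hypothetical helper (not part of forceutf8). Requires the mbstring extension.
// Returns the distinct characters of a UTF-8 string that cannot be represented
// in Windows-1252.
function charsOutsideWindows1252(string $utf8): array
{
    $missing = [];
    foreach (mb_str_split($utf8, 1, 'UTF-8') as $char) {
        // Round-trip each character through Windows-1252. Unmappable
        // characters are substituted during the first conversion, so they
        // do not come back unchanged.
        $legacy = mb_convert_encoding($char, 'Windows-1252', 'UTF-8');
        $back   = mb_convert_encoding($legacy, 'UTF-8', 'Windows-1252');
        if ($back !== $char) {
            $missing[] = $char;
        }
    }
    return array_values(array_unique($missing));
}

// Assumes this source file is saved as UTF-8.
print_r(charsOutsideWindows1252('áíýšžčěřůúď'));
```

For the Czech sample this should report č, ě, ř, ů and ď, while á, í, ý, š, ž and ú all exist in Windows-1252.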

lianglee commented 3 years ago

@garrettw the string seems to be UTF-8:

var_dump(mb_detect_encoding($string)); // string(5) "UTF-8"

One other observation: if I use fixUTF8, the string comes back with question marks, but running toUTF8 on the same UTF-8 string gives the correct result. So how do we know which one to use? In other words, how can we tell whether a string is garbled, given that the strings are dynamic?

ThibautSF commented 3 years ago

Maybe try to use iconv.

With the iconv translit parameter, fixUTF8 returns a string without question marks, but some characters are replaced by others, e.g. č => c or ř => r:

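use \ForceUTF8\Encoding;

$str = 'áíýšžčěřůúď';
// ICONV_TRANSLIT asks fixUTF8 to transliterate instead of emitting "?"
$strf = Encoding::fixUTF8($str, Encoding::ICONV_TRANSLIT);
var_dump($str);  // string(22) "áíýšžčěřůúď"
var_dump($strf); // string(17) "áíýšžceruúd" (č, ě, ř, ů, ď lost their diacritics)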

Note: this is only a workaround, not a solution, because some data is lost.
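On the earlier question of how to decide whether a dynamic string needs conversion at all, one possible guard is to leave well-formed UTF-8 untouched and convert only input that fails validation. The sketch below uses mbstring only; `ensureUtf8` and its `$fallbackEncoding` parameter are hypothetical names, not part of forceutf8, and the Windows-1252 fallback is an assumption about where the legacy data comes from.

```php
<?php
// Hypothetical helper (not part of forceutf8). Requires the mbstring extension.
function ensureUtf8(string $input, string $fallbackEncoding = 'Windows-1252'): string
{
    // Well-formed UTF-8 is returned as-is, so Czech text is never
    // pushed through a lossy single-byte conversion.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    // Otherwise ASSUME the bytes are in the fallback encoding and convert them.
    return mb_convert_encoding($input, 'UTF-8', $fallbackEncoding);
}

// Assumes this source file is saved as UTF-8.
$czech = 'továrna na výrobu vakcín ročně';
var_dump(ensureUtf8($czech) === $czech);   // already valid UTF-8, left unchanged

$legacy = "caf\xE9";                       // "café" as Windows-1252 bytes
var_dump(ensureUtf8($legacy) === 'café');  // converted to UTF-8
```

This check cannot recognise double-encoded text (for example "Ã©" where "é" was meant), because such text is itself valid UTF-8. That appears to be the case fixUTF8 is aimed at, so the guard is a complement to it rather than a replacement.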

lianglee commented 3 years ago

thanks!