t1gor / Robots.txt-Parser-Class

PHP class for parsing robots.txt
MIT License

UTF-8 content parsed incorrectly #71

Closed by JanPetterMG 3 years ago

JanPetterMG commented 8 years ago

The robots.txt content is always converted to UTF-8, but the mb_* functions expect whatever encoding the user thinks the file uses.

This results in valid UTF-8 robots.txt files being parsed with the wrong encoding, which in turn causes valid rules to be lost.

In other words, it's Russian roulette whether perfectly valid rules are parsed correctly...

mb_internal_encoding("utf-8");          // the caller's global setting
new RobotsTxtParser('', "iso-8859-1");  // the constructor overrides that global setting
var_dump(mb_internal_encoding());       // string(10) "ISO-8859-1"
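
One possible way around this, sketched here only as an illustration and not as the library's actual fix: convert the content to UTF-8 once, then pass 'UTF-8' explicitly to every mb_* call instead of changing the global internal encoding. The helper name `toUtf8` and the sample rule are assumptions made up for this example.

```php
<?php
// Hypothetical helper, not part of RobotsTxtParser: normalise the raw
// content to UTF-8 using the encoding the caller declared.
function toUtf8(string $content, string $sourceEncoding): string
{
    return mb_convert_encoding($content, 'UTF-8', $sourceEncoding);
}

// Remember the caller's global setting so we can show it stays untouched.
$before = mb_internal_encoding();

// "Disallow: /café" as ISO-8859-1 bytes (0xE9 is "é" in that encoding).
$raw  = "Disallow: /caf\xE9";
$utf8 = toUtf8($raw, 'ISO-8859-1');

// Passing the encoding explicitly makes the result independent of
// mb_internal_encoding(), so no global state has to be modified.
$length = mb_strlen($utf8, 'UTF-8');
$path   = mb_substr($utf8, 10, null, 'UTF-8');

var_dump($length);                            // int(15)
var_dump($path);                              // string(6) "/café"
var_dump(mb_internal_encoding() === $before); // bool(true)
```

With this approach the encoding argument given by the user is only used for the initial conversion, and the parser itself never needs to call mb_internal_encoding().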