matomo-org / matomo

Empowering People Ethically with the leading open source alternative to Google Analytics that gives you full control over your data. Matomo lets you easily collect data from websites & apps and visualise this data and extract insights. Privacy is built-in. Liberating Web Analytics. Star us on Github? +1. And we love Pull Requests!
https://matomo.org/
GNU General Public License v3.0

The string to escape is not a valid UTF-8 string in "@CoreHome/getDefaultIndexView.twig". #4410

Closed: mattab closed this issue 8 years ago

mattab commented 10 years ago

Reported in the forum: http://forum.piwik.org/read.php?2,108645

The solution proposed there (http://forum.piwik.org/read.php?2,108645,page=1#msg-108949) was to set:

[database]
charset = utf8

tsteur commented 8 years ago

I can't see any double encoding anywhere. I also compared the hex codes etc., and the character ä looks fine. Does it also occur with other comparisons such as "equals", "does not contain" (enthält nicht), etc.? As a test, maybe you could also try removing the |e('html_attr') from https://github.com/piwik/piwik/issues/4410#issuecomment-208555027 and see if it works?

It would be really great if someone could offer access to a server to debug this issue.

garvinhicking commented 8 years ago

@tsteur Ok I solved it:

In SegmentFormatter.php, the getTranslationForcomparison() method lowercases "Enthält" to "enthält" with PHP's "strtolower" function, which is by default* not UTF-8 safe. It works byte by byte, so it can mangle the multibyte "ä" as if the string were ISO-8859-1 and return something that is no longer valid UTF-8. To do this properly, the mb_strtolower function should be used.
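The failure mode described above can be reproduced in isolation. The sketch below is only an illustration, not Piwik code: simulateLatin1Strtolower() is a hypothetical helper that mimics what a byte-wise, ISO-8859-1 style lowercasing does, since the behaviour of the real strtolower() depends on the PHP version and the active locale. It assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not part of PHP or Piwik): lowercases every byte as if
// the input were ISO-8859-1, which is what a locale-aware, byte-wise
// strtolower() can end up doing on older PHP versions.
function simulateLatin1Strtolower(string $bytes): string
{
    $out = '';
    for ($i = 0; $i < strlen($bytes); $i++) {
        $b = ord($bytes[$i]);
        $isAsciiUpper  = $b >= 0x41 && $b <= 0x5A;                 // A-Z
        $isLatin1Upper = $b >= 0xC0 && $b <= 0xDE && $b !== 0xD7;  // À-Þ, minus ×
        $out .= chr(($isAsciiUpper || $isLatin1Upper) ? $b + 0x20 : $b);
    }
    return $out;
}

// "Enthält" as UTF-8: the "ä" is the two-byte sequence 0xC3 0xA4.
$translation = "Enth\xC3\xA4lt";

// Byte-wise lowercasing turns the lead byte 0xC3 into 0xE3 and breaks the sequence.
$broken = simulateLatin1Strtolower($translation);
echo bin2hex($broken), "\n";                      // 656e7468e3a46c74
var_dump(mb_check_encoding($broken, 'UTF-8'));    // bool(false)

// Multibyte-aware lowercasing leaves the "ä" intact.
$safe = mb_strtolower($translation, 'UTF-8');
var_dump(mb_check_encoding($safe, 'UTF-8'));      // bool(true)
```

A string like $broken is exactly the kind of input that makes Twig's escaper reject the value as not being valid UTF-8.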

The easiest, proper patch would thus be to replace this line in plugins/SegmentEditor/SegmentFormatter.php:

return strtolower($translation);

with:

return mb_strtolower($translation, 'UTF-8');

Note that I did hardcode "UTF-8" there, because I believe Piwik internally always works with UTF-8. If not, the proper way would be to call mb_internal_encoding(charset) once in some bootstrap code. But grepping through Piwik's code, it seems that whenever mb_* functions are used, the UTF-8 charset is hardcoded in the function call.
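For comparison, the bootstrap alternative mentioned above might look roughly like this. It is a minimal sketch, assuming the mbstring extension is available; where such a call would live in Piwik's bootstrap is not specified in this thread.

```php
<?php
// Set the default encoding for mb_* functions once, early in the bootstrap.
mb_internal_encoding('UTF-8');

// Later calls can then omit the explicit charset argument.
$translation = "Enth\xC3\xA4lt";          // "Enthält" as UTF-8 bytes
$lower = mb_strtolower($translation);     // falls back to the internal encoding

var_dump($lower === "enth\xC3\xA4lt");    // bool(true)
```

Hardcoding 'UTF-8' in each call, as in the patch above, has the advantage of not depending on any global state that other code could change.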

Also note that I see quite a lot of instances of "strtolower" in other files. You might want to check whether those could receive non-ASCII (multibyte UTF-8) characters, and use mb_* everywhere, so that the same problem will not show up elsewhere.
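One possible way to collect those call sites for review is a small scan script. This is only a sketch: findPlainStrtolower() is a hypothetical helper, the regular expression is a heuristic (it also matches method calls named strtolower and text inside comments or strings), and each hit still needs a manual look.

```php
<?php
// Hypothetical helper: returns "path:line" for every plain strtolower( call
// found in *.php files below $dir.
function findPlainStrtolower(string $dir): array
{
    $hits = [];
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($files as $file) {
        if (!$file->isFile() || $file->getExtension() !== 'php') {
            continue;
        }
        foreach (file($file->getPathname()) as $i => $line) {
            // The lookbehind skips mb_strtolower( and any other identifier
            // that merely ends in "strtolower".
            if (preg_match('/(?<![A-Za-z0-9_])strtolower\s*\(/', $line)) {
                $hits[] = $file->getPathname() . ':' . ($i + 1);
            }
        }
    }
    return $hits;
}
```

It could be run as, for example, print_r(findPlainStrtolower('plugins')); from the root of a checkout, where 'plugins' is just an example directory.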

HTH.

(*: Unless mbstring.func_overload is set in php.ini to override strtolower with mb_strtolower, which one should not rely on. That is quite likely the reason why not every Piwik user with a language containing non-ASCII UTF-8 characters is affected by this issue.)

schwindelbub commented 8 years ago

@tsteur this solved the problem! Thanks! And, except for the hardcoded UTF-8, it looks like a good solution.

tsteur commented 8 years ago

@garvinhicking Awesome! 👍 Thanks so much! I will issue a PR. Should have seen it while looking at the code but didn't notice.

garvinhicking commented 8 years ago

Glad you approve. In hindsight it always looks obvious. Doesn't matter, now we can move forward. Thanks for listening and advising. :-)