ivopetkov / html5-dom-document-php

A better HTML5 parser for PHP.
MIT License
599 stars 40 forks

Characters in non-Latin alphabets are encoded to Unicode #47

Closed Dan0sz closed 2 years ago

Dan0sz commented 3 years ago

When using this library on a site written in e.g. Greek, it seems to convert all Greek characters to Unicode (I think) in the source. Is there a way to prevent this from happening?

E.g. #tab-σημαντικές-πληροφορίες is converted to #tab-%CF%83%CE%B7%CE%BC%CE%B1%CE%BD%CF%84%CE%B9%CE%BA%CE%AD%CF%82-%CF%80%CE%BB%CE%B7%CF%81%CE%BF%CF%86%CE%BF%CF%81%CE%AF%CE%B5%CF%82

It might also be a PHP configuration issue? But I'm not sure.

Any assistance would be appreciated!
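For reference, the converted value in the example above looks like percent-encoded (URL-encoded) UTF-8 rather than a change of character set. A minimal sketch that decodes the fragment from the report; the variable names are illustrative only:

```php
<?php
// The encoded fragment exactly as quoted in the report above.
$encoded = '#tab-%CF%83%CE%B7%CE%BC%CE%B1%CE%BD%CF%84%CE%B9%CE%BA%CE%AD%CF%82-%CF%80%CE%BB%CE%B7%CF%81%CE%BF%CF%86%CE%BF%CF%81%CE%AF%CE%B5%CF%82';

// rawurldecode() turns each %XX sequence back into the byte it stands for;
// the resulting byte sequence is the original UTF-8 text.
$decoded = rawurldecode($encoded);

echo $decoded, "\n"; // #tab-σημαντικές-πληροφορίες
```

Because the bytes decode back to the original Greek text, no information is lost; only the representation inside the attribute changes.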

ivopetkov commented 3 years ago

I've just tested this on PHP 8.0 and it works fine (no conversion/encoding). Send me some sample code to test, if you want.

Dan0sz commented 3 years ago

Sorry, didn't get a notification about your reply.

I'm not sure what kind of sample code you're looking for, but it's occurring on e.g. this page: https://testwp.uplab.gr/%CE%B5%CE%BD%CF%84%CE%BF%CE%BC%CE%BF%CE%B1%CF%80%CF%89%CE%B8%CE%B7%CF%84%CE%B9%CE%BA%CE%AC/esquito-lotion/

I suppose you could copy the HTML of this page and test again?

When inspecting the middle tab underneath the main product image, you'll notice the encoding:

[image]

[image]

Dan0sz commented 2 years ago

Hi, I haven't heard back from you. Could it have something to do with the fact that it's the href attribute of an a element?

ivopetkov commented 2 years ago

Yep, it looks like the href attribute of an a element behaves differently from other attributes (class, for example). My library uses PHP's native DOMDocument internally, and DOMDocument is what makes this change.

Do you know why this might happen?
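A small way to check this without the library is to compare href and class on the same element using DOMDocument directly. This is only a sketch: the sample markup is made up, and whether href comes back percent-encoded is assumed to depend on the PHP and libxml2 versions installed, so no expected output is shown.

```php
<?php
// Made-up sample markup: the same Greek text in href and in class.
$html = '<a href="#tab-σημαντικές" class="σημαντικές">tab</a>';

$dom = new DOMDocument();
// Prefixing an XML encoding declaration is a common trick to make
// loadHTML() read the input as UTF-8.
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);

$link = $dom->getElementsByTagName('a')->item(0);

// Reading the attribute through the DOM returns the parsed value.
echo $link->getAttribute('href'), "\n";

// Serializing the node is the step where href may be percent-encoded
// while class is left alone.
$serialized = $dom->saveHTML($link);
echo $serialized, "\n";
```

If the two lines differ only in the href value, the change happens during serialization and not during parsing.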

Dan0sz commented 2 years ago

Hi,

Thanks for your help. I've resolved the issue by filtering the HTML before returning it. Basically, I use a regex to capture the URLs in href attributes and run urldecode on each of them before returning the HTML.

    // Collect the value of every href attribute in the HTML.
    preg_match_all('/href=[\'"](?<url>.*?)[\'"]/', $html, $urls);

    if (empty($urls['url'])) {
        return $html;
    }

    // Initialise both lists so str_replace() always receives arrays.
    $search  = [];
    $replace = [];

    foreach ($urls['url'] as $url) {
        if (strpos($url, '%') === false) { // Strict comparison, because a match at position 0 is also falsy.
            continue;
        }

        $search[]  = $url;
        $replace[] = urldecode($url);
    }

    return str_replace($search, $replace, $html);

I'll close this. :-)
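The same idea can also be sketched as a single function that rewrites each href in place. The function name decode_href_urls is hypothetical and not part of html5-dom-document-php, and the regex assumes quoted href values, as in the workaround above.

```php
<?php
// Hypothetical helper; not part of html5-dom-document-php.
function decode_href_urls(string $html): string
{
    $result = preg_replace_callback(
        // Group 1 is the opening quote, group 2 the URL; \1 matches the same quote again.
        '/href=([\'"])(.*?)\1/',
        static function (array $m): string {
            // strpos() returns false when there is no '%'; position 0 is a valid match.
            if (strpos($m[2], '%') === false) {
                return $m[0];
            }
            return 'href=' . $m[1] . urldecode($m[2]) . $m[1];
        },
        $html
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

echo decode_href_urls('<a href="%C3%96kologie.html">Ökologie</a>'), "\n";
// <a href="Ökologie.html">Ökologie</a>
```

Two caveats apply to this kind of filter. urldecode() also turns + into a space, so rawurldecode() may be the safer choice for some URLs. Decoding an escaped quote such as %22 would put a literal quote inside the attribute value.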

SarahTrees commented 2 years ago

Thank you for this solution! In my case the link starts directly with an umlaut: <a href="%C3%96kologie.html" title="Wechselbeziehungen zwischen Lebewesen und ihrer Umwelt">Ökologie - Wechselbeziehungen</a>

Here the loose comparison (strpos($url, '%') == false) is true, because the '%' is found at position 0 and 0 == false. I changed that to:

    if (strpos($url, '%') > -1) {
        $search[] = $url;
        $replace[] = urldecode($url);
    }
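The difference comes down to what strpos() returns: an integer position (possibly 0) when the needle is found, and false when it is not. A short sketch of the comparisons, using the URL from this comment:

```php
<?php
$url = '%C3%96kologie.html';

$pos = strpos($url, '%'); // int(0): '%' is the very first character

var_dump($pos == false);  // bool(true)  - loose comparison treats 0 as false
var_dump($pos === false); // bool(false) - strict comparison: 0 is not false
var_dump($pos !== false); // bool(true)  - the usual "needle was found" test

// No '%' at all: strpos() returns false.
var_dump(strpos('plain.html', '%') !== false); // bool(false)
```

Comparing with !== false is the form most often seen for this check; on PHP 8.0 and later, str_contains($url, '%') expresses the same test directly.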