oscarotero / Embed

Get info from any web service or page
MIT License

HTML title encoding issue #372

Open JanPetterMG opened 4 years ago

JanPetterMG commented 4 years ago

There's an encoding issue in the HTML provider.

Call: $embed->getProviders()['html']->getTitle()
Returns: Tilbehør for brytere =
Expected: Tilbehør for brytere =<250A | Effektbrytere | Elektroskandia nettbutikk

HTML code: <title>Tilbeh&oslash;r for brytere =&lt;250A | Effektbrytere | Elektroskandia nettbutikk</title>

URL: https://webshop.elektroskandia.no/nor/categories/TELE-DATA-SIKKERHET/INDUSTRI-AUTOMASJON/Effektbrytere/Effektbrytere-MCCB/Tilbeh%C3%B8r-for-brytere-%3D%3C250A/c/702258853

Demo v3: https://oscarotero.com/embed3/demo/index.php?url=https%3A%2F%2Fwebshop.elektroskandia.no%2Fnor%2Fcategories%2FTELE-DATA-SIKKERHET%2FINDUSTRI-AUTOMASJON%2FEffektbrytere%2FEffektbrytere-MCCB%2FTilbeh%25C3%25B8r-for-brytere-%253D%253C250A%2Fc%2F702258853

Version: Embed 3.4.8, PHP 7.4.2

oscarotero commented 4 years ago

The problem seems to be in the strip_tags function, which treats <250A ... as an HTML tag and removes it. I'm not sure how to fix it, because removing strip_tags opens the door to other issues.
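
For reference, the truncation can be reproduced without Embed at all. This is a minimal sketch that assumes the title has already been decoded to plain text by the time strip_tags sees it; the variable names are only for illustration.

```php
<?php
// The decoded title text, copied from the "Expected" value in this issue.
$decoded = 'Tilbehør for brytere =<250A | Effektbrytere | Elektroskandia nettbutikk';

// strip_tags() reads "<250A ..." as the start of a tag. No closing ">" follows,
// so everything from "<" to the end of the string is dropped.
$stripped = strip_tags($decoded);

echo $stripped, PHP_EOL; // Tilbehør for brytere =
```

The output matches the value reported as returned by getTitle() above.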

JanPetterMG commented 4 years ago

I see, but why is the HTML document decoded before the tags are stripped? The content of the <title> tag clearly says &lt;250A, so this shouldn't be a problem in the first place...
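
To illustrate the ordering I mean, here is a rough sketch that strips tags while the entities are still encoded and decodes afterwards. The helper titleFromMarkup is hypothetical and not part of Embed, and it assumes the raw inner markup of the title is available at that point, which may not be the case in the library.

```php
<?php
// Hypothetical helper, not part of Embed: strip tags while entities are
// still encoded, then decode whatever text is left.
function titleFromMarkup(string $innerHtml): string
{
    // "&lt;250A" contains no literal "<", so strip_tags() leaves it alone.
    $withoutTags = strip_tags($innerHtml);

    // Decode named and numeric entities to UTF-8 only after stripping.
    return html_entity_decode($withoutTags, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Raw title markup, copied from the "HTML code" line in this issue.
$raw = 'Tilbeh&oslash;r for brytere =&lt;250A | Effektbrytere | Elektroskandia nettbutikk';

echo titleFromMarkup($raw), PHP_EOL;
// Tilbehør for brytere =<250A | Effektbrytere | Elektroskandia nettbutikk
```

With this order the literal "<" only appears after strip_tags has run, so nothing is cut off. The result is plain text and would still need escaping before being written back into HTML.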

kzgzhn commented 3 years ago

https://stackoverflow.com/questions/2752434/php-domnode-entities-and-nodevalue

https://github.com/oscarotero/Embed/blob/master/src/QueryResult.php#L48
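
The Stack Overflow question is about entities and nodeValue in PHP's DOM extension. As a small sketch of that behavior, assuming the title is read through DOMDocument (I have not verified that this is exactly what the linked line in QueryResult.php does), the text content of a node comes back with entities already decoded:

```php
<?php
// Minimal page using the <title> markup from this issue.
$html = '<html><head><title>Tilbeh&oslash;r for brytere =&lt;250A | Effektbrytere | Elektroskandia nettbutikk</title></head><body></body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();

// textContent returns the parsed text, so "&lt;" is already a literal "<".
$title = $document->getElementsByTagName('title')->item(0)->textContent;
echo $title, PHP_EOL;

// Running strip_tags() on that decoded text then cuts the title at "<".
echo strip_tags($title), PHP_EOL;
```

If that assumption holds, the decoding happens inside the DOM parser, before any call to strip_tags.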