rezakho / ganon

Automatically exported from code.google.com/p/ganon

getPlainText html_entity_decode encoding error #31

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
On PHP versions before 5.4 (and even on later versions, if you believe 
the comments at php.net), html_entity_decode(...) defaults to ISO-8859-1 
rather than UTF-8. This creates problems in getPlainText(), because it calls 
html_entity_decode(...) without passing the document's encoding.

What will reproduce the problem?
Just parse a document whose encoding differs from the default encoding of 
*YOUR* html_entity_decode(...) function, and the problem should be easy to see.
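
A minimal standalone sketch of the effect described above, assuming nothing about Ganon itself: the sample string $html is made up and only stands in for the markup that getPlainText() decodes. It passes the encoding explicitly both ways, so the result does not depend on the PHP version's default.

```php
<?php
// Hypothetical sample markup; it stands in for the output of toString().
$html = 'caf&eacute;&nbsp;au&nbsp;lait';

// Decode the same entities into two different target encodings.
$latin1 = html_entity_decode($html, ENT_QUOTES, 'ISO-8859-1');
$utf8   = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

// The byte sequences differ: &eacute; is E9 in ISO-8859-1 and C3 A9 in UTF-8,
// &nbsp; is A0 in ISO-8859-1 and C2 A0 in UTF-8.
echo bin2hex($latin1), "\n";
echo bin2hex($utf8), "\n";

// preg_match() with the "u" modifier returns 1 only for valid UTF-8 input,
// so it serves here as a dependency-free validity check.
$latin1IsValidUtf8 = (preg_match('//u', $latin1) === 1);
$utf8IsValidUtf8   = (preg_match('//u', $utf8) === 1);
var_dump($latin1IsValidUtf8, $utf8IsValidUtf8);
```

If the document is UTF-8 but the entities are decoded as ISO-8859-1, the decoded bytes are not valid UTF-8, which would explain characters going missing in the output.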

What is the expected output? What do you see instead?
The expected output is correctly decoded HTML entities. 
I get an empty string, like "&nbsp;" => ""
but I would expect to see "&nbsp;" => " "

Which version are you using?
Ganon single file PHP5 (rev. #78)

Please provide any additional information below.
It can be easily resolved by changing the return statement in getPlainText() from

return preg_replace('`\s+`', ' ', html_entity_decode($this->toString(true, true, true), ENT_QUOTES));

to

return preg_replace('`\s+`', ' ', html_entity_decode($this->toString(true, true, true), ENT_QUOTES, $this->getEncoding()));

Original issue reported on code.google.com by thomas.a...@gmail.com on 19 Jan 2013 at 1:57

GoogleCodeExporter commented 8 years ago

Same issue here. I suggest:

preg_replace ( '`[\xA0\s]+`', ' ', html_entity_decode ( $this->toString(true, true, true), ENT_QUOTES | ENT_HTML5, $this->getEncoding() ) );
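
One caveat with the pattern above: without the "u" modifier, \xA0 matches the raw byte 0xA0, which in UTF-8 is also the second byte of characters such as "à" (C3 A0). Note too that ENT_HTML5 requires PHP 5.4 or later. The following is only a sketch of one way to handle both cases; collapseWhitespace() is a hypothetical helper, not part of Ganon, and the non-UTF-8 branch assumes a single-byte encoding such as ISO-8859-1.

```php
<?php
// Hypothetical helper, not part of Ganon: collapse runs of whitespace,
// including non-breaking spaces, into a single ASCII space.
function collapseWhitespace($text, $encoding)
{
    if (strcasecmp($encoding, 'UTF-8') === 0) {
        // With the "u" modifier, \xA0 means the code point U+00A0,
        // so multi-byte characters are never split.
        $result = preg_replace('`[\xA0\s]+`u', ' ', $text);
        // preg_replace() returns null on invalid UTF-8; fall back to the input.
        return $result === null ? $text : $result;
    }
    // Assumption: a single-byte encoding such as ISO-8859-1, where the
    // byte 0xA0 is the non-breaking space.
    return preg_replace('`[\xA0\s]+`', ' ', $text);
}

$decodedUtf8 = html_entity_decode('voil&agrave;&nbsp;&nbsp; tout', ENT_QUOTES, 'UTF-8');
$plainUtf8   = collapseWhitespace($decodedUtf8, 'UTF-8');

$decodedLatin1 = html_entity_decode('a&nbsp;&nbsp;b', ENT_QUOTES, 'ISO-8859-1');
$plainLatin1   = collapseWhitespace($decodedLatin1, 'ISO-8859-1');

echo $plainUtf8, "\n";
```

Branching on the encoding keeps the byte-oriented pattern away from UTF-8 text, where it could otherwise strip the trailing byte of a multi-byte character.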

Original comment by alessand...@gmail.com on 8 Nov 2013 at 2:58