thaolt / phpquery

Automatically exported from code.google.com/p/phpquery

Bug in parser when encoding is not UTF-8 #223

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run the following example:
<?php
 require_once("phpQuery-onefile.php");
 $content = file_get_contents("http://www.prosveshenie.tv/index.php?id=5");
 echo "Code #1:\n";
 echo $content;
 $doc = phpQuery::newDocumentHTML($content);
 echo "Code #2:\n";
 echo pq($doc)->html();
?>

Note that this page is in "windows-1251" encoding.

What is the expected output? What do you see instead?
The HTML node structure in the "Code #1" and "Code #2" sections should be the 
same. In fact, a large part of the source document is missing from "Code #2".

What version of the product are you using? On what operating system?
phpQuery 0.9.5, CentOS (Linux)

Please provide any additional information below.

I tried manually converting the page source to UTF-8 at line #3:
--8<-----------------------
$content = iconv("windows-1251", "utf-8", $content);
// ...
$doc = phpQuery::newDocumentHTML($content);
--8<-----------------------

I also tried specifying the encoding as the second parameter of 
newDocumentHTML on line #5:
--8<-----------------------
 $doc = phpQuery::newDocumentHTML($content, "windows-1251");
--8<-----------------------

Neither approach works as expected.

Original issue reported on code.google.com by a...@vefire.ru on 11 Dec 2012 at 11:01

GoogleCodeExporter commented 9 years ago
Hello... was wondering if you ever found a way to fix this?

Thanks!
Patrick

Original comment by patrick....@infranet.com on 4 Mar 2013 at 3:11

GoogleCodeExporter commented 9 years ago
Hello, Patrick!

Unfortunately, no.

Original comment by a...@vefire.ru on 4 Mar 2013 at 5:22
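
A possible workaround, not confirmed anywhere in this thread: the iconv attempt in the report converts the bytes to UTF-8 but leaves the page's own charset declaration untouched, so the markup still announces "windows-1251" in its meta tag. If the underlying HTML parser trusts that declaration (an assumption here, not something verified against phpQuery 0.9.5), the converted text would be decoded a second time and could be mangled or truncated. The sketch below converts the markup and rewrites the declared charset to match. The helper name toUtf8Html is hypothetical and is not part of phpQuery.

```php
<?php
// Hypothetical helper, not part of phpQuery: convert markup to UTF-8 and
// rewrite the charset declared in its <meta> tags so it matches the bytes.
function toUtf8Html($html, $sourceCharset)
{
    $utf8 = iconv($sourceCharset, 'UTF-8', $html);
    if ($utf8 === false) {
        throw new RuntimeException("Cannot convert markup from $sourceCharset");
    }

    // Replace the value after "charset=" inside any <meta ...> tag. This
    // covers both the http-equiv form and the short <meta charset="..."> form.
    return preg_replace(
        '/(<meta\b[^>]*\bcharset\s*=\s*["\']?)[\w-]+/i',
        '${1}UTF-8',
        $utf8
    );
}

// Usage with the example from this report (requires phpQuery-onefile.php):
// $content = file_get_contents("http://www.prosveshenie.tv/index.php?id=5");
// $doc = phpQuery::newDocumentHTML(toUtf8Html($content, "windows-1251"), "utf-8");
```

Whether this removes the truncation seen in "Code #2" has to be checked against the page from the report. If the output is still cut short, the cause is more likely malformed markup than the encoding.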