jae-jae / QueryList

:spider: The progressive PHP crawler framework! An elegant, progressive PHP scraping framework.
https://querylist.cc
2.66k stars 441 forks

QueryList 4.2.0-4.2.7: garbled text when reading GB2312 pages #140

Closed guokeu closed 5 months ago

guokeu commented 3 years ago

\QueryList-4.2.7\vendor\jaeger\phpquery-single\phpQuery.php, line 462:

    protected function contentTypeFromHTML($markup) {
        $matches = array();
        // find meta tag
        preg_match('@<meta[^>]+http-equiv\\s*=\\s*(["|\']*)Content-Type\\1([^>]*)>@i',// there is a bug here; this is the fixed version

This line causes garbled text when reading GB2312 pages. Even with this fixed, there are other problems. I hope the author can fix this.

jenawant commented 9 months ago

I modified two files, and after that GB2312 pages are scraped and displayed correctly:

1. phpQuery.php: the regexes that detect the page's content type.

In the contentTypeFromHTML method:

    '@<meta[^>]+http-equiv\\s*=\\s*(["|\']?)Content-Type\\1([^>]+?)>@i'

In the charsetFixHTML method, first change:

    '@\s*<meta[^>]+http-equiv\\s*=\\s*(["|\']?)Content-Type\\1([^>]+?)>@i'

In the charsetFixHTML method, second change:

    $metaContentType = $matches[2][0];
    $markup = substr($markup, 0, $matches[0][1]) . substr($markup, $matches[2][1] + strlen($metaContentType)+1);
    $headStart = stripos($markup, '<head>');
    $markup = substr($markup, 0, $headStart + 6) . '<meta http-equiv="Content-Type"'.$metaContentType.'>' . substr($markup, $headStart + 6);

2. HttpService.php: $ql->setHtml($html, 'utf-8'); so that the encoding is set to UTF-8 by default.
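For readers who want to try the pattern outside phpQuery, here is a minimal standalone sketch. It reuses the contentTypeFromHTML regex quoted in this comment to show that the optional quote group `(["|\']?)` accepts quoted and unquoted `http-equiv` values. The function name `detectMetaCharset` and the second, charset-extracting regex are assumptions made for illustration. They are not part of phpQuery or QueryList.

```php
<?php
// Hypothetical helper (not phpQuery API): returns the lowercased charset declared
// in a <meta http-equiv=Content-Type ...> tag, or null when no such tag is found.
function detectMetaCharset(string $markup): ?string
{
    // Pattern taken from the comment above. (["|\']?) makes the quote around
    // Content-Type optional, and \1 requires the same (possibly empty) closing quote.
    $pattern = '@<meta[^>]+http-equiv\\s*=\\s*(["|\']?)Content-Type\\1([^>]+?)>@i';
    if (!preg_match($pattern, $markup, $matches)) {
        return null;
    }

    // $matches[2] is the rest of the tag, e.g. ' content="text/html; charset=gb2312"'.
    // This second pattern is an illustrative assumption, not taken from phpQuery.
    if (!preg_match('@charset\\s*=\\s*["\']?([\\w-]+)@i', $matches[2], $charset)) {
        return null;
    }

    return strtolower($charset[1]);
}

$quoted   = '<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"></head>';
$unquoted = '<head><meta http-equiv=Content-Type content=text/html;charset=GBK></head>';

var_dump(detectMetaCharset($quoted));   // string(6) "gb2312"
var_dump(detectMetaCharset($unquoted)); // string(3) "gbk"
```

The detected name could then be used to convert the page to UTF-8 before parsing. This sketch only covers detection, which is the step the regex changes affect.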

jae-jae commented 5 months ago

fixed