jae-jae / QueryList

:spider: The progressive PHP crawler framework!
https://querylist.cc
2.67k stars 443 forks

encoding conversion fails and the result comes back blank #34

Closed ctfang closed 4 years ago

ctfang commented 6 years ago

// Fails: ->encoding('UTF-8','GB2312')

Works: converting each item after the result set is built, echo iconv('GB2312', 'UTF-8', $item['title'])."\n";

ctfang commented 6 years ago
    $listmain  = $ql->encoding('UTF-8','GBK')->rules([
        'title' => array('dd>a', 'text'),
        'link' => array('dd>a', 'href')
    ])->query()->getData();

// Stepping into the source, I can see the conversion succeeds, but $listmain is still empty
class EncodeService
{
    public static function convert(QueryList $ql, string $outputEncoding, string $inputEncoding = null)
    {
        $html = $ql->getHtml();
        $inputEncoding || $inputEncoding = self::detect($html);
        $html = iconv($inputEncoding, $outputEncoding, $html);
        dump($inputEncoding, $outputEncoding, $html);
        $ql->setHtml($html);
        return $ql;
    }
}

wangyouw commented 6 years ago

OP, did you find the cause? I'm running into the same problem.

varphper commented 6 years ago

Has this still not been resolved?

luffyzhao commented 6 years ago

My workaround is:

$ql->find('meta[http-equiv="Content-Type"]')->attr('content', 'text/html; charset=utf-8');

qwqcode commented 6 years ago

function handleGbkPage($html)
{
    $html = mb_convert_encoding($html, 'UTF-8', 'GBK');
    $html = preg_replace('/charset=(gb2312|gbk)/is', 'charset=utf-8', $html); // charset=* in <meta/> must be replaced with utf-8, otherwise phpQuery cannot parse the tags

    return $html;
}

$html = handleGbkPage($html);
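$ql = (new QueryList())->html($html);

youngda commented 6 years ago

Same problem here. I tried every method in the documentation and none of them worked, so I quietly wrote my own regex, and its output is fine. As far as I can tell the fetch itself works, but as soon as I use QueryList's matching the text turns to garbage. The code from the poster above did not work for me either. If anyone has solved this, please @ me. Thanks.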

qwqcode commented 6 years ago

@youngda First convert GBK to UTF-8, then replace charset=* in the meta tag with utf-8. That fixed it for me.
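A minimal sketch of the two steps described here, applied only when the markup actually declares gb2312 or gbk. `convertGbkHtmlToUtf8` is a hypothetical helper name, not part of QueryList, and the sketch assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper, not part of QueryList: convert a GBK-family page to
// UTF-8 and rewrite its declared charset so that the two stay in agreement.
function convertGbkHtmlToUtf8(string $html): string
{
    // Matches charset=gb2312, charset="gbk", charset='GB2312', and so on.
    $pattern = '/(charset\s*=\s*["\']?\s*)(gb2312|gbk)\b/i';

    // Leave pages without a GBK-family declaration untouched.
    if (!preg_match($pattern, $html)) {
        return $html;
    }

    // Step 1: convert the bytes. GBK is a superset of GB2312.
    $html = mb_convert_encoding($html, 'UTF-8', 'GBK');

    // Step 2: rewrite the declaration to match the new encoding.
    return preg_replace($pattern, '${1}utf-8', $html) ?? $html;
}
```

The returned string can then be passed to `(new QueryList())->html(...)` in the same way as the output of `handleGbkPage` earlier in the thread.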

youngda commented 6 years ago

@Zneiat That doesn't work in my tests. If I print the fetched HTML directly it looks fine, but once I turn on matching the output is garbled.

shanezhiu commented 6 years ago

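The HTML page I'm scraping is already UTF-8, but the Chinese values I get back through the text attribute are garbled. It feels like a bug in the library as a whole.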

youngda commented 6 years ago

@shanezhiu Agreed, though it's also possible that we just haven't found the right approach and aren't using it properly.

qwqcode commented 6 years ago

@youngda Post your code and I'll take a look.

shanezhiu commented 6 years ago

@Zneiat

public function handle_content()
{
        $data = $this->spider
            ->rules([
                'title' => ['#activity-name','text']
            ])
            ->get("https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1")
            ->encoding('UTF-8','GB2312')
            ->query()
            ->getData()
            ->toArray();
        $title = array_pop($data)['title'];
        var_dump($title);exit;
}

shanezhiu commented 6 years ago

@youngda It's most likely a bug. I'll go dig through the source.

qwqcode commented 6 years ago

@shanezhiu Try:

$url = "https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1";

$html = file_get_contents($url); // cURL is recommended instead
$html = handleGbkPage($html);

$ql = (new QueryList())->html($html); // load the HTML
$data = $ql->rules([
    'title' => ['#activity-name','text']
])->query()->getData()->all();
var_dump($data);die();

function handleGbkPage($html)
{
    $html = mb_convert_encoding($html, 'UTF-8', 'GBK');
    $html = preg_replace('/charset=(gb2312|gbk)/is', 'charset=utf-8', $html); // charset=* in <meta/> must be replaced with utf-8, otherwise phpQuery cannot parse the tags

    return $html;
}

qwqcode commented 6 years ago

@shanezhiu https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1 XD, this page is already UTF-8, so no conversion is needed

shanezhiu commented 6 years ago

@Zneiat Try removing the encoding call, print the title, and see what you get.

qwqcode commented 6 years ago

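@shanezhiu It seems we're not talking about the same problem... Mine was that after converting GBK to UTF-8 there was no garbled text, yet phpQuery still couldn't extract any content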

shanezhiu commented 6 years ago

@Zneiat I'm curious: did you actually run the snippet you posted? When I run it, I get:

array (size=1)
  0 => 
    array (size=1)
      'title' => string '1603澶╁悗锛孧H370渚濇棫鏃犳硶纭澶辫仈鐪熸鍘熷洜锛' (length=152)

This result is clearly wrong.
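The garbled title is consistent with a page that was already UTF-8 being converted from GBK a second time. A minimal sketch of a guard against that, assuming the mbstring extension is available; `toUtf8IfNeeded` is a hypothetical helper name, not part of QueryList:

```php
<?php
// Hypothetical helper, not part of QueryList: only convert when the input
// is not already valid UTF-8, to avoid double conversion.
function toUtf8IfNeeded(string $html, string $sourceEncoding = 'GBK'): string
{
    // Bytes that already form valid UTF-8 are returned unchanged.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }

    // Otherwise assume the caller-supplied source encoding (GBK by default).
    return mb_convert_encoding($html, 'UTF-8', $sourceEncoding);
}
```

This is a heuristic rather than real detection: a short GBK string could in principle also be valid UTF-8, so the page's declared charset is still worth checking.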

shanezhiu commented 6 years ago

@Zneiat I think both of these are encoding problems.

qwqcode commented 6 years ago

@shanezhiu Solved... You're scraping a WeChat Official Account article. The <!--headTrap<body></body><head></head><html></html>--> at the start of the HTML and the <!--tailTrap<body></body><head></head><html></html>--> at the end interfere with phpQuery

$url = "https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1";

$html = file_get_contents($url); // cURL is recommended instead

$html = str_replace(['<!--headTrap<body></body><head></head><html></html>-->', '<!--tailTrap<body></body><head></head><html></html>-->'], '', $html);

$ql = (new QueryList())->html($html); // load the HTML
$data = $ql->find('#activity-name')->text();
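var_dump($data);

shanezhiu commented 6 years ago

@Zneiat Thank you. Yes, that's the cause; I stepped through it in a debugger and confirmed it. A maintainer may need to move this part of the discussion into a new issue.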
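The exact-string replacement used in the fix can be generalized a little. A minimal sketch, assuming the trap markers always take the form of an HTML comment that begins with headTrap or tailTrap; `stripTrapComments` is a hypothetical helper name, not part of QueryList or phpQuery:

```php
<?php
// Hypothetical helper: remove the headTrap/tailTrap comments before parsing.
function stripTrapComments(string $html): string
{
    // Non-greedy match from "<!--headTrap" or "<!--tailTrap" to the first
    // "-->". The s flag lets "." span newlines inside the comment.
    $cleaned = preg_replace('/<!--(?:head|tail)Trap.*?-->/s', '', $html);

    // preg_replace returns null on error; fall back to the original input.
    return $cleaned ?? $html;
}
```

Ordinary HTML comments are left in place, and the cleaned string can be passed to `(new QueryList())->html(...)` as in the snippet above.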

qwqcode commented 6 years ago

@shanezhiu Haha, you're welcome (/ω\)

youngda commented 6 years ago

@Zneiat Thanks, that was exactly the problem. Turns out I just wasn't experienced enough.