encoding() conversion fails and the result is blank

ctfang commented 6 years ago

// Fails: ->encoding('UTF-8','GB2312')

Works: running iconv on each item of the result set afterwards: echo iconv('GB2312', 'UTF-8', $item['title'])."\n";

ctfang commented 6 years ago

    $listmain  = $ql->encoding('UTF-8','GBK')->rules([
        'title' => array('dd>a', 'text'),
        'link' => array('dd>a', 'href')
    ])->query()->getData();

// Stepping into the source, I can see the conversion succeeds, but $listmain is empty
class EncodeService
{
    public static function convert(QueryList $ql, string $outputEncoding, string $inputEncoding = null)
    {
        $html = $ql->getHtml();
        $inputEncoding || $inputEncoding = self::detect($html);
        $html = iconv($inputEncoding, $outputEncoding, $html);
        dump($inputEncoding, $outputEncoding, $html); // debug output added while investigating
        $ql->setHtml($html);
        return $ql;
    }
}

wangyouw commented 6 years ago

Did the OP find the cause? I'm hitting the same problem.

varphper commented 6 years ago

Is this still unresolved?

luffyzhao commented 6 years ago

My workaround is:

$ql->find('meta[http-equiv="Content-Type"]')->attr('content', 'text/html; charset=utf-8');

qwqcode commented 6 years ago

function handleGbkPage($html)
{
    $html = mb_convert_encoding($html, 'UTF-8', 'GBK');
    $html = preg_replace('/charset=(gb2312|gbk)/is', 'charset=utf-8', $html); // The charset=* in <meta/> must be replaced with utf-8, otherwise phpQuery cannot parse the tags

    return $html;
}

$html = handleGbkPage($html);
$ql = (new QueryList())->html($html);

youngda commented 6 years ago

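Same problem. I tried every method in the documentation and none of them worked, so I quietly wrote my own regex, and its output is fine. As far as I can tell the fetch itself works, but the text becomes garbled as soon as I use the rule matching. The code from the commenter above didn't work for me either. If anyone solves this, please @ me. Thanks.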

qwqcode commented 6 years ago

@youngda First convert GBK to UTF-8, then replace charset=* in the meta tag with utf-8. That solved it for me.
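The two steps described above (convert the bytes, then rewrite the declared charset) can be sketched as a single helper. This is only a sketch under assumptions: `toUtf8Html` is a hypothetical name, not part of QueryList; it needs the mbstring extension; and it assumes that anything that is not valid UTF-8 is in the fallback encoding, which will not hold for every page.

```php
<?php
// Hypothetical helper, not part of QueryList: normalise a fetched page to UTF-8.
function toUtf8Html(string $html, string $fallback = 'GBK'): string
{
    // Convert only when the bytes are not already valid UTF-8,
    // so a UTF-8 page is never converted a second time.
    if (!mb_check_encoding($html, 'UTF-8')) {
        $html = mb_convert_encoding($html, 'UTF-8', $fallback);
    }

    // Rewrite the declared charset so it matches the converted bytes.
    // An optional opening quote after "charset=" is preserved.
    $fixed = preg_replace('/charset=(["\']?)(gb2312|gbk|gb18030)/i', 'charset=${1}utf-8', $html);

    // preg_replace() returns null on error; fall back to the unmodified HTML.
    return $fixed ?? $html;
}
```

The result would then be passed to `(new QueryList())->html(...)` in the same way as the `handleGbkPage()` output above.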

youngda commented 6 years ago

@Zneiat That didn't work in my tests. If I output the fetched HTML directly it looks fine, but the output is garbled once rule matching is enabled.

shanezhiu commented 6 years ago

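The HTML page I'm scraping is already UTF-8, but the Chinese text I get back from the text attribute is garbled. This feels like a bug in the library itself.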

youngda commented 6 years ago

@shanezhiu Agreed, although it may also be that we haven't found the right approach and can't handle it properly.

qwqcode commented 6 years ago

@youngda Post your code and I'll take a look.

shanezhiu commented 6 years ago

@Zneiat

public function handle_content()
{
        $data = $this->spider
            ->rules([
                'title' => ['#activity-name','text']
            ])
            ->get("https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1")
            ->encoding('UTF-8','GB2312')
            ->query()
            ->getData()
            ->toArray();
        $title = array_pop($data)['title'];
        var_dump($title);exit;
}

shanezhiu commented 6 years ago

@youngda A bug seems more likely. I'll dig through the source.

qwqcode commented 6 years ago

@shanezhiu Try:

$url = "https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1";

$html = file_get_contents($url); // cURL is recommended
$html = handleGbkPage($html);

$ql = (new QueryList())->html($html); // load the HTML
$data = $ql->rules([
    'title' => ['#activity-name','text']
])->query()->getData()->all();
var_dump($data);die();

function handleGbkPage($html)
{
    $html = mb_convert_encoding($html, 'UTF-8', 'GBK');
    $html = preg_replace('/charset=(gb2312|gbk)/is', 'charset=utf-8', $html); // The charset=* in <meta/> must be replaced with utf-8, otherwise phpQuery cannot parse the tags

    return $html;
}

qwqcode commented 6 years ago

@shanezhiu https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1 XD The encoding is already UTF-8, so no conversion is needed.
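One way to check this before deciding whether to convert at all is to read the charset that the page declares. A rough sketch, assuming the page declares its charset in a `<meta>` tag; `declaredCharset` is a hypothetical helper name, not a QueryList or phpQuery function, and a real page may also declare its charset only in the HTTP `Content-Type` header, which this does not inspect.

```php
<?php
// Hypothetical helper: return the charset declared in a <meta> tag, lower-cased,
// or null when no such declaration is found.
function declaredCharset(string $html): ?string
{
    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    if (preg_match('/<meta[^>]+charset=["\']?([\w-]+)/i', $html, $m)) {
        return strtolower($m[1]);
    }
    return null;
}
```

If this returns `utf-8`, converting from GB2312 or GBK would decode the bytes a second time, which matches the garbled title reported in this thread.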

shanezhiu commented 6 years ago

@Zneiat Try removing the encoding code, printing the title, and looking at the result.

qwqcode commented 6 years ago

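@shanezhiu We seem to be discussing different problems... The one I ran into is that after converting GBK to UTF-8 the text is no longer garbled, but phpQuery still can't extract the content.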

shanezhiu commented 6 years ago

@Zneiat What I'm curious about is whether you actually ran the snippet you provided. When I run it, I get:

array (size=1)
  0 => 
    array (size=1)
      'title' => string '1603æ¾¶âæéå§H370æ¸æ¿æ£«éç³ç¡¶çºî¿î»æ¾¶è¾«ä»éªç¸îéç·æ´é' (length=152)

This result is clearly incorrect.

shanezhiu commented 6 years ago

@Zneiat I think both of these are encoding problems.

qwqcode commented 6 years ago

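@shanezhiu Solved... You're scraping a WeChat Official Account article. The <!--headTrap<body></body><head></head><html></html>--> at the start of the HTML and the <!--tailTrap<body></body><head></head><html></html>--> at the end interfere with phpQuery.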

$url = "https://mp.weixin.qq.com/s?src=11&timestamp=1533000601&ver=1031&signature=*LFN6KjIY93ucjNZzMBCspPXRI*0VIxcQpN8alDP5GHZRuSkdqkGT8PlR9ytsfrbLfufk4Fxy3oIWTlGuOpNcj*OjGK9Wf48nFqedKxx6pwXYfTak9*dvH8vgVC7A3xW&new=1";

$html = file_get_contents($url); // cURL is recommended

$html = str_replace(['<!--headTrap<body></body><head></head><html></html>-->', '<!--tailTrap<body></body><head></head><html></html>-->'], '', $html);

$ql = (new QueryList())->html($html); // load the HTML
$data = $ql->find('#activity-name')->text();
var_dump($data);
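The `str_replace()` call above only matches those two comments byte for byte. A slightly more tolerant variant, offered as a sketch only, removes any comment that starts with `headTrap` or `tailTrap`; `stripTrapComments` is a hypothetical helper name, and the assumption that such comments always begin with exactly those words comes from the two strings shown above, not from any WeChat documentation.

```php
<?php
// Hypothetical helper: remove <!--headTrap...--> and <!--tailTrap...--> comments
// before handing the HTML to QueryList.
function stripTrapComments(string $html): string
{
    // Non-greedy match up to the first "-->"; the s flag lets "." span newlines.
    $clean = preg_replace('/<!--(?:head|tail)Trap.*?-->/s', '', $html);

    // preg_replace() returns null on error; fall back to the unmodified HTML.
    return $clean ?? $html;
}
```

The cleaned string can then be loaded with `(new QueryList())->html(...)` exactly as in the snippet above.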

shanezhiu commented 6 years ago

@Zneiat Thank you. Yes, that is the cause. I stepped through it in the debugger and confirmed it. An admin may need to move these comments to a new issue.

qwqcode commented 6 years ago

@shanezhiu Haha, you're welcome (/ω＼)

youngda commented 6 years ago

@Zneiat Thanks, that was exactly the problem. Turns out I'm just still inexperienced.

jae-jae / QueryList

encoding() conversion fails and the result is blank #34