jae-jae / QueryList

:spider: The progressive PHP crawler framework! An elegant, progressive PHP scraping framework.
https://querylist.cc
2.65k stars 441 forks

Scraping breaks when several rules target the same element #35

Closed zeroone2005 closed 6 years ago

zeroone2005 commented 6 years ago

The HTML is as follows:

<ul class="img">
<li>
    <a href="/toutiao/7633.html" target="_blank"><img src="/uploads/allimg/c180413/152360E3O9160-112P60.jpg" alt="图1"></a>
    <p>分类:<a href="/toutiao/">toutiao</a></p>
    <p class="p_title"><a href="/toutiao/7633.html" target="_blank">图1</a></p>
</li>
<li>
    <a href="/toutiao/7632.html" target="_blank"><img src="/uploads/allimg/c180413/152360E3O9160-112P60.jpg" alt="图2"></a>
    <p>分类:<a href="/toutiao/">toutiao</a></p>
    <p class="p_title"><a href="/toutiao/7632.html" target="_blank">图2</a></p>
</li>
<li>
    <a href="/toutiao/7631.html" target="_blank"><img src="/uploads/allimg/c180413/152360E3O9160-112P60.jpg" alt="图3"></a>
    <p>分类:<a href="/toutiao/">toutiao</a></p>
    <p class="p_title"><a href="/toutiao/7631.html" target="_blank">图3</a></p>
</li>
</ul>
...

The scraping rules are as follows:

            $rules = [
                'url'          => ['ul.img>li>a', 'href'],
                'img'          => ['ul.img>li>a>img', 'src'],
                'alt'          => ['ul.img>li>a>img', 'alt'],
                'category'     => ['ul.img>li>p>a', 'text'],
                'category_url' => ['ul.img>li>p>a', 'href']
            ];

I want to scrape the href and text of the <a> inside the first <p> of each <li>. In the output, the first record is correct, but the ones after it are wrong:

                    [0] => Array
                        (
                            [url] => /toutiao/7633.html
                            [img] => /uploads/allimg/c180413/152360E3O9160-112P60.jpg
                            [alt] => 图片1
                            [category] => toutiao
                            [category_url] => /toutiao/
                        )

                    [1] => Array
                        (
                            [url] => /toutiao/7632.html
                            [img] => /uploads/allimg/c180413/152360D3FX10-R5233.jpg
                            [alt] => 图片2
                            [category] => 图片2
                            [category_url] => /toutiao/7633.html
                        )

                    [2] => Array
                        (
                            [url] => /toutiao/7631.html
                            [img] => /uploads/allimg/c180413/152360C3V25Z-542957.jpg
                            [alt] => 图片3
                            [category] => toutiao
                            [category_url] => /toutiao/
                        )

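The second record takes its category from the <a> in the second <p> of the first <li>, rather than from the first <p> of the second <li>.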
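
One plausible reading of this output, sketched with PHP's built-in DOM extension rather than QueryList itself: if every rule is matched against the whole document and the matches are paired up by position, then `ul.img>li>p>a` yields two links per `<li>` while `ul.img>li>a` yields one, so the columns drift apart from the second row on. The markup below is a trimmed, ASCII-only stand-in for the HTML above, and the XPath expressions are my own translations of the CSS selectors. This illustrates the pairing problem only; it is not QueryList's actual code.

```php
<?php
// Trimmed stand-in for the markup in the issue (two items, ASCII text only).
$html = <<<'HTML'
<ul class="img">
<li>
    <a href="/toutiao/7633.html"><img src="/a.jpg" alt="img1"></a>
    <p><a href="/toutiao/">toutiao</a></p>
    <p class="p_title"><a href="/toutiao/7633.html">title1</a></p>
</li>
<li>
    <a href="/toutiao/7632.html"><img src="/b.jpg" alt="img2"></a>
    <p><a href="/toutiao/">toutiao</a></p>
    <p class="p_title"><a href="/toutiao/7632.html">title2</a></p>
</li>
</ul>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Document-wide matching: each "rule" is evaluated on the whole page
// and the results are paired purely by their position in the match list.
$flat = [];
foreach ($xpath->query('//ul[@class="img"]/li/a') as $i => $a) {
    $flat[$i]['url'] = $a->getAttribute('href');
}
foreach ($xpath->query('//ul[@class="img"]/li/p/a') as $i => $a) {
    $flat[$i]['category_url'] = $a->getAttribute('href');
}

// Scoped matching: iterate over each <li> first, then evaluate every
// rule relative to that <li>, taking only the first <p>'s link.
$scoped = [];
foreach ($xpath->query('//ul[@class="img"]/li') as $li) {
    $scoped[] = [
        'url'          => $xpath->query('./a', $li)->item(0)->getAttribute('href'),
        'category_url' => $xpath->query('./p[1]/a', $li)->item(0)->getAttribute('href'),
    ];
}

print_r($flat);
print_r($scoped);
```

In the flat result, row 1 pairs the second item's `url` with the first item's title link, which is the symptom reported above; in the scoped result, each row only contains values from its own `<li>`.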

jae-jae commented 6 years ago

This is a selector problem. To scrape a list, you need to use the list (range) selector `range`. Documentation: https://doc.querylist.cc/site/index/doc/14
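
A minimal sketch of what that could look like, assuming QueryList v4 installed through Composer (`jaeger/querylist`) and its `html()`, `rules()`, `range()`, `query()` and `getData()` methods. The range selector `ul.img>li` and the relative rule selectors (`a:eq(0)`, `p:eq(0) a`) are my own guesses based on the markup in the question, shown here on a trimmed ASCII-only sample; they have not been run against the real page, so check them against the linked documentation.

```php
<?php
require 'vendor/autoload.php'; // assumes: composer require jaeger/querylist

use QL\QueryList;

// Trimmed stand-in for the markup in the issue (two items, ASCII text only).
$html = <<<'HTML'
<ul class="img">
<li>
    <a href="/toutiao/7633.html"><img src="/a.jpg" alt="img1"></a>
    <p><a href="/toutiao/">toutiao</a></p>
    <p class="p_title"><a href="/toutiao/7633.html">title1</a></p>
</li>
<li>
    <a href="/toutiao/7632.html"><img src="/b.jpg" alt="img2"></a>
    <p><a href="/toutiao/">toutiao</a></p>
    <p class="p_title"><a href="/toutiao/7632.html">title2</a></p>
</li>
</ul>
HTML;

// With a range set, each rule selector is resolved relative to one <li>.
$rules = [
    'url'          => ['a:eq(0)', 'href'],   // first link inside the <li>
    'img'          => ['img', 'src'],
    'alt'          => ['img', 'alt'],
    'category'     => ['p:eq(0) a', 'text'], // link inside the first <p>
    'category_url' => ['p:eq(0) a', 'href'],
];

$data = QueryList::html($html)
    ->rules($rules)
    ->range('ul.img>li') // one result row per matched <li>
    ->query()
    ->getData()
    ->all();

print_r($data);
```

The design point is that `range` fixes the row boundary first, so a rule that matches several elements inside one `<li>` can no longer shift values into the next row.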