scrapy / parsel

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
BSD 3-Clause "New" or "Revised" License

Unexpected CSS parsing caused by the default prefix: `descendant-or-self::` #279

Closed: fallingin closed this issue 1 year ago

fallingin commented 1 year ago

Description

This is my first time submitting an issue, so please forgive me if there are any problems. Hoping it helps.

When I use the CSS selector :nth-child(2) on an HTML structure like the following, I get unexpected behavior:

<div>
    <div></div>
    <div>(1)Base this div (3)But I actually got this one
        <div></div>
        <div>(2)I want to get this div</div>
        <div></div>
        <div></div>
        <div></div>
        <div></div>
    </div>
</div>

Steps to Reproduce

The following simple code reproduces the behavior I described every time.

import scrapy
import requests

response = requests.get('https://shouji.baidu.com/detail/3981189')
selector = scrapy.Selector(text=response.text)

base_info_div = selector.css('.app-base-info-content')

print(base_info_div.css('div:nth-child(1) span:nth-child(2)::text').extract_first())
print(base_info_div.css('div:nth-child(2) span:nth-child(2)::text').extract_first())
print(base_info_div.css('div:nth-child(3) span:nth-child(2)::text').extract_first())

Expected behavior:

8.8MB
2.2.3
2020-09-25 15:50:38

Actual behavior:

8.8MB
8.8MB
2020-09-25 15:50:38

Versions

Scrapy       : 2.8.0
lxml         : 4.9.1.0
libxml2      : 2.9.14
cssselect    : 1.0.0
parsel       : 1.5.0
w3lib        : 1.21.0
Twisted      : 22.2.0
Python       : 3.10.9 (main, Mar  1 2023, 12:20:14) [Clang 14.0.6 ]
pyOpenSSL    : 21.0.0 (OpenSSL 1.1.1m  14 Dec 2021)
cryptography : 36.0.1
Platform     : macOS-13.4-arm64-arm-64bit

Additional context

After tracing the code execution, I found that the default prefix in the following code causes my problem.

https://github.com/scrapy/parsel/blob/8c39b2fa564295247bc46cd30e313dc449889b12/parsel/csstranslator.py#L147-L152

The prefix descendant-or-self:: causes base_info_div.css('div:nth-child(2)') to select base_info_div itself first, because base_info_div is a div and is also the second child of its own parent.

I understand what this prefix does, but I'm wondering what the point of having it as the default is. Is it a feature, a way to deal with some special case, or just a bug?

wRAR commented 1 year ago

> The prefix descendant-or-self:: causes the code base_info_div.css('div:nth-child(2)') actually select base_info_div itself firstly.

Yes, and you could minimize all of this to just base_info_div.css('div'), which also returns base_info_div because that's a div. This is clearly by design.

> whether it's a feature or a way to deal with some special case or just a bug?

I'm sure it's a design decision. Note that while you quoted a default value in parsel, the cssselect function that it calls has the same default (it's the same function in the base class, after all) and it was there starting with the initial cssselect code in 2007.