CSS selector cannot select a p tag inside h3

xujiang1 commented 4 years ago

from parsel import Selector

html = "<h3>吉林大学社会科学学报<p>Jilin University Journal Social Sciences Edition</p></h3>"

sel = Selector(html)

print(sel.css("h3"))

print(sel.css("h3 > p::text").getall())

When I use a CSS selector, I cannot get the p tag inside the h3. The result is:


[<Selector xpath='descendant-or-self::h3' data='<h3>吉林大学社会科学学报</h3>'>]

[]

When I replace the p tag with a different tag, it is selected as expected:

from parsel import Selector

html = "<h3>吉林大学社会科学学报<em>Jilin University Journal Social Sciences Edition</em></h3>"

sel = Selector(html)

print(sel.css("h3"))

print(sel.css("h3 > em::text").getall())

Result:

[<Selector xpath='descendant-or-self::h3' data='<h3>吉林大学社会科学学报<em>Jilin University Jo...'>]

['Jilin University Journal Social Sciences Edition']

Gallaecio commented 4 years ago

Interesting. I’m marking it as a bug, although I am not 100% sure it is one, and even if it is, it is probably an upstream issue in lxml.
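
One way to check the upstream hypothesis is to parse the same markup with lxml directly and see where the p element ends up. The following is a minimal sketch, assuming Parsel hands the markup to lxml's HTML parser with settings similar to the ones used here; the helper name `inspect_tree` and the report keys are made up for illustration, and the result may differ between lxml/libxml2 versions.

```python
from lxml import etree

HTML = "<h3>吉林大学社会科学学报<p>Jilin University Journal Social Sciences Edition</p></h3>"


def inspect_tree(html):
    """Parse html with lxml's HTML parser and report where the <p> ended up.

    Hypothetical helper for this thread, not part of Parsel or lxml.
    """
    # Assumption: a recovering HTML parser fed UTF-8 bytes approximates
    # what Parsel does internally.
    parser = etree.HTMLParser(recover=True, encoding="utf-8")
    root = etree.fromstring(html.encode("utf-8"), parser=parser)
    return {
        # The tree as lxml built it, including the html/body wrappers it adds.
        "serialized": etree.tostring(root, encoding="unicode"),
        # True only if the parser kept <p> as a direct child of <h3>.
        "p_inside_h3": bool(root.xpath("//h3/p")),
        # Text of any <p>, wherever the parser placed it.
        "p_text_anywhere": root.xpath("//p/text()"),
    }


report = inspect_tree(HTML)
for key, value in report.items():
    print(f"{key}: {value}")
```

If `p_inside_h3` is false while `p_text_anywhere` still contains the English title, the parser moved the p out of the h3 while building the tree, which would explain why `h3 > p::text` matches nothing and would point at lxml/libxml2 rather than at Parsel's CSS-to-XPath translation.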

felipeboffnunes commented 2 years ago

@Gallaecio what about this https://stackoverflow.com/questions/19779519/is-it-valid-to-have-paragraph-elements-inside-of-a-heading-tag-in-html5-p-insid

Gallaecio commented 2 years ago

I don’t think Parsel intends to require that input HTML is standard-compliant. Ideally, anything that a browser accepts we should accept as well, because HTML documents in the wild care about browser support more than they care about standard compliance.

Browsers seem to accept this syntax.

scrapy / parsel

CSS selector cannot select a p tag inside h3 #203