scrapy / parsel

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
BSD 3-Clause "New" or "Revised" License

CSS selector cannot select the p tag inside h3 #203

Open xujiang1 opened 4 years ago

xujiang1 commented 4 years ago
from parsel import Selector

html = "<h3>吉林大学社会科学学报<p>Jilin University Journal Social Sciences Edition</p></h3>"

sel = Selector(html)

print(sel.css("h3"))

print(sel.css("h3 > p::text").getall())

When I use a CSS selector, I cannot get the p tag inside the h3. The output is:


[<Selector xpath='descendant-or-self::h3' data='<h3>吉林大学社会科学学报</h3>'>]

[]

When I replace the p tag with a different tag, the selector works as expected:

from parsel import Selector

html = "<h3>吉林大学社会科学学报<em>Jilin University Journal Social Sciences Edition</em></h3>"

sel = Selector(html)

print(sel.css("h3"))

print(sel.css("h3 > em::text").getall())

Output:

[<Selector xpath='descendant-or-self::h3' data='<h3>吉林大学社会科学学报<em>Jilin University Jo...'>]

['Jilin University Journal Social Sciences Edition']

Gallaecio commented 4 years ago

Interesting. I’m marking it as a bug, although I am not 100% sure it is one, and even if it is, it is probably an upstream issue in lxml.

felipeboffnunes commented 2 years ago

@Gallaecio what about this https://stackoverflow.com/questions/19779519/is-it-valid-to-have-paragraph-elements-inside-of-a-heading-tag-in-html5-p-insid

Gallaecio commented 2 years ago

I don’t think Parsel intends to require that input HTML is standard-compliant. Ideally, anything that a browser accepts we should accept as well, because HTML documents in the wild care about browser support more than they care about standard compliance.

Browsers seem to accept this syntax.