rushter / selectolax

Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).
MIT License

Segmentation fault accessing attributes #135

Open wooque opened 2 weeks ago

wooque commented 2 weeks ago

Here is a script that reproduces the crash:

import urllib.request

from selectolax.lexbor import LexborHTMLParser

# Fetch the page that triggers the crash.
with urllib.request.urlopen(
    "https://rhodes-ltd-339.myshopify.com"
) as response:
    data = response.read()

html = data.decode("utf-8")
parser = LexborHTMLParser(html)

# Walk the children of <head>; the crash happens on an `attributes` access.
for elem in parser.head.iter():
    print("tag", elem.tag)
    print("attributes", elem.attributes)

print("done")

It crashes when trying to access the attributes of the 3rd comment node.

mxnurmi commented 14 hours ago

Adding another case where the Lexbor backend causes a segmentation fault but the Modest backend works.

Causes a segmentation fault:

from selectolax.lexbor import LexborHTMLParser

parser = LexborHTMLParser("")
for node in parser.root.traverse():
    # Segfaults when reading attributes of a parent node.
    parent = node.parent.attributes.get("anything")

print("done")

Works as expected:

from selectolax.parser import HTMLParser

parser = HTMLParser("")
for node in parser.root.traverse():
    # Same access pattern, but the Modest backend does not crash.
    parent = node.parent.attributes.get("anything")

print("done")

With Lexbor, the issue seems to be that when the parser generates HTML elements itself, the parents of those generated elements do not always have .attributes.
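
Since a segfault kills the interpreter before any Python exception can be caught, it may help to run each snippet in a child process when comparing the two backends. Below is a minimal sketch using only the standard library; `run_isolated`, `describe`, and `LEXBOR_SNIPPET` are names made up for this illustration, not part of selectolax. It assumes a POSIX system, where a negative return code means the child was killed by a signal.

```python
import signal
import subprocess
import sys

# Hypothetical snippet under test, taken from the Lexbor repro above.
LEXBOR_SNIPPET = (
    "from selectolax.lexbor import LexborHTMLParser\n"
    "parser = LexborHTMLParser('')\n"
    "for node in parser.root.traverse():\n"
    "    node.parent.attributes.get('anything')\n"
    "print('done')\n"
)


def run_isolated(snippet: str, timeout: float = 30.0) -> subprocess.CompletedProcess:
    """Run `snippet` in a fresh interpreter so a crash cannot take down the caller.

    `-X faulthandler` makes the child dump a Python traceback to stderr
    when it receives a fatal signal such as SIGSEGV.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def describe(result: subprocess.CompletedProcess) -> str:
    """Summarise how the child process ended."""
    # On POSIX, returncode == -N means the child was terminated by signal N.
    # On Windows, crashes show up as large positive exit codes instead.
    if result.returncode < 0:
        return f"killed by {signal.Signals(-result.returncode).name}"
    return f"exited with code {result.returncode}"
```

Calling `describe(run_isolated(LEXBOR_SNIPPET))` would then report whether the child exited normally or was killed by a signal, and `result.stderr` should contain the faulthandler traceback pointing at the Python line that was executing at the time.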