pharo-contributions / Soup

A Pharo implementation of beautiful Soup
MIT License
6 stars 6 forks source link

SoupParser attribute finding regex has a bug #3

Closed pdebruic closed 2 years ago

pdebruic commented 2 years ago

In the SoupParser>>#initialize method the attrFind instvar regex does not properly handle attributes that run right up against a self closing tag. e.g.

    <ellipse id="XMLID_29_" fill="#0084A5" cx="188.8" cy="147.2" rx="22.7" ry="15.3"/>

The ry attribute is parsed at '"15.3"/' and not '15.3'

The "broken" regex is

\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=\s*('[^']*'|"[^"]*"|[\-a-zA-Z0-9./,:;+*%?!&$()_#=~'"@]*))?

changing it to

\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=\s*('[^']*'|"[^"]*"|[\-a-zA-Z0-9.,:;+*%?!&$()_#=~'"@]*))?

by removing the last forward slash fixes my issue. But I don't know what else it might break. Certainly parsing trailing slashes in HTML element attributes. I assume they're allowed in the spec. But don't know.

Ducasse commented 2 years ago

I have no idea but if it works for you. :) let us write more tests and see.

seandenigris commented 2 years ago

This bug is currently causing two failing tests: #testFindTag and #testFindTagIfPresent.

The fix makes those two pass, but the following assertion in #testQuotedAttributeValues fail:

self 
        assertSoupFrom: '<a href="http://foo.com/<a> and blah and blah'
        printsAs: '<a href=''"http://foo.com/''></a><a> and blah and blah</a>'.

IMHO this is "better" because an unescaped, unmatched, and unquoted double-quote in an attribute seems like an outlier and the bug described above a much more common case.

BTW where did these brain*k regexes come from? Did we inherit from Beautiful Soup? It could be nice to reimplement in something readable like PP2, but who has the time ha ha?!