scrapy / scrapely

A pure-python HTML screen-scraping library

Change _parse_tag to follow HTML spec for repeated attributes #78

Closed ruairif closed 8 years ago

ruairif commented 9 years ago

According to the HTML5 spec, when a tag contains repeated attribute names, only the first attribute with a given name should be kept; any later attributes with that name should be ignored.

This could change the behavior of spiders that are already in use (as evidenced by having to change one of the test cases to accommodate the change).

The relevant part of the spec is here. It reads:

When the user agent leaves the attribute name state (and before emitting the tag token,
if appropriate), the complete attribute's name must be compared to the other attributes on
the same token; if there is already an attribute on the token with the exact same name,
then this is a parse error and the new attribute must be removed from the token.
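
To illustrate the rule quoted above, here is a minimal sketch of "first attribute wins" parsing. It is not scrapely's actual `_parse_tag` code: the function name `parse_attributes`, the regex `_ATTR_RE`, and the simplified attribute grammar are all assumptions made for this example.

```python
import re

# Hypothetical, simplified attribute pattern (not scrapely's real regex):
# a name, optionally followed by ="value", ='value', or an unquoted value.
_ATTR_RE = re.compile(
    r"""([^\s"'<>/=]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))?"""
)


def parse_attributes(attr_text):
    """Parse the attribute portion of a tag, keeping the first value
    seen for any repeated attribute name."""
    attributes = {}
    for match in _ATTR_RE.finditer(attr_text):
        # HTML attribute names are ASCII case-insensitive, so compare lowercased.
        name = match.group(1).lower()
        # Use whichever value form matched; None means a valueless attribute.
        value = next((g for g in match.groups()[1:] if g is not None), None)
        # Per the spec text: a repeated name is a parse error and the
        # new attribute is dropped, so only store names not yet seen.
        if name not in attributes:
            attributes[name] = value
    return attributes
```

With this sketch, `parse_attributes('class="first" id="x" class="second"')` keeps `class="first"` and discards the later `class="second"`, which is the behavior change that could affect existing spiders relying on the last value.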

kmike commented 9 years ago

This makes sense. Could you please check the test failures? Travis is red.

ruairif commented 9 years ago

Sorry about that. I ran the tests locally from the wrong directory. The tests pass now.