scinfu / SwiftSoup

SwiftSoup: Pure Swift HTML Parser, with best of DOM, CSS, and jquery (Supports Linux, iOS, Mac, tvOS, watchOS)
https://scinfu.github.io/SwiftSoup/
MIT License
4.52k stars, 345 forks

Unable to parse documents with un-quoted attribute values #217

Open dreystone opened 2 years ago

dreystone commented 2 years ago

I encountered some pages that had been minified, and the meta and link tags in the head had no quotes around their attribute values.

According to the WHATWG HTML specification, unquoted attribute values are permitted in HTML5:

https://html.spec.whatwg.org/multipage/syntax.html#attributes-2
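
For reference, the unquoted syntax in that section requires a value to be non-empty and free of ASCII whitespace and the characters ", ', =, <, > and `. Here is a minimal sketch of that character rule in Swift. `canBeUnquoted` is a hypothetical helper written for illustration, not part of SwiftSoup, and it ignores the spec's separate rules about character references.

```swift
// Scalars the HTML syntax forbids in an unquoted attribute value:
// ASCII whitespace (space, tab, LF, FF, CR) plus " ' = < > and `.
let forbiddenInUnquotedValue: Set<Unicode.Scalar> = [
    " ", "\t", "\n", "\u{0C}", "\r", "\"", "'", "=", "<", ">", "`",
]

/// Hypothetical helper (not a SwiftSoup API): returns true when `value`
/// may be written without quotes under the character rule above.
func canBeUnquoted(_ value: String) -> Bool {
    // Check scalars rather than Characters so that "\r\n", which Swift
    // treats as a single Character, is still rejected.
    return !value.isEmpty
        && !value.unicodeScalars.contains(where: { forbiddenInUnquotedValue.contains($0) })
}

print(canBeUnquoted("utf-8"))    // true
print(canBeUnquoted("IE=edge"))  // false: contains "="
print(canBeUnquoted(""))         // false: empty value
```

This also explains why a minifier would leave content="IE=edge" quoted while stripping the quotes from every other attribute in the example.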

Here is a code example which fails:

<!DOCTYPE html>
<html lang=en-US>
<head>
    <meta charset=utf-8><meta content="IE=edge" http-equiv=X-UA-Compatible>
    <meta content=unsafe-url name=referrer>
    <link href=/images/favicons/favicon--16x16.png rel=icon sizes=16x16 type=image/png>
</head>
<body>
    page contents
</body>
</html>

scinfu commented 2 years ago

I parsed this HTML without problems. Can you explain what does not work?

nikolaykargin commented 2 years ago

This snippet was correctly parsed with the latest version of the library. Below is the code snippet and output.

import Foundation
import SwiftSoup

let html = """
<!DOCTYPE html>
<html lang=en-US>
<head>
    <meta charset=utf-8><meta content="IE=edge" http-equiv=X-UA-Compatible>
    <meta content=unsafe-url name=referrer>
    <link href=/images/favicons/favicon--16x16.png rel=icon sizes=16x16 type=image/png>
</head>
<body>
    page contents
</body>
</html>
"""

let doc = try SwiftSoup.parse(html)

let metaElements = try doc.select("head *")
for meta in metaElements {
    if let attributes = meta.getAttributes() {
        print(meta.tagName(), attributes.map { "\($0.getKey())=\($0.getValue())" })
    }
}

print(try doc.body()?.text() ?? "–")

Output:

meta ["charset=utf-8"]
meta ["content=IE=edge", "http-equiv=X-UA-Compatible"]
meta ["content=unsafe-url", "name=referrer"]
link ["href=/images/favicons/favicon--16x16.png", "rel=icon", "sizes=16x16", "type=image/png"]
page contents