scinfu / SwiftSoup

SwiftSoup: Pure Swift HTML Parser, with best of DOM, CSS, and jquery (Supports Linux, iOS, Mac, tvOS, watchOS)
https://scinfu.github.io/SwiftSoup/
MIT License

Clean document and encoding for mailto: protocol results in unexpected output. #268

Open TasikBeyond opened 5 months ago

TasikBeyond commented 5 months ago

Bug Report

Cleaning a document percent-encodes some characters twice. This only happens when the original HTML data contains both a %20 and a [ or ].

How to Reproduce

import SwiftSoup

let html = #"<a href="mailto:mail@example.com?subject=Job%20Requisition[NID]">Send</a></body></html>"#

let document = try SwiftSoup.parse(html)
let outputSettings = OutputSettings()
outputSettings.prettyPrint(pretty: false)
document.outputSettings(outputSettings)

let headWhitelist: Whitelist = {
    do {
        let customWhitelist = Whitelist.none()
        try customWhitelist
            .addTags("a")
            .addAttributes("a", "href")
            .addProtocols("a", "href", "mailto")
        return customWhitelist
    } catch {
        fatalError("Couldn't init head whitelist")
    }
}()

print("Original Document (before clean): ", document)
let cleaned = try Cleaner(headWhitelist: headWhitelist, bodyWhitelist: headWhitelist).clean(document)
print("Original Document (after clean): ", document)
print("Clean Document: ", cleaned)

Expected Behavior

Cleaning this input:

let html = #"<a href="mailto:mail@example.com?subject=Job%20Requisition[NID]">Send</a></body></html>"#

should result in:

<html>
 <head></head>
 <body>
  <a href="mailto:mail@example.com?subject=Job%20Requisition%5BNID%5D">Send</a>
 </body>
</html>

Actual Behavior

<html>
 <head></head>
 <body>
  <a href="mailto:mail@example.com?subject=Job%2520Requisition%5BNID%5D">Send</a>
 </body>
</html>

Note: %2520 appears to be the original %20 encoded a second time, with its % percent-encoded as %25.
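To illustrate the double encoding noted above, here is a minimal sketch that uses only Foundation. It is not SwiftSoup's internal code, and whether the Cleaner takes this path is an assumption. It only shows how percent-encoding a string that already contains %20 produces %2520, and how decoding first avoids that.

```swift
import Foundation

// The href from the report; it already contains one percent escape (%20).
let href = "mailto:mail@example.com?subject=Job%20Requisition[NID]"

// Naive pass: "%" is not in urlQueryAllowed, so the existing "%20"
// is itself encoded and becomes "%2520". "[" and "]" become %5B and %5D.
let doubleEncoded = href.addingPercentEncoding(withAllowedCharacters: .urlQueryAllowed) ?? href

// Decode first, then encode once, so existing escapes are not encoded again.
let decoded = href.removingPercentEncoding ?? href
let encodedOnce = decoded.addingPercentEncoding(withAllowedCharacters: .urlQueryAllowed) ?? decoded

print(doubleEncoded) // mailto:mail@example.com?subject=Job%2520Requisition%5BNID%5D
print(encodedOnce)   // mailto:mail@example.com?subject=Job%20Requisition%5BNID%5D
```

The first output matches the Actual Behavior above, and the second matches the Expected Behavior.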

Environment

SwiftSoup Version: 2.6.1
Xcode Version: 15.3

Additional Notes

I print the original document before and after calling clean(document) because it appears that both the original document and the cleaned document are being modified.

print("Original Document (before clean): ", document)
let cleaned = try Cleaner(headWhitelist: headWhitelist, bodyWhitelist: headWhitelist).clean(document)
print("Original Document (after clean): ", document)
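A possible way to confirm the mutation without comparing printed output by eye is to capture the serialized original before and after cleaning and compare the two strings. This is only a sketch built from the calls used in this report plus outerHtml(), select(_:) and attr(_:). The names before, after and cleanedHref are illustrative, and no particular result is assumed.

```swift
import SwiftSoup

let html = #"<a href="mailto:mail@example.com?subject=Job%20Requisition[NID]">Send</a>"#
let document = try SwiftSoup.parse(html)

// Same whitelist as in the reproduction: only <a href> with the mailto protocol.
let whitelist = try Whitelist.none()
    .addTags("a")
    .addAttributes("a", "href")
    .addProtocols("a", "href", "mailto")

// Serialize the original before and after clean(_:) to detect any mutation.
let before = try document.outerHtml()
let cleaned = try Cleaner(headWhitelist: whitelist, bodyWhitelist: whitelist).clean(document)
let after = try document.outerHtml()

// Read the href back from the cleaned copy to inspect its encoding.
let cleanedHref = try cleaned.select("a").attr("href")

print("Original mutated by clean:", before != after)
print("Cleaned href:", cleanedHref)
```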