Closed chriseidhof closed 3 years ago
Makes sense, thanks!
After skimming the implementation of CFXMLCreateStringByEscapingEntities
, I tried a similar approach to get named escape sequences – which I believe are a better fit given the use case.
I ran a benchmark of this:
import Foundation
extension String {
func numerical_addingXMLEncoding() -> String {
var result = ""
result.reserveCapacity(count)
return unicodeScalars.reduce(into: result, { $0.append($1.numberical_escapingIfNeeded) })
}
func named_addingXMLEncoding() -> String {
var result = ""
result.reserveCapacity(count)
return unicodeScalars.reduce(into: result, { $0.append($1.named_escapingIfNeeded) })
}
func namedUTF8_addingXMLEncoding() -> String {
var result = ""
result.reserveCapacity(count)
return utf8.reduce(into: result, { $0.append($1.named_escapingIfNeeded) })
}
func cf_addingXMLEncoding() -> String {
withCFString { string -> NSString in
CFXMLCreateStringByEscapingEntities(nil, string, nil)
} as String
}
private func withCFString<Result>(_ body: (CFString) throws -> Result) rethrows -> Result {
try withCString { cString in
try body(CFStringCreateWithCString(nil, cString, CFStringBuiltInEncodings.UTF8.rawValue))
}
}
}
extension UnicodeScalar {
public var htmlEscaped: String {
return "&#\(value);"
}
/// Escapes the scalar only if it needs to be escaped for Unicode pages.
///
/// [Reference](http://wonko.com/post/html-escaping)
fileprivate var numberical_escapingIfNeeded: String {
switch value {
case 33, 34, 36, 37, 38, 39, 43, 44, 60, 61, 62, 64, 91, 93, 96, 123, 125: return htmlEscaped
default: return String(self)
}
}
fileprivate var named_escapingIfNeeded: String {
switch value {
case ("&" as Unicode.Scalar).value:
return "&"
case ("<" as Unicode.Scalar).value:
return "<"
case (">" as Unicode.Scalar).value:
return ">"
case ("\'" as Unicode.Scalar).value:
return "'"
case ("\"" as Unicode.Scalar).value:
return """
default:
return String(self)
}
}
}
private extension UInt8 {
var named_escapingIfNeeded: String {
switch self {
case 38 : // "&"
return "&"
case 60 : // "<"
return "<"
case 62 : // ">"
return ">"
case 39 : // "\'"
return "'"
case 34 : // "\""
return """
default:
return String(Unicode.Scalar(self))
}
}
}
with this String:
let string = """
<?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
"""
benchmark("Escape with CF") {
let _ = string.cf_addingXMLEncoding()
}
benchmark("Escape with Numbers (Chris)") {
let _ = string.numerical_addingXMLEncoding()
}
benchmark("Escape with Names (Robb)") {
let _ = string.named_addingXMLEncoding()
}
benchmark("Escape with Names (Robb UTF8)") {
let _ = string.namedUTF8_addingXMLEncoding()
}
and got the following result (M1 MacBook Air, 16 GB RAM):
name time std iterations
-----------------------------------------------------------------
Escape with CF 4000.000 ns ± 6.96 % 349659
Escape with Numbers (Chris) 7416.000 ns ± 5.20 % 188748
Escape with Names (Robb) 5333.000 ns ± 6.70 % 261855
Escape with Names (Robb UTF8) 5208.000 ns ± 5.09 % 269007
seems like the compiler doesn't inline the string interpolation in Maybe it's just the number of switch cases? Dunno ¯_(ツ)_/¯ htmlEscaped
?
Either way, I suggest we use a named approach and hardcode the handful of escape sequences we need.
Looks good to me! I have absolutely no preference as to what to use. If anything, faster = better!
Do you think it's better to drop CF and have uniform output across all platforms, or to use CF on Apple platforms and have a faster implementation?
Do you think it's better to drop CF and have uniform output across all platforms, or to use CF on Apple platforms and have a faster implementation?
Yeah, I think using only one implementation is simpler. I also did a quick test and building a character set for the five characters and testing if they are part of the string first seemed to also help performance, so maybe we can get closer to Apple still.
If you merge in main
, you can try the above benchmark if you want to
Okay, I tried both named implementations on our code base. It turns out that the UTF8 version is wrong. I added a test for this. Because it turns every byte into a string it breaks with scalars that consists of multiple bytes.
It seems that
CFXMLCreateStringByEscapingEntities
is not available under Linux. (Or maybe I'm wrong, that would simplify things).This PR works around that by adding quotes manually. I haven't tested the performance, but I think this will be slower than the built-in escaping, so I only enabled it for Linux. This might give different output depending on the system.
We use the same algorithm for our HTML library in the Swift Talk backend, and it has served us well so far.
To test this locally on my M1 I have the following script (modified from here):
When you run that you get a console into which you can type
swift test
and verify that all tests pass on Linux.Once/if this is merged in, we could also enable GitHub actions.