Linux Support - Githubissues

chriseidhof commented 3 years ago

It seems that CFXMLCreateStringByEscapingEntities is not available under Linux. (Or maybe I'm wrong, that would simplify things).

This PR works around that by adding quotes manually. I haven't tested the performance, but I think this will be slower than the built-in escaping, so I only enabled it for Linux. This might give different output depending on the system.

We use the same algorithm for our HTML library in the Swift Talk backend, and it has served us well so far.

To test this locally on my M1 I have the following script (modified from here):

#!/bin/bash
docker run --rm --privileged --interactive --tty \
                                  --security-opt seccomp=unconfined \
                  --volume "$(pwd):/src" \
                                  --workdir "/src" \
                                  swiftarm/swift

When you run that you get a console into which you can type swift test and verify that all tests pass on Linux.

Once/if this is merged in, we could also enable GitHub actions.

robb commented 3 years ago

Makes sense, thanks!

After skimming the implementation of CFXMLCreateStringByEscapingEntities, I tried a similar approach to get named escape sequences – which I believe are a better fit given the use case.

I ran a benchmark of this:

import Foundation

extension String {
    func numerical_addingXMLEncoding() -> String {
        var result = ""
        result.reserveCapacity(count)
        return unicodeScalars.reduce(into: result, { $0.append($1.numberical_escapingIfNeeded) })
    }

    func named_addingXMLEncoding() -> String {
        var result = ""
        result.reserveCapacity(count)
        return unicodeScalars.reduce(into: result, { $0.append($1.named_escapingIfNeeded) })
    }

    func namedUTF8_addingXMLEncoding() -> String {
        var result = ""
        result.reserveCapacity(count)
        return utf8.reduce(into: result, { $0.append($1.named_escapingIfNeeded) })
    }

    func cf_addingXMLEncoding() -> String {
        withCFString { string -> NSString in
            CFXMLCreateStringByEscapingEntities(nil, string, nil)
        } as String
    }

    private func withCFString<Result>(_ body: (CFString) throws -> Result) rethrows -> Result {
        try withCString { cString in
            try body(CFStringCreateWithCString(nil, cString, CFStringBuiltInEncodings.UTF8.rawValue))
        }
    }
}

extension UnicodeScalar {
    public var htmlEscaped: String {
        return "&#\(value);"
    }

    /// Escapes the scalar only if it needs to be escaped for Unicode pages.
    ///
    /// [Reference](http://wonko.com/post/html-escaping)
    fileprivate var numberical_escapingIfNeeded: String {
        switch value {
        case 33, 34, 36, 37, 38, 39, 43, 44, 60, 61, 62, 64, 91, 93, 96, 123, 125: return htmlEscaped
        default: return String(self)
        }

    }

    fileprivate var named_escapingIfNeeded: String {
        switch value {
        case ("&" as Unicode.Scalar).value:
            return "&amp;"
        case ("<" as Unicode.Scalar).value:
            return "&lt;"
        case (">" as Unicode.Scalar).value:
            return "&gt;"
        case ("\'" as Unicode.Scalar).value:
            return "&apos;"
        case ("\"" as Unicode.Scalar).value:
            return "&quot;"
        default:
            return String(self)
        }
    }
}

private extension UInt8 {
    var named_escapingIfNeeded: String {
        switch self {
        case 38 : // "&"
            return "&amp;"
        case 60 : // "<"
            return "&lt;"
        case 62 : // ">"
            return "&gt;"
        case 39 : // "\'"
            return "&apos;"
        case 34 : // "\""
            return "&quot;"
        default:
            return String(Unicode.Scalar(self))
        }
    }
}

with this String:

let string = """
        <?xml version="1.0" encoding="UTF-8"?>
        <note>
        <to>Tove</to>
        <from>Jani</from>
        <heading>Reminder</heading>
        <body>Don't forget me this weekend!</body>
        </note>
        """

benchmark("Escape with CF") {
    let _ = string.cf_addingXMLEncoding()
}

benchmark("Escape with Numbers (Chris)") {
    let _ = string.numerical_addingXMLEncoding()
}

benchmark("Escape with Names (Robb)") {
    let _ = string.named_addingXMLEncoding()
}

benchmark("Escape with Names (Robb UTF8)") {
    let _ = string.namedUTF8_addingXMLEncoding()
}

and got the following result (M1 MacBook Air, 16 GB RAM):

name                          time          std        iterations
-----------------------------------------------------------------
Escape with CF                  4000.000 ns ±   6.96 %     349659
Escape with Numbers (Chris)     7416.000 ns ±   5.20 %     188748
Escape with Names (Robb)        5333.000 ns ±   6.70 %     261855
Escape with Names (Robb UTF8)   5208.000 ns ±   5.09 %     269007

~~seems like the compiler doesn't inline the string interpolation in htmlEscaped?~~ Maybe it's just the number of switch cases? Dunno ¯_(ツ)_/¯

Either way, I suggest we use a named approach and hardcode the handful of escape sequences we need.

chriseidhof commented 3 years ago

Looks good to me! I have absolutely no preference as to what to use. If anything, faster = better!

Do you think it's better to drop CF and have uniform output across all platforms, or to use CF on Apple platforms and have a faster implementation?

robb commented 3 years ago

Do you think it's better to drop CF and have uniform output across all platforms, or to use CF on Apple platforms and have a faster implementation?

Yeah, I think using only one implementation is simpler. I also did a quick test and building a character set for the five characters and testing if they are part of the string first seemed to also help performance, so maybe we can get closer to Apple still.

robb commented 3 years ago

If you merge in main, you can try the above benchmark if you want to

chriseidhof commented 3 years ago

Okay, I tried both named implementations on our code base. It turns out that the UTF8 version is wrong. I added a test for this. Because it turns every byte into a string it breaks with scalars that consists of multiple bytes.

robb / Swim

Linux Support #26