Closed randomdross closed 4 years ago
Hi @randomdross, thanks for doing this!
Is this done via the BOM or sniffing?
Both. It checks for a BOM first. If found, it uses it. If there's no BOM, it looks at the pattern of NULs in the first code unit. Every JSON blob must start with an ASCII character (as all JSON syntax is ASCII), so UTF-32LE, UTF-32BE, UTF-16LE, and UTF-16BE can all be detected exactly by looking at the first 4 or 2 bytes. If It doesn't detect it as UTF-32 or UTF-16 it assumes UTF-8.
What happens if malicious data within JSON includes a BOM in order to trick the parser into parsing data as JSON that should just be considered to be data?
Data within a JSON blob can't have a BOM, the BOM must be at the start of the overall Data
object.
If you're asking "what happens if I stick a BOM on garbage data", well, the same thing that happens if you use garbage data without a BOM, the only thing the BOM does is determine whether we try to decode the data as UTF-16/UTF-32 or as UTF-8. Using a BOM to decode garbage data as UTF-16/UTF-32 will produce the same overall effect as decoding the garbage data as UTF-8.
How is malformed UTF-8 handled?
Any sequence of bytes that does not form a valid code unit in the chosen encoding produces a U+FFFD REPLACEMENT CHARACTER. If this occurs within a JSON string, that's fine (as that's a legal scalar to find within a string), but outside of a string, it will be rejected by the tokenizer as a syntax error.
Ultimately, the UTF-8/UTF-16/UTF-32 decoding is implemented by the Swift standard library's built-in UTF8
/UTF16
/UTF32
decoders.
Is there normalization? E.g.: Could a double-width quote in data be normalized back to a quote somewhere, enabling data that’s supposed to be encoded within JSON to actually affect the JSON structure as understood by the parser?
No, there is no normalization performed. A U+FF02 FULLWIDTH QUOTATION MARK in the JSON source, if found inside a JSON string will be interpreted as part of the string's contents, and if found outside a JSON string will be interpreted as a syntax error. Swift never automatically normalizes; in order to normalize strings you need to access the relevant Foundation property (e.g. decomposedStringWithCompatibilityMapping
) and PMJSON does not do this.
Usage of various things marked “unsafe”
UnsafeMutableRawPointer
is the equivalent of using void *
. UnsafeBufferPointer<CChar>
is the equivalent of a const char *
coupled with a length. These have "unsafe" in the name because Swift can't prove that you're using them correctly (as they're ultimately just C pointers).
The usage in Parser.swift
is to try and avoid having NSString
make a copy of the string data, given that the NSString
instance isn't escaping the function (it's just being used to construct a Decimal
, which does not retain the string data). The safety is ensured through the use of UnsafeBufferPointer
(which carries a length, so we don't overrun the buffer), and Swift's implicit conversion from a ContiguousArray
at the caller into an UnsafeBufferPointer
(meaning the actual source data is managed by Swift in an array, it's just temporarily converted to an UnsafeBufferPointer
so we can call the NSString(bytesNoCopy:length:encoding:freeWhenDone:)
method).
The usage in Foundation.swift
is to construct an iterator over the NSData
's data. NSData
does not provide its own iterator, but it does provide the bytes
property, which is an UnsafeRawPointer
(a const void *
). We use that plus the NSData
's count
to produce an UnsafeBufferPointer
, which acts as a Collection
that provides access to the bytes. Or to put it another way, we're wrapping a memory-safe collection around NSData
. So despite the name "Unsafe" we're actually adding safety.
Data within a JSON blob can't have a BOM, the BOM must be at the start of the overall
Data
object. Great. I'd assert that if an attacker has control of the overallData
object then they already control the JSON content as a whole. So this is fine.
Everything else here looks good too. Re Swift (again, apologies for being a Swift newb...) it sounds like you're saying that PMJSON uses some functionality that is not guaranteed to be memory-safe, but it does so in a straightforward way that is understood to be safe. If so, you may want to consider implementing some fuzzing just in case the code changes in the future, or in case there's an obscure bug that may have been overlooked.
you're saying that PMJSON uses some functionality that is not guaranteed to be memory-safe, but it does so in a straightforward way that is understood to be safe.
Pretty much. It uses some code that the Swift compiler cannot ensure is memory-safe, so those snippets of code are roughly equivalent to the safety of C code, but this not-guaranteed-to-be-safe code has a very limited scope and it's pretty easy for the programmer to prove that it's safe.
I'm going to go ahead and close this. Feel free to reopen if you have any more concerns.
I did a quick security-focused review and these are my notes. PTAL and assess if there's anything that makes sense to follow up on.
“A convenience method is also provided for decoding from a Data containing data encoded as UTF-8, UTF-16, or UTF-32”
Usage of various things marked “unsafe”... Not immediately sure if these are being used in a safe manner: https://github.com/postmates/PMJSON/search?q=unsafe&unscoped_q=unsafe ...or the extent to which this matters, given the properties of swift w.r.t. memory safety.
Sorry for not being more specific on a lot of this -- I'm new to swift. I think it all of this may just translate to having appropriate test cases, such as using the BOM inside JSON data being decoded, etc. as per above.