swiftlang / swift

The Swift Programming Language
https://swift.org
Apache License 2.0
67.3k stars 10.34k forks source link

[SR-16084] Substring.UnicodeScalarView.replaceSubrange doesn't correctly handle bridged strings #58343

Open lorentey opened 2 years ago

lorentey commented 2 years ago
Previous ID SR-16084
Radar rdar://91074020
Original Reporter @lorentey
Type Bug
Environment - All shipping Swift releases since at least 5.0 - Swift main as of 2022-03-30
Additional Detail from JIRA | | | |------------------|-----------------| |Votes | 0 | |Component/s | Standard Library | |Labels | Bug | |Assignee | None | |Priority | Medium | md5: 18f92a6c21179306e3e4f10b69940880

Issue Description:

import Foundation

let str = "\u\{1F4A5}\u\{1F4A5}\u\{1F4A5} Hello!" as NSString as String
let scalars = str.unicodeScalars

let i = scalars.firstIndex(of: "H")!
let j = scalars.firstIndex(of: "!")!
var sub = scalars[i ..< j]
sub.replaceSubrange(j ..< j, with: ", world".unicodeScalars)
// `sub` should now contain "Hello, world"

print(Array(sub))
// ["\u\{00A3}", "\u\{0001F4A5}", " ", "H", "e", "l", "l", "o", ","]

print(String(sub)) // "\x\{A3}\u\{1F4A5} Hello,"
print(Substring(sub).utf8.map \{ String($0, radix: 16) })
// ["a3", "f0", "9f", "92", "a5", "20", "48", "65", "6c", "6c", "6f", "2c"]
// 0xA3 is a UTF-8 continuation byte.
lorentey commented 2 years ago

@swift-ci create

lorentey commented 2 years ago

This is because one of the `replaceSubrange` implementations on the standard, generic `Slice` type assumes that indices before the replaced range remain valid after the replacement.

`replaceSubrange` is documented to invalidate all indices, no matter where they are in relation to the replaced range.

In this particular case, the bridged string measures index offsets in UTF-16 code units, but it gets converted to a UTF-8 native string as a side effect of the mutation. `Slice.replaceSubrange` fails to update the `startIndex`, and the old index offset gets reinterpreted in UTF-8.

`Slice.replaceSubrange` needs to always recalculate the bounds from scratch. (This will sadly regress performance in the usual case, but it is unavoidable for correctness.)