swiftcsv / SwiftCSV

CSV parser for Swift
MIT License
947 stars 190 forks source link

enumerateAsDict crashing on ios15 when <U+FEFF> Zero Width No-Break Space missing (Byte Order Mark) #97

Closed varun-s closed 2 years ago

varun-s commented 2 years ago

First column is always returning nil

enumerateAsDict crashing on ios15 when <U+FEFF> Zero Width No-Break Space missing in start of file. Was working fine before iOS15.

Steps to reproduce:

  1. Download CSV from Airtable (We use Airtable.com for our CSV docs)
  2. Copy to xcode.
  3. Parse

Simple fix. Open the CSV in xcode and just delete the first column header and add it back. On doing this the <U+FEFF> Zero Width No-Break Space gets added to the start and parsing works fine again.

DivineDominion commented 2 years ago

Thanks for reporting!

That's weird, why is that character needed at all? 🤔

Could you write a minimal failing test in a PR as a base for a fix?

arno608rw commented 2 years ago

same print(dict["name"]) --> dict["\Ufeffname"]

DivineDominion commented 2 years ago

The addition of this character sounds very much like a bug on e.g. Airtable's end if they add the U+FEFF character.

Why Xcode would add that back in is a mystery to me @varun-s -- CSV should not require a leading non-breaking space.

lardieri commented 2 years ago

Not a bug in either Air Table or Xcode. The U+FEFF character is the "Byte Order Mark" and it is a long-established part of the Unicode standard.

In version 0.6.x, if you load a CSV file encoded in UTF-8 with a BOM, such as one generated by modern versions of Excel, then the U+FEFF winds up embedded as the first character of the first column header, rather than getting discarded by the parser as it should.

(lldb) po importedCSV.header[0].unicodeScalars
▿ StringUnicodeScalarView("Date")
  - 0 : "\u{FEFF}"
  - 1 : "D"
  - 2 : "a"
  - 3 : "t"
  - 4 : "e"

I can't currently try version 0.8.0 due to the breaking change removing .namedRows, which my code relies on. But I'm assuming the behavior hasn't changed. I can open a separate, linked issue if you would like.

DivineDominion commented 2 years ago

@lardieri Thanks for the pointer, didn't know that! I've been too dismissive of the issue too quickly it seems.

It sounds like it'd be most accurate to extract the BOM into CSV 'metadata' when parsing a string and/or file, then add it back it when re-serializing the CSV.

https://github.com/swiftcsv/SwiftCSV/blob/048a1d3c2950b9c151ef9364b36f91baadc2c28c/SwiftCSV/CSV.swift#L100-L102


I still don't know and would need to research whether the BOM is mandatory, recommended, or even widely supported.

Still odd that it crashed for @varun-s IMHO

lardieri commented 2 years ago

Lots of bandwidth and storage have been spent on that topic, and your favorite search engine can show you a sample. People's opinions of BOMs generally correlate with their opinions of Microsoft. As you're reading, please remember that when Unicode was invented, big-endian machines were much more prevalent on the internet than they are today. The Unicode standard was written to embrace big-endian byte order as the assumed default, which forced Microsoft to put BOMs on everything, since Windows has always been little-endian. That seems quaint today, when Mac, Linux, iOS, and Android are also little-endian, but it absolutely was a big deal 30 years ago.

The Unicode standard says that BOMs are allowed, but not required or encouraged, for UTF-8 files. The BOM of a UTF-8 file is always 0xEF, 0xBB, 0xBF regardless of the architecture of the machine. Microsoft chooses to put BOMs on their UTF-8 files, not because they provide any information about byte order, but in order to distinguish a UTF-8 file from any other "extended ASCII" file encoding, such as ISO Latin 1 / Windows 1252.

With that background out of the way...

I think we are looking at a bug, or at least a poorly-thought-out design, in Apple's standard library. A very long time ago, Apple had a String class method, string(contentsOfFile:), that would explicitly recognize the UTF-8 BOM and strip it. See:

https://developer.apple.com/documentation/foundation/nsstring/1497269-string

If there was no BOM, this method would assume the "default C encoding" — just as Windows applications would assume the default Windows code page, as noted above.

Apple quickly replaced this method with the string(contentsOf:encoding:) method that forces you to pass an explicit encoding argument. When you pass .utf16, .utf32, or .unicode encodings, Apple's string decoder recognizes the BOM and strips it out. Unfortunately, it doesn't do this for the .utf8 encoding. Instead, it decodes the BOM as its Unicode code point, \u{FEFF} and puts it as the first character of the string.

This means after your convenience initializer calls String(contents:encoding:), you have to look for the BOM and strip it manually before passing the string to your designated initializer. Something like this:

    var contents = try String(contentsOf: url, encoding: encoding)
    if contents.hasPrefix("\u{FEFF}") {
        contents.removeFirst()
    }

You don't need to remember the BOM in metadata or add the BOM back when serializing the file, because it's implied by the outgoing encoding. When you call dataUsingEncoding(.utf16) or dataUsingEncoding(.utf32) or dataUsingEncoding(.unicode), Apple puts the BOM on for you. No other encoding needs it, although UTF-8 might want it. Also, remember that a CSV might get deserialized from one encoding and then reserialized to a different encoding.

To support callers who want to serialize 100% Microsoft-compatible UTF-8 CSV files with BOMs, you could augment dataUsingEncoding(_:) somehow.

extension String.Encoding {
    static let utf8WithBOM = String.Encoding(rawValue: UInt(12345))
}

The serialization aspect of this issue is really an API design question, so it's up to you. But for the deserialization aspect, I'd recommend what I wrote above: look for the BOM just at the point of loading the string from the URL, and strip it there.