rubymotion-community / motion-support

Commonly useful extensions to the standard library for RubyMotion
MIT License

Handle array access syntax on strings with emoji #34

Open jbender opened 8 years ago

jbender commented 8 years ago

Emoji pose a problem for array-like access on a string. If you try to grab a single character you'll get an error: "You can't cut a surrogate in two in an encoding that is not UTF-16 (IndexError)". Calling split(''), which splits the string into an array of characters, works correctly even with emoji. So to make this method work as expected for strings, first split the string, index into the resulting array, then join the result.
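A minimal sketch of that split-then-join idea, for illustration only. The helper name `emoji_safe_slice` is hypothetical and is not the code in this PR, which overrides String#[] directly. It also only covers integer, (start, length) and range arguments, not the substring or regexp forms that String#[] accepts.

```ruby
# Hypothetical helper (not the PR's code): index into the array of
# characters instead of into the string itself, then join the result.
def emoji_safe_slice(string, *args)
  chars = string.split('')   # one element per character, emoji included
  result = chars[*args]      # Array#[] takes an index, (start, length) or a range
  # A single index yields one character (or nil); the other forms yield
  # an array of characters that has to be joined back into a string.
  result.is_a?(Array) ? result.join : result
end

emoji_safe_slice("a😁b", 1)     # single character
emoji_safe_slice("a😁b", 0, 2)  # start and length
emoji_safe_slice("a😁b", 1..2)  # range
```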

colinta commented 8 years ago

I'm nervous about overriding String#[], but not opposed. Maybe we could add some more specs in there, to make sure the default behavior is "as expected?" (I didn't look in the existing specs to see if those are already tested, I just looked at the diff)

tkadauke commented 8 years ago

Few questions:

1) How is the performance compared to vanilla String#[]?
2) Have you considered adding a new method (e.g. utf8_char_at, or maybe just at)?
3) Maybe it's a good idea to subclass String instead (UTF8String?) and then override the [] operator?
4) Shouldn't this be fixed in RubyMotion instead?
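For reference, options 2 and 3 might look roughly like the sketch below. The names `utf8_char_at` and `UTF8String` are just the ones floated in the questions; neither exists in motion-support, and the bodies assume the split-based approach from the PR description.

```ruby
# Option 2: a separate, explicitly named method; String#[] stays untouched.
class String
  def utf8_char_at(index)
    split('')[index]   # nil when the index is out of range
  end
end

# Option 3: a subclass that overrides [] only for strings wrapped in it.
# This sketch handles integer, (start, length) and range arguments only.
class UTF8String < String
  def [](*args)
    result = split('')[*args]
    result.is_a?(Array) ? result.join : result
  end
end

"a😁b".utf8_char_at(1)
UTF8String.new("a😁b")[0, 2]
```

Either way the caller has to know up front that a string may contain emoji, which is the trade-off between these options and overriding String#[] itself.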

jbender commented 8 years ago

@tkadauke can't speak to the performance of it except to say that I've been using it in production for a few months now with no complaints. Do you have any specific tests you'd like to perform on it?

I'm a proponent of the "it should just work" principle, so I'd be opposed to making this its own method. In doing so you're forcing people to know if a string may contain an emoji at any point rather than just making sure they're always safe.

It'd be perfectly reasonable for this to be handled by RubyMotion itself (indeed probably preferred), but I happen to have an inside track to fix it here so thought I'd propose. 😁

tkadauke commented 8 years ago

Sorry for taking so long to get back to you. It just occurred to me that, as a compromise, we can make this opt-in. E.g. we can have a class method on String like:

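class String
  # Off by default, so existing callers keep the stock behavior.
  @@emoji_support = false

  # Keep a handle on the built-in implementation before overriding it.
  alias_method :bracket_access_original, :[]

  def self.enable_emoji_support
    @@emoji_support = true
  end

  def [](*args)
    return bracket_access_original(*args) unless @@emoji_support
    # ...
  end
end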

The reason is that a behavior change in getting the n-th character of a string can lead to catastrophic results. Ruby has been plagued by this ever since Unicode encodings became popular.
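To make the opt-in idea concrete, here is one way the whole thing could fit together. This is a sketch under assumptions, not the PR's implementation: the emoji-aware branch is the split-based approach from the PR description, `disable_emoji_support` is a hypothetical counterpart added for illustration, and only integer and range access is rerouted.

```ruby
class String
  @@emoji_support = false   # opt-in: off unless explicitly enabled

  # Keep the built-in around; the guard avoids re-aliasing the override
  # if this file is loaded twice.
  alias_method :bracket_access_original, :[] unless method_defined?(:bracket_access_original)

  def self.enable_emoji_support
    @@emoji_support = true
  end

  # Hypothetical counterpart, not part of the proposal above.
  def self.disable_emoji_support
    @@emoji_support = false
  end

  def [](*args)
    return bracket_access_original(*args) unless @@emoji_support

    # Substring and regexp arguments still go to the built-in method.
    first = args.first
    return bracket_access_original(*args) unless first.is_a?(Integer) || first.is_a?(Range)

    result = split('')[*args]   # assumed split-based approach
    result.is_a?(Array) ? result.join : result
  end
end

String.enable_emoji_support
"a😁b"[1]   # now routed through the character array
```

With the flag left off, every call goes straight to the original method, so apps that never opt in see no behavior change.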