ronin-rb / ronin-support

A support library for Ronin. Like activesupport, but for hacking!
https://ronin-rb.dev
GNU Lesser General Public License v3.0

`Encoding::JS.unescape` cannot parse Unicode surrogate pairs #519

Closed postmodern closed 4 months ago

postmodern commented 4 months ago

Currently, Encoding::JS.unescape raises an EncodingError when it tries to parse a Unicode surrogate pair. Surrogate pairs often occur in JavaScript strings containing emoji characters. The StringScanner algorithm should be adjusted to detect when the first escaped Unicode codepoint is a high surrogate (starts with \uD8.., \uD9.., \uDA.., or \uDB..) and the second escaped Unicode codepoint is a low surrogate (starts with \uDC.., \uDD.., \uDE.., or \uDF..), and then combine the two into a single codepoint.
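As a rough illustration of the StringScanner adjustment described above, here is a minimal sketch. It is not ronin-support's actual implementation: the method name `unescape_js_unicode` is hypothetical, and it only handles `\uXXXX` escapes, ignoring every other JavaScript escape sequence.

```ruby
require 'strscan'

# Hypothetical helper for illustration only; not part of ronin-support.
# Decodes \uXXXX escapes, combining surrogate pairs into one codepoint.
def unescape_js_unicode(source)
  scanner = StringScanner.new(source)
  result  = String.new(encoding: Encoding::UTF_8)

  until scanner.eos?
    if scanner.scan(/\\u(d[89ab]\h\h)\\u(d[c-f]\h\h)/i)
      # High surrogate (U+D800..U+DBFF) followed by a low surrogate
      # (U+DC00..U+DFFF): combine both halves into a single codepoint.
      hi = scanner[1].to_i(16)
      lo = scanner[2].to_i(16)
      result << (0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00))
    elsif scanner.scan(/\\u(\h{4})/)
      # Ordinary four-digit escape. A lone surrogate still raises here,
      # because it is not a valid UTF-8 codepoint.
      result << scanner[1].to_i(16)
    else
      # Any other character is copied through unchanged.
      result << scanner.getch
    end
  end

  result
end
```

The pair pattern is tried before the single-escape pattern, so a high surrogate is only decoded on its own when no low surrogate follows it.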

Example

"\uD83D\uDE80"

aka '🚀'

Example Solution

'"hello world! \\ud83d\\ude01"'
  .gsub(/
    \\u(d[89ab]\h\h)  # high surrogate, U+D800..U+DBFF
    \\u(d[cdef]\h\h)  # low surrogate, U+DC00..U+DFFF
  /ix) {
    hi, lo = $1, $2
    (0x1_0000 +
      (Integer(hi, 16) - 0xd800) * 0x400 +
      (Integer(lo, 16) - 0xdc00))
    .chr("UTF-8")
  }
# => "\"hello world! 😁\""

https://ruby.social/@nick_evans/112776837324476279