opal / opal

Ruby ♥︎ JavaScript
https://opalrb.com
MIT License

invalid multibyte escape: #1509

Open h4ck3rm1k3 opened 8 years ago

h4ck3rm1k3 commented 8 years ago

invalid multibyte escape: /[\x00-\x7F]|[\x80-\xBF][\xC0-\xF0]*|[\xC0-\xF0]/ (RegexpError)

I have not been able to create an isolated test case yet, but I think it happens on this line: https://github.com/rubysl/rubysl-rexml/blob/0ab7aae8d824606dc41e855445b1b993c25e9285/lib/rexml/text.rb#L142

/home/mdupont/experiments/opal/lib/opal/nodes/literal.rb:121:in `initialize': invalid multibyte escape: /[\x00-\x7F]|[\x80-\xBF][\xC0-\xF0]*|[\xC0-\xF0]/ (RegexpError)
    from /home/mdupont/experiments/opal/lib/opal/nodes/literal.rb:121:in `new'
    from /home/mdupont/experiments/opal/lib/opal/nodes/literal.rb:121:in `compile_static_regexp'
    from /home/mdupont/experiments/opal/lib/opal/nodes/literal.rb:100:in `compile'
    from /home/mdupont/experiments/opal/lib/opal/nodes/base.rb:190:in `compile_to_fragments'
    from /home/mdupont/experiments/opal/lib/opal/compiler.rb:310:in `process'
    from /home/mdupont/experiments/opal/lib/opal/nodes/base.rb:251:in `expr'
    from /home/mdupont/experiments/opal/lib/opal/nodes/arglist.rb:14:in `block in compile'
    from /home/mdupont/experiments/opal/lib/opal/nodes/arglist.rb:12:in `each'
    from /home/mdupont/experiments/opal/lib/opal/nodes/arglist.rb:12:in `compile'

ggrossetie commented 6 years ago

I can reproduce this issue:

bundle exec opal -e "gsub(/[\x80-\xff]/n, '')"

MRI:

2.4.1 :001 > 'yo'.gsub(/[\x80-\xff]/n, '')
 => "yo"

I think the root cause is that the n flag is skipped because it's not widely supported by JavaScript vendors. In this case we need the n flag to change the encoding to ASCII-8BIT: http://ruby-doc.org/core-2.5.0/Regexp.html#class-Regexp-label-Encoding
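
As a rough illustration of what the n flag does in MRI, the sketch below builds the same character class with and without Regexp::NOENCODING, which is the constant form of the n flag. The variable names are made up for the example, and it only shows MRI's behaviour; it says nothing about how Opal should map the flag to JavaScript.

```ruby
# The pattern is kept as plain ASCII text (backslash, x, 8, 0, ...), so the
# regexp engine, not the string literal, decodes the \xHH escapes.
# The encoding is set explicitly so the example does not depend on the
# source encoding of the file it is pasted into.
source = String.new('[\x80-\xff]', encoding: 'UTF-8')

# Without the flag the pattern is treated as UTF-8, where a lone 0x80 byte
# is not a valid character, so compilation fails.
utf8_error = begin
  Regexp.new(source)
  nil
rescue RegexpError => e
  e.message
end

# With NOENCODING the pattern is compiled as ASCII-8BIT (binary), where
# every byte is a valid one-byte character.
binary_re = Regexp.new(source, Regexp::NOENCODING)

# "caf\xC3\xA9" is "café" in UTF-8; String#b returns a binary copy.
stripped = "caf\xC3\xA9".b.gsub(binary_re, '')

puts utf8_error
puts binary_re.encoding
puts 'yo'.gsub(binary_re, '')
puts stripped
```

An ASCII-only string such as 'yo' can still be matched against the binary regexp, which is why the MRI session above returns "yo" unchanged.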

iliabylich commented 6 years ago

No, the issue here is that \x80 is not a valid UTF-8 character. You can parse it by adding a # encoding: ascii-8bit comment to the beginning of your file. I don't know why MRI parses it.

2.4.0 :001 > "\x80".encoding
 => #<Encoding:UTF-8>
2.4.0 :002 > "\x80".valid_encoding?
 => false
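
To make the encoding point concrete, here is a small MRI sketch (variable names are illustrative) showing that the same byte is rejected as UTF-8 but is a valid one-byte string once it is tagged as ASCII-8BIT. It uses String#b at runtime instead of the magic comment, since the comment only takes effect at the top of a file.

```ruby
# Tag the byte as UTF-8 explicitly so the example does not depend on the
# source encoding of the file it is pasted into.
utf8_byte = "\x80".dup.force_encoding(Encoding::UTF_8)

# 0x80 is a continuation byte in UTF-8 and cannot stand on its own.
utf8_valid = utf8_byte.valid_encoding?

# String#b returns a copy tagged as ASCII-8BIT (binary), where any byte
# is a valid one-byte character.
binary_byte = utf8_byte.b
binary_valid = binary_byte.valid_encoding?

puts utf8_valid
puts binary_valid
puts binary_byte.bytes.inspect
```

The bytes are identical in both strings; only the encoding tag, and therefore the validity check, differs.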

ggrossetie commented 6 years ago

Thanks @iliabylich for your input.

The code is from ttfunk: https://github.com/prawnpdf/ttfunk/blob/086b3126b13d207abf992279bef9b7699af8ae32/lib/ttfunk/table/name.rb#L20

I believe \x80-\xFF are non-ASCII character ranges: http://www.unicode.org/charts/PDF/U0080.pdf (C1 Controls and Latin-1 Supplement)
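
One way to see the difference between the two readings of \x80-\xFF is to compare byte escapes with code point escapes in MRI. In this sketch (names are illustrative), \xHH denotes a raw byte while \uHHHH denotes a Unicode code point, so the Latin-1 Supplement range from the chart is written with \u and matches multibyte UTF-8 characters.

```ruby
# U+0080 is a code point; in UTF-8 it is stored as the two bytes C2 80,
# which is different from the single raw byte written as \x80.
code_point_bytes = "\u0080".bytes

# A character class over the code points U+0080..U+00FF
# (C1 Controls and Latin-1 Supplement).
latin1_supplement = /[\u0080-\u00FF]/

# U+00E9 (e with acute accent) falls inside that block.
e_acute = "\u00E9"

puts code_point_bytes.inspect
puts latin1_supplement.match?(e_acute)
puts latin1_supplement.match?('yo')
puts e_acute.bytes.length
```

Whether the ttfunk pattern is meant to match bytes or code points is not settled in this thread; the sketch only shows that the two notations are not interchangeable in a UTF-8 source file.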

iliabylich commented 6 years ago

@Mogztter Yes, you are right. To parse this file you need to add an encoding comment.

hmdne commented 3 years ago

Related to #2235 - the 5aad139c7fcc92f4b5f7bd4412987843db535698 commit.