oracle / truffleruby

A high performance implementation of the Ruby programming language, built on GraalVM.
https://www.graalvm.org/ruby/
Other
3k stars 184 forks source link

[interop] Cannot convert a Ruby String with BINARY encoding containing non-US-ASCII character 226 to a Java String #1722

Open fniephaus opened 5 years ago

fniephaus commented 5 years ago

Repro

GraalVM MultiLanguage Shell 19.0.0
Copyright (c) 2013-2019, Oracle and/or its affiliates
  JavaScript version 19.0.0
  Ruby version 2.6.2
js> ruby>
ruby> Polyglot.export('string', 226.chr)
Internal error occured: org.graalvm.polyglot.PolyglotException: org.truffleruby.language.control.RaiseException: Cannot convert a Ruby String with BINARY encoding containing non-US-ASCII character 226 to a Java String (CannotConvertBinaryRubyStringToJavaString)
    from org.truffleruby.core.rope.RopeOperations.decodeNonAscii(RopeOperations.java:167)
    from org.truffleruby.core.rope.RopeOperations.decodeRopeSegment(RopeOperations.java:223)
    from org.truffleruby.core.rope.RopeOperations.decodeRopeSegment(RopeOperations.java:216)
    from org.truffleruby.core.rope.RopeOperations.decodeRope(RopeOperations.java:212)
    from org.truffleruby.core.string.StringOperations.getString(StringOperations.java:60)
    from org.truffleruby.interop.ToJavaStringNodeGen.executeAndSpecialize(ToJavaStringNodeGen.java:186)
    from org.truffleruby.interop.ToJavaStringNodeGen.executeToJavaString(ToJavaStringNodeGen.java:96)
    from org.truffleruby.interop.RubyToForeignNode.convertString(RubyToForeignNode.java:31)
    from org.truffleruby.interop.RubyToForeignNodeGen.executeAndSpecialize(RubyToForeignNodeGen.java:56)
    from org.truffleruby.interop.RubyToForeignNodeGen.executeConvert(RubyToForeignNodeGen.java:40)
Translated to internal error (RuntimeError)
Run with --verbose to see the full stack trace.

Expected Behavior

GraalVM MultiLanguage Shell 19.0.0
Copyright (c) 2013-2019, Oracle and/or its affiliates
  JavaScript version 19.0.0
  Ruby version 2.6.2
js> Polyglot.export('string', String.fromCharCode(226));
â
chrisseaton commented 5 years ago

BINARY is really ASCII-8BIT, but which interpretation of the 8th bit do you want?

According to Wikipedia, 226 in various 8 bit vaiants could mean any of âтقβגโā. How do we pick?

chrisseaton commented 5 years ago

MRI doesn't know what you want either:

irb(main):001:0> 226.chr
=> "\xE2"
eregon commented 5 years ago

I think the solution here is to just pass an Encoding to Integer#chr:

226.chr(Encoding::UTF_8)

Then it should work fine to export it. @fniephaus Could you try that?

For bytes below 128, Integer#chr uses US-ASCII, but above it uses the BINARY encoding by default, so an explicit encoding needs to be specified in that case.

eregon commented 5 years ago

Yes, that works in GraalVM:

ruby> Polyglot.export('string', 226.chr(Encoding::UTF_8))
"â"

@fniephaus Is that OK for your use case?

fniephaus commented 5 years ago

Our problem occurs when parsing a rather long text file (118MB) and causes an error. The workaround should do for now, thanks! Nonetheless, the example also shows that TruffleRuby violates the interop contract: interop's isString on 226.chr returns true and then asString fails. I don't know what the correct behavior is in terms of interop, maybe @chumer could clarify that? In particular, what encoding should interop strings have? I think utf-something would be a reasonable choice.

chrisseaton commented 5 years ago

Java strings don't have an encoding - that's an implementation detail. They do have a character set, which is Unicode.

I see what you mean about the interop contract though - we probably should not return isString if we can't convert.

Maybe we should insert the replacement character (�) when we are asked to covert character that we can't understand?

eregon commented 5 years ago

Our problem occurs when parsing a rather long text file (118MB) and causes an error.

The encoding of the String should then be set as non-binary (e.g., File.read instead of File.binread), otherwise it's not representable as a Java String, only as a byte[].

Maybe we should insert the replacement character (�) when we are asked to covert character that we can't understand?

I think almost nobody wants that behavior, here the error reveals a problem when reading the file, so I argue this is behavior as intended.

chrisseaton commented 5 years ago

What about the interop contract though? Maybe binary strings should not be isString?

eregon commented 5 years ago

@chumer Is it a violation if asString throws?

fniephaus commented 5 years ago

@eregon IIRC there are assertions for that in TCK. Do you test TruffleRuby against TCK? If isString returns true, asString shouldn't fail.

eregon commented 5 years ago

@fniephaus We do test against the TCK, but probably we don't have an example for a binary Ruby String with non-7-bit characters. Making binary Ruby String with non-7-bit characters not isString would make some sense, but that sounds confusing for users. Maybe making all binary Ruby String non-isString would be easier to understand.

fniephaus commented 5 years ago

what about converting binary strings to byte[] and return a new Java string for that array? I think it’s also confusion that binary strings cannot be converted to char*/byte[] at all. It might not always makes sense but that’s the user’s problem maybe?

fniephaus commented 5 years ago

and of course +1 for adding an appropriate example to your TCK setup. :)

eregon commented 5 years ago

what about converting binary strings to byte[] and return a new Java string for that array?

It's not possible to "return a new Java string for that array" without specifying a Charset/Encoding, it's just impossible to know how to interpret the bytes correctly as Chris showed.

A binary Ruby String with non-7-bit characters is not "text" and it cannot be printed nor shown as characters, only as a sequence of bytes. So probably they shouldn't be "polyglot strings".

I think it’s also confusion that binary strings cannot be converted to char*/byte[] at all.

They can, by using String#bytes, isn't it?

fniephaus commented 5 years ago

Maybe you are right and the best fix atm is to not return true in isString if it's a binary string. It would certainly fix our use case. Nonetheless, I wonder if interop needs to support different string encodings.

eregon commented 3 years ago

This is an interesting question to revisit now that TruffleString is being worked on. We should check what is the behavior if one tries to get a j.l.String from a TruffleString with a binary encoding and non-7-bit bytes.

eregon commented 6 months ago

Now that there is TruffleString, asTruffleString() "works". asString() still fails, so probably we should do the same as asTruffleString().toString() or so.

irb
irb(main):001:0> Truffle::Interop.as_truffle_string "é".b
=> "??"
irb(main):002:0> Truffle::Interop.as_string "é".b

truffleruby: an internal exception escaped out of the interpreter,
please report it to https://github.com/oracle/truffleruby/issues  

Cannot convert a Ruby String with BINARY encoding containing non-US-ASCII character 195 to a Java String (org.truffleruby.core.string.CannotConvertBinaryRubyStringToJavaString)