Open fniephaus opened 5 years ago
BINARY
is really ASCII-8BIT
, but which interpretation of the 8th bit do you want?
According to Wikipedia, 226 in various 8 bit vaiants could mean any of âтقβגโā
. How do we pick?
MRI doesn't know what you want either:
irb(main):001:0> 226.chr
=> "\xE2"
I think the solution here is to just pass an Encoding to Integer#chr
:
226.chr(Encoding::UTF_8)
Then it should work fine to export it. @fniephaus Could you try that?
For bytes below 128, Integer#chr
uses US-ASCII, but above it uses the BINARY encoding by default, so an explicit encoding needs to be specified in that case.
Yes, that works in GraalVM:
ruby> Polyglot.export('string', 226.chr(Encoding::UTF_8))
"â"
@fniephaus Is that OK for your use case?
Our problem occurs when parsing a rather long text file (118MB) and causes an error. The workaround should do for now, thanks! Nonetheless, the example also shows that TruffleRuby violates the interop contract: interop's isString
on 226.chr
returns true
and then asString
fails. I don't know what the correct behavior is in terms of interop, maybe @chumer could clarify that? In particular, what encoding should interop strings have? I think utf-something would be a reasonable choice.
Java strings don't have an encoding - that's an implementation detail. They do have a character set, which is Unicode.
I see what you mean about the interop contract though - we probably should not return isString
if we can't convert.
Maybe we should insert the replacement character (�) when we are asked to covert character that we can't understand?
Our problem occurs when parsing a rather long text file (118MB) and causes an error.
The encoding of the String should then be set as non-binary (e.g., File.read
instead of File.binread
), otherwise it's not representable as a Java String, only as a byte[]
.
Maybe we should insert the replacement character (�) when we are asked to covert character that we can't understand?
I think almost nobody wants that behavior, here the error reveals a problem when reading the file, so I argue this is behavior as intended.
What about the interop contract though? Maybe binary strings should not be isString
?
@chumer Is it a violation if asString throws?
@eregon IIRC there are assertions for that in TCK. Do you test TruffleRuby against TCK? If isString returns true, asString shouldn't fail.
@fniephaus We do test against the TCK, but probably we don't have an example for a binary Ruby String with non-7-bit characters.
Making binary Ruby String with non-7-bit characters not isString
would make some sense, but that sounds confusing for users. Maybe making all binary Ruby String non-isString
would be easier to understand.
what about converting binary strings to byte[] and return a new Java string for that array? I think it’s also confusion that binary strings cannot be converted to char*/byte[] at all. It might not always makes sense but that’s the user’s problem maybe?
and of course +1 for adding an appropriate example to your TCK setup. :)
what about converting binary strings to byte[] and return a new Java string for that array?
It's not possible to "return a new Java string for that array" without specifying a Charset/Encoding, it's just impossible to know how to interpret the bytes correctly as Chris showed.
A binary Ruby String with non-7-bit characters is not "text" and it cannot be printed nor shown as characters, only as a sequence of bytes. So probably they shouldn't be "polyglot strings".
I think it’s also confusion that binary strings cannot be converted to char*/byte[] at all.
They can, by using String#bytes
, isn't it?
Maybe you are right and the best fix atm is to not return true in isString if it's a binary string. It would certainly fix our use case. Nonetheless, I wonder if interop needs to support different string encodings.
This is an interesting question to revisit now that TruffleString is being worked on. We should check what is the behavior if one tries to get a j.l.String from a TruffleString with a binary encoding and non-7-bit bytes.
Now that there is TruffleString, asTruffleString() "works". asString() still fails, so probably we should do the same as asTruffleString().toString() or so.
irb
irb(main):001:0> Truffle::Interop.as_truffle_string "é".b
=> "??"
irb(main):002:0> Truffle::Interop.as_string "é".b
truffleruby: an internal exception escaped out of the interpreter,
please report it to https://github.com/oracle/truffleruby/issues
Cannot convert a Ruby String with BINARY encoding containing non-US-ASCII character 195 to a Java String (org.truffleruby.core.string.CannotConvertBinaryRubyStringToJavaString)
Repro
Expected Behavior