Closed nirvdrum closed 2 years ago
I see the same issue with a TruffleString with a UTF-8 encoding and ASCII code range matching against a TRegex with a US-ASCII encoding. Perhaps I'm just using exec wrong? The context here is I'm changing TruffleRuby over to using TruffleString. Since we don't currently use TruffleString, we call the execBytes TRegex message. With TruffleString, I thought we could use the exec message instead, but that's not working out the way I'd have thought.
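To make the expectation concrete: when every byte of a string is in the 7-bit ASCII range, the same bytes decode to the same characters under US-ASCII, UTF-8, and ISO-8859-1. The following is a minimal plain-Java sketch of that byte-level fact only. It does not use the TruffleString API, and `AsciiRangeSketch`, `isAsciiOnly`, and `decode` are hypothetical names made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AsciiRangeSketch {
    // Hypothetical helper: true when every byte is in the 7-bit ASCII range.
    static boolean isAsciiOnly(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) { // bytes >= 0x80 are negative as signed Java bytes
                return false;
            }
        }
        return true;
    }

    // Decode the same bytes under the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] bytes = "Flavor=Ruby".getBytes(StandardCharsets.US_ASCII);
        String asAscii = decode(bytes, StandardCharsets.US_ASCII);
        String asUtf8 = decode(bytes, StandardCharsets.UTF_8);
        String asLatin1 = decode(bytes, StandardCharsets.ISO_8859_1);
        // For ASCII-only input, all three interpretations agree.
        System.out.println(isAsciiOnly(bytes));
        System.out.println(asAscii.equals(asUtf8) && asUtf8.equals(asLatin1));
    }
}
```

This only shows why the combination looks harmless at the byte level; whether a given API accepts a string tagged with a different encoding is a separate question.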
Hi, thank you for reporting this. We're going to look into it and get back to you.
Probably one needs a SwitchEncodingNode there, could you confirm @djoooooe? It'll probably be a no-op in most/all cases since we check earlier that the combination of the regexp encoding and the string encoding makes sense and the string isn't broken.
There is definitely a somewhat-related bug in TRegex though, because:
String regex = "Flavor=Ruby,Encoding=" + tRegexEncoding + ignoreAtomicGroups + "/" + processedRegexpSource +
        "/" + flags;
Source regexSource = Source
        .newBuilder("regex", regex, "Regexp")
        .mimeType("application/tregex")
        .internal(true)
        .build();
Object compiledRegex = context.getEnv().parseInternal(regexSource).call();
regex = "Flavor=Ruby,Encoding=BYTES/./"
compiledRegex.source.options=Flavor=Ruby,Encoding=LATIN-1
And that then causes TRegex to try to access the matched string in ISO_8859_1 encoding, but it should be BYTES encoding. cc @jirkamarsik
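For readers unfamiliar with the source string format used above, it is a comma-separated option list, then the pattern between slashes, then the flags. Here is a small sketch of just the string concatenation, with no Truffle dependency. `TRegexSourceSketch` and `buildSource` are hypothetical names, and the empty values for the extra options and flags are assumptions inferred from the printed `regex` value.

```java
public class TRegexSourceSketch {
    // Hypothetical helper mirroring the concatenation in the snippet above:
    // options, then "/" + pattern + "/", then flags.
    static String buildSource(String encoding, String extraOptions, String pattern, String flags) {
        return "Flavor=Ruby,Encoding=" + encoding + extraOptions + "/" + pattern + "/" + flags;
    }

    public static void main(String[] args) {
        // Assumed inputs: BYTES encoding, no extra options, pattern ".", no flags.
        System.out.println(buildSource("BYTES", "", ".", ""));
        // prints: Flavor=Ruby,Encoding=BYTES/./
    }
}
```

The mismatch reported here is between this requested `Encoding=BYTES` and the `Encoding=LATIN-1` that shows up in the compiled regex's options.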
It seems com.oracle.truffle.regex.tregex.string.Encodings.Encoding doesn't have BYTES, only UTF* and US-ASCII. It'd be great if TRegex can support any TruffleString.Encoding, but at the very least we need BYTES.
Probably one of these: https://github.com/oracle/graal/blob/587c31f311b09ba9e398e182b8e3a6bcf832679c/regex/src/com.oracle.truffle.regex/src/com/oracle/truffle/regex/tregex/string/Encodings.java#L81-L83 https://github.com/oracle/graal/blob/587c31f311b09ba9e398e182b8e3a6bcf832679c/regex/src/com.oracle.truffle.regex/src/com/oracle/truffle/regex/RegexOptions.java#L507-L509
Hi, sorry for the late response... I cannot reproduce your issue locally, are you running with the strict encoding check debug option? If so, you would have to use a SwitchEncodingNode before calling into TRegex, as it currently does not coerce strings into the expected encoding.
@djoooooe Yes, @nirvdrum's original issue is addressed by SwitchEncodingNode on the string to match before matching it. OTOH, https://github.com/oracle/graal/issues/4588#issuecomment-1140428808 is a separate issue which happens regardless of the strict encoding check.
@eregon Thanks for the clarification. I already added a fix for https://github.com/oracle/graal/issues/4588#issuecomment-1140428808 to my next PR.
Describe the issue
I'm running into a situation where an AbstractTruffleString#checkEncoding call is failing where I think it should pass. In Ruby, it's somewhat common to take a string and give it a "binary" encoding at I/O boundaries. For example, I'm working with a string that has a TruffleString.Encoding.US_ASCII encoding and a TruffleString.CodeRange.ASCII code range value, but is then converted to a string with a Ruby BINARY encoding, which translates to TruffleString.Encoding.BYTES. checkEncoding will fail saying that US_ASCII and BYTES are not compatible, but they should be compatible based on the code range.
Steps to reproduce the issue
Please include both build steps as well as run steps
TruffleString.CodeRange.ASCII code range and TruffleString.Encoding.BYTES encoding.
exec message
Describe GraalVM and your environment: