oracle / graal

GraalVM compiles Java applications into native executables that start instantly, scale fast, and use fewer compute resources 🚀
https://www.graalvm.org
Other
20.34k stars 1.63k forks source link

TruffleString: BYTES with CodeRange.ASCII incompatible with US_ASCII encoding #4588

Closed nirvdrum closed 2 years ago

nirvdrum commented 2 years ago

Describe the issue

I'm running into a situation where a AbstractTruffleString#checkEncoding call is failing where I think it should pass. In Ruby, it's somewhat common to take a string and give it a "binary" encoding at I/O boundaries. For example, I'm working with a string that has a TruffleString.Encoding.US_ASCII encoding and a TruffleString.CodeRange.ASCII code range value, but is then converted to a string with a Ruby BINARY encoding, which translates to TruffleString.Encoding.BYTES. checkEncoding will fail saying that US_ASCII and BYTES are not compatible, but they should be compatible based on the code range.

Steps to reproduce the issue Please include both build steps as well as run steps

  1. Create a TruffleString object with a TruffleString.CodeRange.ASCII code range and TruffleString.CodeRange.BYTES encoding.
  2. Create a TRexec object with an ASCII encoding.
  3. Pass that created string as the match string object and the created regex to the TRegex exec message

Describe GraalVM and your environment:

nirvdrum commented 2 years ago

I see the same issue with a TruffleString with a UTF-8 encoding and ASCII code range matching against a TRegex with a US-ASCII encoding. Perhaps I'm just using exec wrong? The context here is I'm changing TruffleRuby over to using TruffleString. Since we don't currently use TruffleString, we call the execBytes TRegex message. With TruffleString, I thought we could use the exec message instead, but that's not working out the way I'd have thought.

oubidar-Abderrahim commented 2 years ago

Hi, thank you for reporting this, we're going to take a look into this and get back to you

eregon commented 2 years ago

Probably one needs a SwitchEncodingNode there, could you confirm @djoooooe? It'll probably noop in most/all cases since we check earlier that the combination of the regexp encoding and the string encoding makes sense and the string isn't broken.

eregon commented 2 years ago

There is definitely a somewhat-related bug in TRegex though, because:

        String regex = "Flavor=Ruby,Encoding=" + tRegexEncoding + ignoreAtomicGroups + "/" + processedRegexpSource +
                "/" + flags;
        Source regexSource = Source
                .newBuilder("regex", regex, "Regexp")
                .mimeType("application/tregex")
                .internal(true)
                .build();
        Object compiledRegex = context.getEnv().parseInternal(regexSource).call();
regex = "Flavor=Ruby,Encoding=BYTES/./"
compiledRegex.source.options=Flavor=Ruby,Encoding=LATIN-1

And that then causes TRegex to try to access the matched string in ISO_8859_1 encoding, but it should be BYTES encoding. cc @jirkamarsik

It seems com.oracle.truffle.regex.tregex.string.Encodings.Encoding doesn't have BYTES, only UTF* and US-ASCII. It'd be great if TRegex can support any TruffleString.Encoding, but at the very least we need BYTES.

eregon commented 2 years ago

Probably one of these: https://github.com/oracle/graal/blob/587c31f311b09ba9e398e182b8e3a6bcf832679c/regex/src/com.oracle.truffle.regex/src/com/oracle/truffle/regex/tregex/string/Encodings.java#L81-L83 https://github.com/oracle/graal/blob/587c31f311b09ba9e398e182b8e3a6bcf832679c/regex/src/com.oracle.truffle.regex/src/com/oracle/truffle/regex/RegexOptions.java#L507-L509

djoooooe commented 2 years ago

Hi, sorry for the late response... I cannot reproduce your issue locally, are you running with the strict encoding check debug option? If so, you would have to use a SwitchEncodingNode before calling into TRegex, as it currently does not coerce strings into the expected encoding.

eregon commented 2 years ago

@djoooooe Yes, @nirvdrum's original issue is addressed by SwitchEncodingNode on the string to match before matching it. OTOH, https://github.com/oracle/graal/issues/4588#issuecomment-1140428808 is a separate issue which happens regardless of the strict encoding check.

djoooooe commented 2 years ago

@eregon Thanks for the clarification. I already added a fix for https://github.com/oracle/graal/issues/4588#issuecomment-1140428808 to my next PR.