GraalVM version or commit id if built from source: 22.1.0
CE or EE: CE
JDK version: 17
OS and OS Version: macOS 12.3.1
Architecture: amd64 via Rosetta
Describe the issue
TRegex has the ability to dump the automata corresponding to a regex. This is an amazingly useful way to gain insight into what the engine is doing and can help in debugging. However, I find the transition labels to be rather difficult to read in some cases. For example, working with the Ruby regex /a?a?aa/, some of the DFA transitions are labeled as [x00-x60b-x7f].
The primary issues I have reading it are:
- The code point ranges aren't separated by a delimiter, so they blend together.
- Some values are displayed as hexadecimal while others are displayed as the printed character.
The transition labeled [x00-x60b-x7f] would be a lot clearer to me if it were presented as [x00-x60,x62-x7f]. I appreciate there may be some difficulty in using a delimiter if you also print literal comma characters as part of the set. But since "b" is also a hexadecimal digit and code points can span more than two hex digits, I was rather confused by x60b. Additionally, since three of the four range bounds are presented in hexadecimal, showing a literal b character is not ideal. If it were presented as x62, it'd be immediately clear that this set represents every ASCII character except x61 (the letter a). When presented as b, I need to consult an ASCII table separately to really understand what the range is.
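To make the proposed format concrete, here is a minimal Ruby sketch of the kind of label formatting I have in mind. This is not TRegex's actual code: `format_code_point` and `format_ranges` are hypothetical helpers, and representing the set as an array of inclusive `[lo, hi]` pairs is an assumption made purely for illustration.

```ruby
# Hypothetical helpers illustrating the proposed label format.
# They are not part of TRegex or TruffleRuby.

# Always print a code point as hex, never as the literal character.
def format_code_point(cp)
  format('x%02x', cp)
end

# ranges is assumed to be an array of inclusive [lo, hi] code point pairs.
def format_ranges(ranges)
  parts = ranges.map do |lo, hi|
    if lo == hi
      format_code_point(lo)
    else
      "#{format_code_point(lo)}-#{format_code_point(hi)}"
    end
  end
  # Separate ranges with a comma so adjacent bounds cannot blend together.
  "[#{parts.join(',')}]"
end

puts format_ranges([[0x00, 0x60], [0x62, 0x7f]])  # prints [x00-x60,x62-x7f]
```

Because every bound is printed as hex, a literal comma in the set would show up as x2c, so the delimiter stays unambiguous.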
Code snippet or code repository that reproduces the issue
jt ruby -e 'p 100_000.times { /a?a?aa/.match?("aaa") }'
Unfortunately, to actually get the output you currently need to modify TruffleRuby to enable the TRegex DumpAutomata option. I'm going to add a new TruffleRuby option that enables it without having to recompile TruffleRuby.