ruby / strscan

Provides lexical scanning operations on a String.
BSD 2-Clause "Simplified" License

In JRuby, StringScanner.new("") can only hold `Encoding:US-ASCII` encoding. #78

Closed: naitoh closed this issue 9 months ago

naitoh commented 9 months ago

No problem case (Ruby 3.3.0) 🙆‍♂️

$ ruby -v
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin22]
$ gem list strscan

*** LOCAL GEMS ***

strscan (3.0.8, default: 3.0.7)
$ irb
> require 'strscan'
=> true
> s = StringScanner.new("test")
=> #<StringScanner 0/4 @ "test">
> s.rest.encoding
=> #<Encoding:UTF-8>
> s = StringScanner.new("")
=> #<StringScanner fin>
> s.rest.encoding
=> #<Encoding:UTF-8>
> s.string.force_encoding("ASCII-8BIT")
=> ""
> s.rest.encoding
=> #<Encoding:ASCII-8BIT>
> s.string.force_encoding("UTF-8")
=> ""
> s.rest.encoding
=> #<Encoding:UTF-8>

Problem case (JRuby 9.4.5.0) 🙅

$ ruby -v
jruby 9.4.5.0 (3.1.4) 2023-11-02 1abae2700f Java HotSpot(TM) 64-Bit Server VM 25.121-b13 on 1.8.0_121-b13 +jit [x86_64-darwin]
$ gem list strscan

*** LOCAL GEMS ***

strscan (3.0.8 java, default: 3.0.7 java)
$ irb
> require 'strscan'
=> true
> s = StringScanner.new("test")
=> #<StringScanner 0/4 @ "test">
> s.rest.encoding
=> #<Encoding:UTF-8>
> s = StringScanner.new("")
=> #<StringScanner fin>
> s.rest.encoding
=> #<Encoding:US-ASCII>
> s.string.force_encoding("UTF-8")
=> ""
> s.rest.encoding
=> #<Encoding:US-ASCII>

On JRuby, the rest of StringScanner.new("") always reports Encoding:US-ASCII; calling force_encoding on the underlying string has no effect on it.
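The two irb sessions above can be condensed into one script that runs unchanged on either implementation. This is only a sketch: the helper name `empty_rest_encoding` is made up for illustration and is not part of strscan.

```ruby
require "strscan"

# Hypothetical helper: returns the encoding that StringScanner#rest reports
# for an empty scanner after the underlying string's encoding is changed
# in place.
def empty_rest_encoding(encoding)
  source = String.new("", encoding: Encoding::UTF_8) # mutable, starts as UTF-8
  scanner = StringScanner.new(source)
  scanner.string.force_encoding(encoding)
  scanner.rest.encoding
end

[Encoding::UTF_8, Encoding::ASCII_8BIT].each do |encoding|
  actual = empty_rest_encoding(encoding)
  status = actual == encoding ? "ok" : "MISMATCH"
  puts "#{RUBY_ENGINE} #{RUBY_ENGINE_VERSION}: forced #{encoding}, rest reports #{actual} (#{status})"
end
```

Going by the transcripts above, both lines should report "ok" on Ruby 3.3.0 and "MISMATCH" on an affected JRuby.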

This causes the following difference in behavior.

Ruby 3.3.0:

> s = StringScanner.new("")
=> #<StringScanner fin>
> s.string = s.rest + "test"
=> "test"
> s.rest.encoding
=> #<Encoding:UTF-8>

JRuby 9.4.5.0:

> s = StringScanner.new("")
=> #<StringScanner fin>
> s.string = s.rest + "test"
=> "test"
> s.rest.encoding
=> #<Encoding:US-ASCII>
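
A likely explanation for the US-ASCII result above, which the thread does not state, is Ruby's encoding-compatibility rule for concatenation. When the left operand is empty and the right operand is ASCII-only, the result keeps the left operand's encoding. The sketch below reproduces this with plain strings and no StringScanner; the variable names are illustrative.

```ruby
# Stand-in for the empty US-ASCII string that the affected JRuby returned
# from StringScanner#rest.
ascii_rest = String.new("", encoding: Encoding::US_ASCII)
utf8_rest  = String.new("", encoding: Encoding::UTF_8)
text       = String.new("test", encoding: Encoding::UTF_8)

# Empty left operand plus ASCII-only right operand keeps the left encoding.
from_ascii = ascii_rest + text
from_utf8  = utf8_rest + text

puts from_ascii.encoding # US-ASCII
puts from_utf8.encoding  # UTF-8
```

If this is the cause, it would also explain why `s << "test"` is unaffected: appending modifies the original UTF-8 string and never goes through the empty string returned by rest.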

The following case appears to be unaffected; both implementations report UTF-8.

Ruby 3.3.0:

> s = StringScanner.new("")
=> #<StringScanner fin>
> s << "test"
=> #<StringScanner 0/4 @ "test">
> s.rest.encoding
=> #<Encoding:UTF-8>

JRuby 9.4.5.0:

> s = StringScanner.new("")
=> #<StringScanner fin>
> s << "test"
=> #<StringScanner 0/4 @ "test">
> s.rest.encoding
=> #<Encoding:UTF-8>

headius commented 9 months ago

I'll look into it.

headius commented 9 months ago

The problem appears to be in StringScanner#rest. When the effective "rest" length is zero, it always creates a new string with encoding "US-ASCII".
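
The diagnosis above can be modelled in a few lines of Ruby. This is a hypothetical model only, not the JRuby extension's Java code and not the patch itself; `rest_buggy` and `rest_fixed` are invented names, and the "fixed" variant only shows the behavior the CRuby transcripts suggest (an empty rest inheriting the source string's encoding).

```ruby
# Hypothetical model of the reported behavior: an empty remainder is built
# as a fresh US-ASCII string, ignoring the source string's encoding.
def rest_buggy(source, pos)
  remaining = source.bytesize - pos
  return String.new("", encoding: Encoding::US_ASCII) if remaining <= 0
  source.byteslice(pos, remaining)
end

# Hypothetical model of the expected behavior: an empty remainder carries
# the same encoding as the string being scanned.
def rest_fixed(source, pos)
  remaining = source.bytesize - pos
  return String.new("", encoding: source.encoding) if remaining <= 0
  source.byteslice(pos, remaining)
end

source = String.new("test", encoding: Encoding::UTF_8)
puts rest_buggy(source, 4).encoding # US-ASCII
puts rest_fixed(source, 4).encoding # UTF-8
puts rest_fixed(source, 2)          # st
```

The two variants differ only on the zero-length path, which matches the observation that a non-empty rest already reported the correct encoding.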

https://github.com/ruby/strscan/blob/1fbfdd3c6fa4550b47eaa0d014dfb251538dcac8/ext/jruby/org/jruby/ext/strscan/RubyStringScanner.java#L773C31-L773C31

I have a patch in progress that appears to fix the cases provided by @naitoh.

@naitoh Could you create a PR with a few test cases?

naitoh commented 9 months ago

@headius

> I have a patch in progress that appears to fix the cases provided by @naitoh.

Thank you!

> @naitoh Could you create a PR with a few test cases?

I have created #80 which adds a test case.

headius commented 9 months ago

Fixed by #79. Tests in #80 for @mrkn to decide on.