oneclick / rubyinstaller2

MSYS2 based RubyInstaller for Windows
https://rubyinstaller.org
BSD 3-Clause "New" or "Revised" License

Ruby chokes on Windows/Russian #348

Closed the-Arioch closed 6 months ago

the-Arioch commented 11 months ago

I wanted to get my feet wet with AsciiDoc. Not sure if i would need Ruby at all, maybe the VSCode extension would be enough. But i thought, better safe than sorry, so i did winget install "ruby 3.2" and tried gem install asciidoc.

...actually, i just tried gem from powershell prompt.

Some background: being "classic" desktop dev i know zilch about Ruby, but can speak of Win32 API on "flat C API" level.

So, here we go:

PS C:\> gem
C:/Ruby32-x64/lib/ruby/3.2.0/rubygems.rb:1342:in `rescue in <top (required)>': U+2014 to IBM866 in conversion from UTF-16LE to UTF-8 to IBM866 (Encoding::UndefinedConversionError)
Loading the C:/Ruby32-x64/lib/ruby/3.2.0/rubygems/defaults/operating_system.rb file caused an error. This file is owned by your OS, not by rubygems upstream. Please find out which OS package this file belongs to and follow the guidelines from your OS to report the problem and ask for help.
        from C:/Ruby32-x64/lib/ruby/3.2.0/rubygems.rb:1328:in `<top (required)>'
        from <internal:gem_prelude>:2:in `require'
        from <internal:gem_prelude>:2:in `<internal:gem_prelude>'
C:/Ruby32-x64/lib/ruby/3.2.0/win32/registry.rb:910:in `encode': U+2014 to IBM866 in conversion from UTF-16LE to UTF-8 to IBM866 (Encoding::UndefinedConversionError)
        from C:/Ruby32-x64/lib/ruby/3.2.0/win32/registry.rb:910:in `export_string'
        from C:/Ruby32-x64/lib/ruby/3.2.0/win32/registry.rb:611:in `each_key'
        from C:/Ruby32-x64/lib/ruby/site_ruby/3.2.0/ruby_installer/runtime/msys2_installation.rb:71:in `block (2 levels) in iterate_msys_paths'
        from C:/Ruby32-x64/lib/ruby/3.2.0/win32/registry.rb:435:in `open'
        from C:/Ruby32-x64/lib/ruby/3.2.0/win32/registry.rb:542:in `open'
        from C:/Ruby32-x64/lib/ruby/site_ruby/3.2.0/ruby_installer/runtime/msys2_installation.rb:70:in `block in iterate_msys_paths'
        from C:/Ruby32-x64/lib/ruby/site_ruby/3.2.0/ruby_installer/runtime/msys2_installation.rb:68:in `each'
        from C:/Ruby32-x64/lib/ruby/site_ruby/3.2.0/ruby_installer/runtime/msys2_installation.rb:68:in `iterate_msys_paths'
        from C:/Ruby32-x64/lib/ruby/site_ruby/3.2.0/ruby_installer/runtime/msys2_installation.rb:102:in `msys_path'
        from C:/Ruby32-x64/lib/ruby/site_ruby/3.2.0/ruby_installer/runtime/msys2_installation.rb:115:in `mingw_bin_path'
        from C:/Ruby32-x64/lib/ruby/site_ruby/3.2.0/ruby_installer/runtime/msys2_installation.rb:125:in `enable_dll_search_paths'
        from C:/Ruby32-x64/lib/ruby/site_ruby/3.2.0/ruby_installer/runtime/singleton.rb:27:in `enable_dll_search_paths'
        from C:/Ruby32-x64/lib/ruby/3.2.0/rubygems/defaults/operating_system.rb:24:in `<top (required)>'
        from C:/Ruby32-x64/lib/ruby/3.2.0/rubygems.rb:1332:in `require'
        from C:/Ruby32-x64/lib/ruby/3.2.0/rubygems.rb:1332:in `<top (required)>'
        from <internal:gem_prelude>:2:in `require'
        from <internal:gem_prelude>:2:in `<internal:gem_prelude>'

I have Git on my pc, which works like a charm being built with the said MSYS2 runtime, so the problem is not there.

U+2014 is EM DASH and of course it can be reduced to the DOS codepage as a plain 7-bit ASCII "minus", U+002D, as it always was in pre-IBM-PC times. That said, i am not sure the conversion is needed at all.
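The failure and the reduction described above can be reproduced in plain Ruby, without touching the registry. This is only a minimal sketch: it assumes the `fallback:` option of `String#encode` is an acceptable way to map the em dash to a hyphen-minus, and the helper name `to_oem866` and the sample string are hypothetical, not part of RubyInstaller.

```ruby
# Hypothetical helper: convert text to IBM866, reducing characters the
# codepage lacks (here only the em dash) instead of raising.
OEM_FALLBACK = { "\u2014" => "-" }.freeze

def to_oem866(text)
  text.encode(Encoding::IBM866, fallback: OEM_FALLBACK)
end

name = "Some App \u2014 Setup" # made-up name containing U+2014

begin
  name.encode(Encoding::IBM866) # strict conversion, as in the trace
rescue Encoding::UndefinedConversionError => e
  puts "strict conversion failed: #{e.class}"
end

puts to_oem866(name) # the em dash is now a plain hyphen-minus
```

Any other character missing from IBM866 would still raise here, because the fallback table only knows about U+2014.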

Well, i tried to read the code...

msys2_installation.rb

      ].each do |reg_root, base_key|
        begin
          reg_root.open(backslachs(base_key)) do |reg|

If i read the diagnostic correctly, this is where it chokes.

There is if subreg['DisplayName'] =~ /^MSYS2 / later, but it feels like it never gets there.

For example i have VSCode installed (HKEY_CURRENT_USER\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\{771FD6B0-FA20-440A-A002-3B3BAC16DC50}_is1) and i have Python (HKEY_CURRENT_USER\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\{3d45edf4-44bb-483f-9e08-43c38c81e118}) with DisplayName set as Python 3.11.4 (64-bit) and even Ruby itself has dashes in the name HKEY_CURRENT_USER\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\RubyInstaller-3.2-x64-mingw-ucrt_is1

Here is the access log, but nothing feels wrong there. Probably the Ruby RTL first caches the dataset from the registry, then iterates over that dataset and converts it to strings.

screenshot

Now...

ibm866 is what GetOEMCP returns, or CP_OEMCP in Windows terms: a TUI (text user interface) charset intended for non-graphic Windows apps. So the idea of converting to it is generally wise, but in this specific place it feels misplaced.

Most of Windows API is UTF-16 based.

Like i said, i know zilch about Ruby but quick googling suggests Ruby string variables can have any charset at will: https://ruby-doc.org/core-2.5.3/String.html

Then WHY would anyone convert it there rather than keeping them UTF-16LE ???

First, you sorta-kinda can switch the user interface to UTF-8, albeit with caveats:

However, further reading the code suggests you care not about user interaction there at all, you only need il=subreg['InstallLocation']. Now, indeed, folder paths CAN be full unicode and be thus inaccessible from classic, pre-unicode applications. It is bad style, but it IS possible, technically.

So the proper question, i guess, would be: WHY leave the UTF-16 realm and reduce the strings to IBM866 at all? The next step you would most probably take is converting them back to UTF-16 so you can call the file I/O API: opening files, enumerating folders, etc.
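To make the round-trip argument concrete, here is a small sketch of how the two conversions behave. The sample string is made up; it combines an em dash with Cyrillic text that IBM866 can represent.

```ruby
# A name as a wide-character Windows API would hand it over (UTF-16LE).
# "\u0420\u0443\u0441\u0441\u043A\u0438\u0439" is the Russian word for "Russian".
wide = "Tool \u2014 \u0420\u0443\u0441\u0441\u043A\u0438\u0439".encode(Encoding::UTF_16LE)

# UTF-16LE <-> UTF-8 is lossless for any valid string.
utf8 = wide.encode(Encoding::UTF_8)
puts utf8.encode(Encoding::UTF_16LE) == wide

# Going down to the OEM codepage is lossy and may raise.
begin
  wide.encode(Encoding::IBM866)
rescue Encoding::UndefinedConversionError => e
  puts "IBM866 conversion failed: #{e.class}"
end
```

Staying in UTF-8 or UTF-16LE keeps the path usable for a later wide-character API call, while the detour through the OEM codepage can fail before the path is ever used.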

So...

P.S. i did a RegEdit search and it appears i do not have "MSYS2" anywhere in my registry. I guess it is only different for the MSYS2 developers themselves. So, basically, Ruby fails fatally while attempting a search that is guaranteed to return an empty set on 99% of computers... :-/
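One way a search like this could tolerate unrelated keys is sketched below. This is not RubyInstaller's code: the registry is simulated with an in-memory array, and every key name, display name and path in it is invented for illustration. It only shows the idea that a key which cannot be converted should be skipped rather than abort the whole lookup.

```ruby
# Simulated uninstall entries: [subkey name in UTF-16LE, values].
# All data here is made up for illustration.
ENTRIES = [
  ["Some App \u2014 Setup".encode(Encoding::UTF_16LE),
   { "DisplayName" => "Some App" }],
  ["{hypothetical-msys2-key}".encode(Encoding::UTF_16LE),
   { "DisplayName" => "MSYS2 64bit", "InstallLocation" => "C:\\msys64" }]
].freeze

# Returns the InstallLocation of the first MSYS2 entry, skipping any
# subkey whose name cannot be converted to the target encoding.
def find_msys2_location(entries, enc)
  entries.each do |name, values|
    begin
      name.encode(enc)
    rescue Encoding::UndefinedConversionError
      next # an unrelated key must not abort the whole search
    end
    return values["InstallLocation"] if values["DisplayName"] =~ /^MSYS2 /
  end
  nil
end

puts find_msys2_location(ENTRIES, Encoding::IBM866)
```

On a machine with no MSYS2 entry at all, the same function simply returns nil instead of raising.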

P.P.S. i tried to guesstimate what on Earth coerces Ruby into doing the unneeded string conversions there, and my eye stumbled on an obvious typo (the sword swing is "slaSHing", not "slaCHing"):

# ridk_use.rb

def backslachs(path)
  path.gsub("/", "\\")
end

# msys2_installation.rb

    private def backslachs(path)
      path.gsub("/", "\\")
    end

If i understand it correctly, this is https://ruby-doc.org/core-2.5.3/String.html#method-i-gsub

Well, again, nothing there hints at any pre-configured and fixed string charset, so i still fail to grasp why that fragile and redundant conversion ever kicks in in the first place...

the-Arioch commented 11 months ago

or this, in registry.rb

    def export_string(str, enc = Encoding.default_internal || LOCALE) # :nodoc:
      str.encode(enc)
    end

hence

    def each_key
      index = 0
      while true
        begin
          subkey, wtime = API.EnumKey(@hkey, index)
        rescue Error
          break
        end
        subkey = export_string(subkey)
        yield subkey, wtime

and

    def each_value
      index = 0
      while true
        begin
          subkey = API.EnumValue(@hkey, index)
        rescue Error
          break
        end
        subkey = export_string(subkey)

Now, the key thing is probably that default parameter, which the callers above never pass: enc = Encoding.default_internal || LOCALE
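How that default expression picks the target encoding can be sketched in isolation. The function names below are hypothetical, and the locale value IBM866 is an assumption taken from the error message in the trace; both inputs are passed explicitly so the example behaves the same on any machine.

```ruby
# Mirrors the default-argument expression from export_string, with both
# inputs passed explicitly instead of read from global state.
def export_encoding(default_internal, locale)
  default_internal || locale
end

# Same body as export_string, but with the encoding always given.
def export_string_sketch(str, enc)
  str.encode(enc)
end

subkey = "App \u2014 1.0".encode(Encoding::UTF_16LE) # made-up key name

# Case 1: default_internal is nil, so the locale (assumed IBM866) wins.
enc = export_encoding(nil, Encoding::IBM866)
begin
  export_string_sketch(subkey, enc)
rescue Encoding::UndefinedConversionError
  puts "#{enc}: conversion fails"
end

# Case 2: a non-nil default_internal takes precedence over the locale.
enc = export_encoding(Encoding::UTF_8, Encoding::IBM866)
puts "#{enc}: #{export_string_sketch(subkey, enc).valid_encoding?}"
```

So the locale only matters while Encoding.default_internal is nil, which is the usual state when no -E option is given.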

The Encoding docs say a few interesting things that i can not quite comprehend.

::default_internal is initialized by the source file's internal_encoding or -E option.

and

The locale encoding (__ENCODING__), not ::default_internal, is used as the encoding of created strings.

I wonder if it can be made to "just work" by switching it to UTF-8 or UTF-16. However, Google says

I can not know what this "theory" would mean in practice given all the legacy code...

the-Arioch commented 11 months ago

Well, the "-E" option is as good as nonexistent

I was thinking about just modifying the "gem.cmd" and call it Hail Mary day, but no luck.

Feels like dead-end on my part (short of removing that loop altogether).

The docs seem to suggest that overriding the global setting is possible in the sources, but WHERE to do it safely, if that is even possible at all, is above my level.
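For what it is worth, overriding the global setting from Ruby code looks roughly like the sketch below. The helper `with_default_internal` is hypothetical, and whether something like this is safe inside RubyInstaller's startup code is exactly the open question; the sketch only shows the mechanism and restores the previous value afterwards.

```ruby
# Hypothetical helper: set Encoding.default_internal for the duration of
# a block and restore the previous value afterwards.
def with_default_internal(encoding)
  previous = Encoding.default_internal
  Encoding.default_internal = encoding
  yield
ensure
  Encoding.default_internal = previous
end

wide = "App \u2014 1.0".encode(Encoding::UTF_16LE) # made-up key name

# String#encode without arguments targets Encoding.default_internal.
converted = with_default_internal(Encoding::UTF_8) { wide.encode }
puts converted.encoding
```

With UTF-8 as the internal encoding, a default like Encoding.default_internal || LOCALE would never fall through to the OEM codepage.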

From abstract common sense it should be OK for Ruby internals to just run full Unicode inside the "OS API" perimeter, but who knows.