rcmaehl / MSEdgeRedirect

A Tool to Redirect News, Search, Widgets, Weather and More to Your Default Browser
https://MSEdgeRedirect.com
GNU Lesser General Public License v3.0

Error Opening UTF-8 encoded URL #258

Closed: albertlotw closed this issue 1 year ago

albertlotw commented 1 year ago

Preflight Checklist

Install Type

New Deployment (Chocolatey, Winget, Etc)

Install Mode

Active Mode

Steps to reproduce

Click a topic link under Windows 10 Start -> Settings that opens a UTF-8 encoded URL containing Traditional Chinese characters. Here is an excerpt from AppGeneral.log:

2023/03/25 02:46:55 - Redirected Edge Call:
Method: Windows.Protocol 
url: https:%2F%2Fwww.bing.com%2Fsearch?q%3D%E6%9B%B4%E6%94%B9%E6%96%87%E5%AD%97%E5%A4%A7%E5%B0%8F%20windows%2010%26form%3DB00032%26ocid%3DSettingsHAQ-BingIA%26mkt%3Dzh-TW

✔️ Expected Behavior

The opened URL should be "https://www.bing.com/search?q=%E6%9B%B4%E6%94%B9%E6%96%87%E5%AD%97%E5%A4%A7%E5%B0%8F%20windows%2010&form=B00032&ocid=SettingsHAQ-BingIA&mkt=zh-TW"

❌ Actual Behavior

The opened page is "https://www.bing.com/search?q=??????????????????%20windows%2010&form=B00032&ocid=SettingsHAQ-BingIA&mkt=zh-TW"

Microsoft Windows version

"22H2 build 19045.2728 Traditional Chinese"

Other Software

No response

albertlotw commented 1 year ago

I compiled and debugged the project myself and found a problem in the function _UnicodeURLDecode(). On my system, the function turns some '%xx' escapes into question marks '?' rather than the corresponding byte when 'xx' is greater than 0x80. _UnicodeURLDecode() uses chr() + StringToBinary() to generate the bytes. A closer investigation showed that chr() cannot produce every single-byte value under certain Windows language settings (e.g. Traditional Chinese, code page 950). I checked the AutoIt v3 statement $asc=asc(chr($c)). The resulting value $asc is always 63 (0x3F), the ASCII code of the question mark, when $c ranges from 0x81 to 0xFE. I ran the same code on an English Windows 10 and the issue does not occur there: the resulting $asc is always equal to $c.
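The effect described above can be imitated outside AutoIt. The sketch below is an analogy in Python, not the AutoIt or Windows code path: it assumes that chr() followed by asc() behaves like decoding a single byte with the system ANSI code page and encoding it back, and it uses Python's own cp1252 and cp950 codecs, which are not the Windows conversion routines. The helper name ansi_round_trip is hypothetical.

```python
def ansi_round_trip(byte_value: int, code_page: str) -> bytes:
    """Decode one byte with the given code page, then encode it back.

    Rough stand-in for asc(chr($c)); unmappable data is replaced, not raised.
    """
    text = bytes([byte_value]).decode(code_page, errors="replace")
    return text.encode(code_page, errors="replace")


if __name__ == "__main__":
    # Compare a single-byte Western code page with the double-byte code page 950.
    for code_page in ("cp1252", "cp950"):
        result = ansi_round_trip(0xE6, code_page)
        print(code_page, hex(0xE6), "->", result)
```

Under a single-byte Western code page, 0xE6 is a complete character and survives the trip. Under code page 950 it is only the first half of a two-byte character, so a lone 0xE6 cannot be decoded and comes back as a question mark.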

I tried to fix the issue, and the following code works for me. Instead of chr() + StringToBinary(), the modified code uses binary() to generate the bytes. The StringReplace() call should be fine to run after decoding the UTF-8.

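Func _UnicodeURLDecode($toDecode)
    Local $strChar = "", $iOne, $iTwo
    ; Split the input into single characters; $aryHex[0] holds the count
    Local $aryHex = StringSplit($toDecode, "")
    For $i = 1 To $aryHex[0]
        If $aryHex[$i] = "%" Then
            ; Percent escape: copy the two hex digits that follow as-is
            $i += 1
            $iOne = $aryHex[$i]
            $i += 1
            $iTwo = $aryHex[$i]
            $strChar = $strChar & $iOne & $iTwo
        Else
            ; Literal character: append its code as two hex digits
            $strChar = $strChar & StringRight(Hex(Asc($aryHex[$i])), 2)
        EndIf
    Next
    ; Build the bytes straight from the hex string, avoiding Chr()
    Local $Process = Binary("0x" & $strChar)
    ; Flag 4 decodes the bytes as UTF-8
    Local $DecodedString = BinaryToString($Process, 4)
    Return StringReplace($DecodedString, "+", " ")
EndFunc   ;==>_UnicodeURLDecode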

Running the modified code, I get the correct result.
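The idea behind the fix, collecting raw bytes first and decoding them as UTF-8 only once at the end, is independent of AutoIt. A minimal Python sketch under that assumption follows; the function name percent_decode_utf8 is hypothetical, and the sketch is not a port of the code above. Unlike the AutoIt version, it turns "+" into a space while scanning rather than after decoding, so an escaped plus sign (%2B) stays a plus.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def percent_decode_utf8(text: str) -> str:
    """Percent-decode text by building a byte buffer, then decoding it as UTF-8."""
    raw = bytearray()
    i = 0
    while i < len(text):
        pair = text[i + 1:i + 3]
        if text[i] == "%" and len(pair) == 2 and set(pair) <= HEX_DIGITS:
            # Escape: store the byte value itself; no character conversion yet.
            raw.append(int(pair, 16))
            i += 3
        elif text[i] == "+":
            # Form-style space, handled before any escape is decoded.
            raw.append(0x20)
            i += 1
        else:
            # Literal character: store its UTF-8 bytes unchanged.
            raw.extend(text[i].encode("utf-8"))
            i += 1
    # A single UTF-8 decode at the end reassembles multi-byte characters.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(percent_decode_utf8("q=%E6%9B%B4%E6%94%B9%20windows%2010"))
```

Because no byte passes through a locale-dependent character conversion before the final decode, the result does not depend on the system code page.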

rcmaehl commented 1 year ago

I'll double-check this code and merge it in if everything looks good.

rcmaehl commented 1 year ago

Please try the following code on your end. It should be a cleaner solution and is working on my end.

Func _UnicodeURLDecode($sData)
    Local $aData = StringSplit(StringReplace($sData,"+"," ",0,1),"%")
    $sData = ""
    For $i = 2 To $aData[0]
        $aData[1] &= Chr(Dec(StringLeft($aData[$i],2))) & StringTrimLeft($aData[$i],2)
    Next
    Return BinaryToString(StringToBinary($aData[1],1),4)
EndFunc

albertlotw commented 1 year ago

Since the code above still produces a binary string via Chr() + StringToBinary(), it still does not work on my side, because Chr($c) yields a question mark whenever $c is between 0x81 and 0xFE. I think we have to implement the function without using Chr().

The testing code:

$url = "https://www.bing.com/search?q=%E6%9B%B4%E6%94%B9%E6%96%87%E5%AD%97%E5%A4%A7%E5%B0%8F%20windows%2010&form=B00032&ocid=SettingsHAQ-BingIA&mkt=zh-TW"
ConsoleWrite(_UnicodeURLDecode($url));

The result:

https://www.bing.com/search?q=?????????????????? windows 10&form=B00032&ocid=SettingsHAQ-BingIA&mkt=zh-TW

rcmaehl commented 1 year ago

Whoops. Please replace Chr with ChrW

Also note that ConsoleWrite may have encoding issues because reasons.

albertlotw commented 1 year ago

Whoops. Please replace Chr with ChrW

I think you mean: $aData[1] &= ChrW(Dec(StringLeft($aData[$i],2))) & StringTrimLeft($aData[$i],2). The result is a little weird. I have not yet checked how ChrW() behaves on my side.

https://www.bing.com/search?q=a???a?1a??a-?a???a?X? windows 10&form=B00032&ocid=SettingsHAQ-BingIA&mkt=zh-TW

Also note that ConsoleWrite may have encoding issues because reasons.

Yes, I also ran ConsoleWrite on the output of the _UnicodeURLDecode version that uses Binary(), and the console output is correct.

albertlotw commented 1 year ago

The following seems okay on my system. But I don't think it is more readable than the original one.

Func _UnicodeURLDecode($sData)
    Local $aData = StringSplit(StringReplace($sData,"+"," ",0,1),"%")
    $aData[1] = Binary($aData[1])
    For $i = 2 To $aData[0]
        $aData[1] &= StringLeft($aData[$i],2) & StringTrimLeft(Binary(StringTrimLeft($aData[$i],2)),2)
    Next
    Return BinaryToString($aData[1],4)
EndFunc

albertlotw commented 1 year ago

I've checked the behavior of ChrW() in my environment. The following code

$c = Chr(0x81)
$cw = ChrW(0x81)
ConsoleWrite("Character Code $c: 0x" & Hex(Asc($c)) & @CRLF)
ConsoleWrite("Character Code $cw: 0x" & Hex(AscW($cw)) & @CRLF)
ConsoleWrite("$cw After StringToBinary : " & StringToBinary($cw) & @CRLF)

gets the output

Character Code $c: 0x0000003F
Character Code $cw: 0x00000081
$cw After StringToBinary : 0x3F

In contrast to Chr(), which returns a question mark directly, ChrW() can produce a character code greater than 0x80 in my environment, but the character still becomes a question mark after StringToBinary(). ChrW() returns a wide-character string, and StringToBinary($cw) converts that string into an ANSI-encoded one, which is code page dependent. My system's code page (950) does not define a character corresponding to U+0081, so I get a question mark after StringToBinary().
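The difference between the two routes can be sketched in Python as an analogy. It assumes that StringToBinary() in its default mode acts like encoding text with the ANSI code page, and Python's cp950 codec stands in for Windows code page 950 without being identical to it.

```python
# U+0081 as a wide character, comparable to ChrW(0x81).
wide_char = chr(0x81)

# Code page 950 has no byte for this character, so the encoder substitutes one.
via_code_page = wide_char.encode("cp950", errors="replace")

# Building the byte directly never consults a code page.
direct_byte = bytes([0x81])

print("via code page 950:", via_code_page)
print("direct byte:", direct_byte)
```

This matches the observation that the Binary() based variants work on this system: their bytes never pass through a code page before the UTF-8 decode.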

rcmaehl commented 1 year ago

I've swapped to using WinAPI functions for this. Please try the latest test build:

https://github.com/rcmaehl/MSEdgeRedirect/suites/12006614059/artifacts/630536670

albertlotw commented 1 year ago

Yes, this one is working for me. Thanks.

rcmaehl commented 1 year ago

Great!