Open przemoc opened 8 months ago
It seems the "> result.txt" redirect is influenced by PS. You should try cmd.exe instead of PS.
Good luck.
More information and suggestions regarding Code Page issues with PowerShell on Stack Overflow: https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window
For cmd it's even worse:
D:\git\github.com\gunnarmorling\1brc>chcp
Active code page: 437
D:\git\github.com\gunnarmorling\1brc>java --class-path target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CalculateAverage >result-cmd.txt
D:\git\github.com\gunnarmorling\1brc>chcp 65001
Active code page: 65001
D:\git\github.com\gunnarmorling\1brc>java --class-path target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CalculateAverage >result-cmd-65001.txt
D:\git\github.com\gunnarmorling\1brc>ug --hexdump -m1 -o "Ab.ch." result-cmd.txt result-cmd-65001.txt
result-cmd.txt
1:
00000030 41 62 e9 63 68 e9 -- -- -- -- -- -- -- -- -- -- |Ab.ch.----------|
result-cmd-65001.txt
1:
00000030 41 62 e9 63 68 e9 -- -- -- -- -- -- -- -- -- -- |Ab.ch.----------|
No c3 a9 in sight, only 1 byte, which makes ugrep think it is a binary file.
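To make the byte difference concrete, here is a small Python sketch (not part of the original test; the station name is just sample data taken from the hexdump). It assumes the single e9 byte comes from a Latin-1-family encoding such as windows-1252, which is a guess about what the JVM used, not something verified here.

```python
# Sample data: the station name seen in the hexdump, written with escapes
# so the snippet does not depend on the source file's own encoding.
name = "Ab\u00e9ch\u00e9"


def hexdump(data: bytes) -> str:
    """Render bytes as space-separated lowercase hex pairs."""
    return " ".join(f"{b:02x}" for b in data)


utf8_bytes = name.encode("utf-8")      # é becomes two bytes: c3 a9
cp1252_bytes = name.encode("cp1252")   # é becomes one byte: e9 (assumed encoding)

print(hexdump(utf8_bytes))    # 41 62 c3 a9 63 68 c3 a9
print(hexdump(cp1252_bytes))  # 41 62 e9 63 68 e9
```

The second line matches what ended up in both result-cmd.txt and result-cmd-65001.txt, so chcp does not seem to affect what Java writes into the redirected file.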
But thanks to your SO link I realized I should have looked in PS at [console]::InputEncoding and [console]::OutputEncoding, not $OutputEncoding as I did before.
PS D:\git\github.com\gunnarmorling\1brc> [console]::InputEncoding
Preamble :
BodyName : utf-8
EncodingName : Unicode (UTF-8)
HeaderName : utf-8
WebName : utf-8
WindowsCodePage : 1200
IsBrowserDisplay : True
IsBrowserSave : True
IsMailNewsDisplay : True
IsMailNewsSave : True
IsSingleByte : False
EncoderFallback : System.Text.EncoderReplacementFallback
DecoderFallback : System.Text.DecoderReplacementFallback
IsReadOnly : True
CodePage : 65001
PS D:\git\github.com\gunnarmorling\1brc> [console]::OutputEncoding
IsSingleByte : True
EncodingName : OEM United States
WebName : ibm437
HeaderName : ibm437
BodyName : ibm437
Preamble :
WindowsCodePage :
IsBrowserDisplay :
IsBrowserSave :
IsMailNewsDisplay :
IsMailNewsSave :
EncoderFallback : System.Text.InternalEncoderBestFitFallback
DecoderFallback : System.Text.InternalDecoderBestFitFallback
IsReadOnly : False
CodePage : 437
This showed that the console's output encoding is not UTF-8, but whether that is the source of the problem remains to be seen.
I haven't tried turning on the option in control intl.cpl yet, but I tried the following:
PS D:\git\github.com\gunnarmorling\1brc> [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
PS D:\git\github.com\gunnarmorling\1brc> [console]::OutputEncoding
Preamble :
BodyName : utf-8
EncodingName : Unicode (UTF-8)
HeaderName : utf-8
WebName : utf-8
WindowsCodePage : 1200
IsBrowserDisplay : True
IsBrowserSave : True
IsMailNewsDisplay : True
IsMailNewsSave : True
IsSingleByte : False
EncoderFallback : System.Text.EncoderReplacementFallback
DecoderFallback : System.Text.DecoderReplacementFallback
IsReadOnly : False
CodePage : 65001
Retrying the test gives a new flavour of failure:
PS D:\git\github.com\gunnarmorling\1brc> java --class-path target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CalculateAverage >result-oe-utf8.txt
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
PS D:\git\github.com\gunnarmorling\1brc> ug --hexdump -m1 -o "Ab.ch." result-oe-utf8.txt
1:
00000030 41 62 ef bf bd 63 68 ef bf bd -- -- -- -- -- -- |Ab...ch...------|
Instead of c3 a9, we got ef bf bd...
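For what it's worth, ef bf bd is the UTF-8 encoding of U+FFFD, the replacement character. The Python sketch below reproduces the dump under one assumption that is not confirmed in this thread: Java still emits é as the single byte e9, and PowerShell decodes that stream as UTF-8 (the new console output encoding) before writing the file as UTF-8.

```python
# Bytes as seen in the cmd.exe hexdump: "Ab", e9, "ch", e9.
raw = bytes.fromhex("4162e96368e9")

# A lone e9 is not valid UTF-8, so a lenient decoder substitutes U+FFFD.
decoded = raw.decode("utf-8", errors="replace")

# Writing the decoded text back out as UTF-8 turns each U+FFFD into ef bf bd.
reencoded = decoded.encode("utf-8")

print(reencoded.hex(" "))  # 41 62 ef bf bd 63 68 ef bf bd
```

If that is what happens, the original byte is already lost by the time the file is written, so the file cannot be repaired afterwards.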
@przemoc Were you able to get it resolved? I am running into the same issues you described above. Git bash does not work properly either for me and I don't want to use WSL for this.
I did my first 2 days on PowerShell - nothing special around file generation, except that Java recognizes that it is on Windows and outputs CRLF line endings.
Turns out this breaks many submissions that assume single-byte line endings. While it is fixable with Java args, in the end I checked out the repo under WSL and used IDEA remoting with a WSL backend, which worked quite decently.
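A rough illustration of the line-ending problem, in Python rather than Java and with made-up sample rows: a parser that splits only on \n keeps a stray \r at the end of every value when the file was generated with CRLF.

```python
# Hypothetical two-row measurements file written with Windows line endings.
data = b"Hamburg;12.0\r\nBulawayo;8.9\r\n"

# Splitting on a single LF byte leaves the CR attached to each line.
naive = [line for line in data.split(b"\n") if line]

# bytes.splitlines() treats \n, \r and \r\n all as line boundaries.
tolerant = data.splitlines()

print(naive)     # [b'Hamburg;12.0\r', b'Bulawayo;8.9\r']
print(tolerant)  # [b'Hamburg;12.0', b'Bulawayo;8.9']
```

Code that computes field offsets from the newline position is then off by one byte per line, which would be enough to break a hand-rolled number parser.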
@przemoc Were you able to get it resolved? I am running into the same issues you described above. Git bash does not work properly either for me and I don't want to use WSL for this.
No, I didn't spend more time on this and didn't get it resolved.
Disclaimer: I haven't played with Java for ~18 years, so maybe I'm doing something wrong.
I changed code page to 65001 (UTF-8) and set JAVA_TOOL_OPTIONS to -Dfile.encoding=UTF8 hoping it could improve the situation, but it didn't change anything (originally I tested without those steps).
Can someone explain why there is ce 98 for é instead of c3 a9?
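One possible explanation, offered as a guess rather than something confirmed in this thread: the byte e9 is é in windows-1252 but Θ (U+0398) in code page 437, and Θ encoded as UTF-8 is ce 98. So if Java emits é as the single byte e9 and PowerShell decodes the stream using a 437 console output encoding before writing the file as UTF-8, ce 98 is what you would get. The Python sketch below walks through that chain; the helper name is made up for illustration.

```python
def misread_roundtrip(raw: bytes, console_encoding: str) -> bytes:
    """Decode raw bytes with the (wrong) console encoding, then re-encode as UTF-8.

    Hypothetical helper that models the assumed redirect behaviour.
    """
    return raw.decode(console_encoding).encode("utf-8")


# é as a single byte, as a windows-1252 writer would produce it.
single_byte_e_acute = "\u00e9".encode("cp1252")
print(single_byte_e_acute.hex())                              # e9

# The same byte read as code page 437 is Greek capital theta.
print(single_byte_e_acute.decode("cp437") == "\u0398")        # True

# Theta written back out as UTF-8 gives the two bytes from the question.
print(misread_roundtrip(single_byte_e_acute, "cp437").hex(" "))  # ce 98
```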