gunnarmorling / 1brc

1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java
https://www.morling.dev/blog/one-billion-row-challenge/
Apache License 2.0

Wrong names in Windows? #250

Open przemoc opened 8 months ago

przemoc commented 8 months ago

Disclaimer: I haven't played with Java for ~18 years, so maybe I'm doing something wrong.

PS D:\git\github.com\gunnarmorling\1brc> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      7.3.10
PSEdition                      Core
GitCommitId                    7.3.10
OS                             Microsoft Windows 10.0.22621
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0
PS D:\git\github.com\gunnarmorling\1brc> scoop install temurin21-jdk maven
...
PS D:\git\github.com\gunnarmorling\1brc> mvn clean verify
...

PS D:\git\github.com\gunnarmorling\1brc> java --version
openjdk 21.0.1 2023-10-17 LTS
OpenJDK Runtime Environment Temurin-21.0.1+12 (build 21.0.1+12-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.1+12 (build 21.0.1+12-LTS, mixed mode, sharing)
PS D:\git\github.com\gunnarmorling\1brc> chcp
Active code page: 437
PS D:\git\github.com\gunnarmorling\1brc> chcp 65001
Active code page: 65001
PS D:\git\github.com\gunnarmorling\1brc> $OutputEncoding

Preamble          :
BodyName          : utf-8
EncodingName      : Unicode (UTF-8)
HeaderName        : utf-8
WebName           : utf-8
WindowsCodePage   : 1200
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
IsSingleByte      : False
EncoderFallback   : System.Text.EncoderReplacementFallback
DecoderFallback   : System.Text.DecoderReplacementFallback
IsReadOnly        : True
CodePage          : 65001

PS D:\git\github.com\gunnarmorling\1brc> $Env:JAVA_TOOL_OPTIONS = "-Dfile.encoding=UTF8"
PS D:\git\github.com\gunnarmorling\1brc> java --class-path target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CreateMeasurements 1000000000
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
Wrote 50,000,000 measurements in 17124 ms
Wrote 100,000,000 measurements in 34366 ms
Wrote 150,000,000 measurements in 51380 ms
Wrote 200,000,000 measurements in 68445 ms
Wrote 250,000,000 measurements in 85397 ms
Wrote 300,000,000 measurements in 102491 ms
Wrote 350,000,000 measurements in 119489 ms
Wrote 400,000,000 measurements in 136484 ms
Wrote 450,000,000 measurements in 153494 ms
Wrote 500,000,000 measurements in 170461 ms
Wrote 550,000,000 measurements in 187471 ms
Wrote 600,000,000 measurements in 205101 ms
Wrote 650,000,000 measurements in 222205 ms
Wrote 700,000,000 measurements in 239340 ms
Wrote 750,000,000 measurements in 256477 ms
Wrote 800,000,000 measurements in 273675 ms
Wrote 850,000,000 measurements in 290896 ms
Wrote 900,000,000 measurements in 307993 ms
Wrote 950,000,000 measurements in 325116 ms
Created file with 1,000,000,000 measurements in 342196 ms
PS D:\git\github.com\gunnarmorling\1brc> java --class-path target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CalculateAverage >result.txt
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
PS D:\git\github.com\gunnarmorling\1brc> ug -m1 -o "Ab.ch." src/main/java/dev/morling/onebrc/CreateMeasurements.java measurements.txt result.txt
src/main/java/dev/morling/onebrc/CreateMeasurements.java
    80: Abéché

measurements.txt
   527: Abéché

result.txt
     1: AbΘchΘ

PS D:\git\github.com\gunnarmorling\1brc> ug --hexdump -m1 -o "Ab.ch." src/main/java/dev/morling/onebrc/CreateMeasurements.java measurements.txt result.txt
src/main/java/dev/morling/onebrc/CreateMeasurements.java
    80:
00000cc0  -- -- -- -- -- -- -- --  -- -- -- -- 41 62 c3 a9  |------------Ab..|
00000cd0  63 68 c3 a9 -- -- -- --  -- -- -- -- -- -- -- --  |ch..------------|

measurements.txt
   527:
00001c10  41 62 c3 a9 63 68 c3 a9  -- -- -- -- -- -- -- --  |Ab..ch..--------|

result.txt
     1:
00000030  41 62 ce 98 63 68 ce 98  -- -- -- -- -- -- -- --  |Ab..ch..--------|

I changed the code page to 65001 (UTF-8) and set JAVA_TOOL_OPTIONS to -Dfile.encoding=UTF8, hoping it would improve the situation, but it didn't change anything (originally I tested without those steps).

Can someone explain why there is ce 98 for é instead of c3 a9?
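One possible explanation, sketched in Java and not confirmed against PowerShell's internals: suppose the JVM wrote é as the single windows-1252 byte e9 (the cmd.exe hexdumps further down the thread do show e9), PowerShell decoded that byte using console code page 437, where e9 maps to Θ, and the `>` redirect then saved the text as UTF-8. The class and method names below are made up for illustration, and the three-step pipeline is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

// Hypothetical demo class; the pipeline it models is an assumption about what
// happens between the JVM and result.txt, not code taken from the 1brc repo.
final class MojibakeDemo {

    // Encode with `written`, misread the bytes as `assumed`, then save as UTF-8.
    static byte[] reencode(String text, Charset written, Charset assumed) {
        byte[] raw = text.getBytes(written);              // bytes the JVM emits
        String misread = new String(raw, assumed);        // what the shell believes it received
        return misread.getBytes(StandardCharsets.UTF_8);  // what the redirect stores
    }

    static String hex(byte[] bytes) {
        return HexFormat.ofDelimiter(" ").formatHex(bytes);
    }

    public static void main(String[] args) {
        Charset ansi = Charset.forName("windows-1252"); // assumed charset of System.out
        Charset oem = Charset.forName("IBM437");         // console code page 437
        String name = "Ab\u00e9ch\u00e9";                // "Abéché", escaped to avoid source-encoding issues
        System.out.println(hex(name.getBytes(ansi)));
        System.out.println(hex(reencode(name, ansi, oem)));
    }
}
```

The second line printed is 41 62 ce 98 63 68 ce 98, the same bytes as in result.txt, which is at least consistent with the assumed pipeline.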

00gh commented 8 months ago

It seems the "> result.txt" redirect is influenced by PS. You should try cmd.exe instead of PS.

Good Luck.

More information and suggestions regarding Code Page issues with PowerShell on Stack Overflow: https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window

przemoc commented 8 months ago

For cmd it's even worse:

D:\git\github.com\gunnarmorling\1brc>chcp
Active code page: 437

D:\git\github.com\gunnarmorling\1brc>java --class-path target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CalculateAverage >result-cmd.txt

D:\git\github.com\gunnarmorling\1brc>chcp 65001
Active code page: 65001

D:\git\github.com\gunnarmorling\1brc>java --class-path target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CalculateAverage >result-cmd-65001.txt

D:\git\github.com\gunnarmorling\1brc>ug --hexdump -m1 -o "Ab.ch." result-cmd.txt result-cmd-65001.txt
result-cmd.txt
     1:
00000030  41 62 e9 63 68 e9 -- --  -- -- -- -- -- -- -- --  |Ab.ch.----------|

result-cmd-65001.txt
     1:
00000030  41 62 e9 63 68 e9 -- --  -- -- -- -- -- -- -- --  |Ab.ch.----------|

No c3 a9 in sight, only a single byte (e9), which makes ugrep think it is a binary file.
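The lone e9 suggests that System.out is not encoding as UTF-8 here, whatever -Dfile.encoding says. A minimal diagnostic sketch, assuming a recent JDK: `native.encoding` and `stdout.encoding` are standard system properties on newer JDKs (they print as null where unsupported), while the class name and the `utf8` helper are hypothetical.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the 1brc repo.
final class StdoutEncodingCheck {

    // Wrap any stream in a PrintStream that always encodes text as UTF-8.
    static PrintStream utf8(OutputStream out) {
        return new PrintStream(out, true, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Diagnostics go to stderr so they stay out of a redirected stdout.
        for (String key : new String[] {"file.encoding", "native.encoding", "stdout.encoding"}) {
            System.err.println(key + " = " + System.getProperty(key));
        }
        System.err.println("default charset = " + Charset.defaultCharset());

        // Explicit UTF-8 output, independent of whatever charset System.out picked.
        utf8(System.out).println("Ab\u00e9ch\u00e9");
    }
}
```

If the properties confirm a non-UTF-8 stdout charset, passing -Dstdout.encoding=UTF-8 on JDK 19 or newer may be worth trying; that is untested in this thread.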

But thanks to your SO link I realized I should have looked in PS at [console]::InputEncoding and [console]::OutputEncoding, not $OutputEncoding as I did before.

PS D:\git\github.com\gunnarmorling\1brc> [console]::InputEncoding

Preamble          :
BodyName          : utf-8
EncodingName      : Unicode (UTF-8)
HeaderName        : utf-8
WebName           : utf-8
WindowsCodePage   : 1200
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
IsSingleByte      : False
EncoderFallback   : System.Text.EncoderReplacementFallback
DecoderFallback   : System.Text.DecoderReplacementFallback
IsReadOnly        : True
CodePage          : 65001

PS D:\git\github.com\gunnarmorling\1brc> [console]::OutputEncoding

IsSingleByte      : True
EncodingName      : OEM United States
WebName           : ibm437
HeaderName        : ibm437
BodyName          : ibm437
Preamble          :
WindowsCodePage   :
IsBrowserDisplay  :
IsBrowserSave     :
IsMailNewsDisplay :
IsMailNewsSave    :
EncoderFallback   : System.Text.InternalEncoderBestFitFallback
DecoderFallback   : System.Text.InternalDecoderBestFitFallback
IsReadOnly        : False
CodePage          : 437

This showed that the console's output encoding is not UTF-8, but whether that is the source of the problem remains to be seen.

I haven't tried turning on the corresponding option in control intl.cpl yet, but I tried the following:

PS D:\git\github.com\gunnarmorling\1brc> [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
PS D:\git\github.com\gunnarmorling\1brc> [console]::OutputEncoding

Preamble          :
BodyName          : utf-8
EncodingName      : Unicode (UTF-8)
HeaderName        : utf-8
WebName           : utf-8
WindowsCodePage   : 1200
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
IsSingleByte      : False
EncoderFallback   : System.Text.EncoderReplacementFallback
DecoderFallback   : System.Text.DecoderReplacementFallback
IsReadOnly        : False
CodePage          : 65001

Retrying the test gives a new flavour of failure:

PS D:\git\github.com\gunnarmorling\1brc> java --class-path target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CalculateAverage >result-oe-utf8.txt
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
PS D:\git\github.com\gunnarmorling\1brc> ug --hexdump -m1 -o "Ab.ch." result-oe-utf8.txt
     1:
00000030  41 62 ef bf bd 63 68 ef  bf bd -- -- -- -- -- --  |Ab...ch...------|

Instead of c3 a9, we got ef bf bd...
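The bytes ef bf bd are the UTF-8 encoding of U+FFFD, the replacement character, which is what a lenient UTF-8 decoder substitutes for input it cannot decode. A small sketch of that effect, assuming Java still emits the single byte e9 seen in the cmd.exe dumps above and that PowerShell now decodes it as UTF-8 (the class and method names are hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

// Hypothetical demo class; it models the assumed decode step, not PowerShell itself.
final class ReplacementCharDemo {

    // Lenient decode (malformed input becomes U+FFFD), then re-encode as UTF-8.
    static byte[] lenientRoundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Strict check: a fresh decoder reports malformed input by throwing.
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "Abéché" as it appears in result-cmd.txt: é is the single byte e9.
        byte[] raw = {0x41, 0x62, (byte) 0xe9, 0x63, 0x68, (byte) 0xe9};
        System.out.println(isValidUtf8(raw));
        System.out.println(HexFormat.ofDelimiter(" ").formatHex(lenientRoundTrip(raw)));
    }
}
```

This prints false followed by 41 62 ef bf bd 63 68 ef bf bd, matching the hexdump of result-oe-utf8.txt.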

Spiderpig86 commented 8 months ago

@przemoc Were you able to get it resolved? I am running into the same issues you described above. Git bash does not work properly either for me and I don't want to use WSL for this.

ddimtirov commented 8 months ago

I did my first 2 days on PowerShell - nothing special around file generation, except that Java recognizes that it is on Windows and outputs CRLF line endings.

Turns out this breaks many submissions that assume single-byte line endings, and while it is fixable with Java args, in the end I checked out the repo under WSL and used IDEA remoting with a WSL backend, which worked quite decently.
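For anyone who would rather make a parser tolerant of both endings, here is a rough sketch, not taken from any submission: it splits on '\n' and drops one preceding '\r', so LF and CRLF files yield the same records. The class name, method name and sample rows are invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; real submissions work on memory-mapped chunks, not small arrays.
final class LineEndingDemo {

    // Split on '\n' and strip a single trailing '\r' from each line.
    static List<String> lines(byte[] data) {
        List<String> result = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                int end = (i > start && data[i - 1] == '\r') ? i - 1 : i;
                result.add(new String(data, start, end - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        if (start < data.length) { // final line without a terminator
            result.add(new String(data, start, data.length - start, StandardCharsets.UTF_8));
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented sample rows in the station;temperature shape.
        byte[] crlf = "Ab\u00e9ch\u00e9;29.4\r\nHamburg;12.0\r\n".getBytes(StandardCharsets.UTF_8);
        byte[] lf = "Ab\u00e9ch\u00e9;29.4\nHamburg;12.0\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(lines(crlf).equals(lines(lf)));
    }
}
```

Scanning for the byte 0x0a is safe in UTF-8 because that value never occurs inside a multi-byte sequence.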

przemoc commented 6 months ago

> @przemoc Were you able to get it resolved? I am running into the same issues you described above. Git bash does not work properly either for me and I don't want to use WSL for this.

No, I didn't spend more time on this and didn't get it resolved.