typelead / eta

The Eta Programming Language, a dialect of Haskell on the JVM
https://eta-lang.org
BSD 3-Clause "New" or "Revised" License

Handling of character encodings in output of eta executables #515

Open · jneira opened this issue 7 years ago

jneira commented 7 years ago
rahulmutt commented 7 years ago

Thanks for looking into this. This looks like a recurring problem - what should the final solution be? I'm not sure setting the encoding for loadString is the right fix, since that concerns the encoding at compile time (the source files), while this problem appears to be at runtime. Is it safe to always pass -Dfile.encoding=UTF-8 without setting the codepage?
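As a side note, here is a minimal, Eta-independent illustration of the runtime side of this, assuming a pre-JDK 18 JVM where file.encoding still determines the default charset (the class name is made up for illustration):

```java
import java.nio.charset.Charset;

// On older JVMs System.out encodes with the default charset, which
// -Dfile.encoding controls, while the Windows console decodes bytes with the
// code page reported by chcp. If the two disagree, accented output is garbled.
public class EncodingCheck {
    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println("\u00f1"); // ñ - garbled when the console code page differs
    }
}
```

Running it under different chcp settings, with and without -Dfile.encoding=UTF-8, reproduces the mismatch.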

jneira commented 7 years ago

In case the issue is related to the encoding handling in HSIConv.java in base, I've traced it:

```
HSIConv: iconv with id: 3036687, from: UTF-32BE, to: windows-1252
Init in buffer:
initBuffer: address: 1064988, limit: 4, buffer: java.nio.DirectByteBuffer[pos=16412 lim=16416 cap=1048576]
Init out buffer:
initBuffer: address: 1048701, limit: 8095, buffer: java.nio.DirectByteBuffer[pos=125 lim=8220 cap=1048576]
String decoded: ñ
Encoding result: UNDERFLOW
After encoding:
IN: buffer: java.nio.DirectByteBuffer[pos=16416 lim=16416 cap=1048576]
OUT: buffer: java.nio.DirectByteBuffer[pos=126 lim=8220 cap=1048576]
Bytes read: 4
Bytes written: 1
Bytes pre encoding: 000000F1
Bytes recoded: F1
String recoded: ñ
±
```

So it seems the problem could be in printing out the string. Unfortunately, I am not able to trace its execution.

rahulmutt commented 7 years ago

One thing you can do is add debug output to c_write() in base/java-utils/Utils.java to see which bytes are actually being written to the stdout channel.
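A rough sketch of that kind of debug logging, assuming only that the pending bytes are reachable through a direct ByteBuffer plus a count; the real c_write in base/java-utils/Utils.java may be shaped differently, so treat this as an illustration rather than a drop-in patch:

```java
import java.nio.ByteBuffer;

// Illustration only: hex-dump the bytes that are about to be written to stdout.
// The actual c_write works on Eta's off-heap buffers, so the surrounding
// plumbing differs; this shows just the logging idea.
final class WriteDebug {
    static void logPendingBytes(ByteBuffer buf, int count) {
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < count; i++) {
            hex.append(String.format("%02X", buf.get(buf.position() + i) & 0xFF));
        }
        System.err.println("c_write: bytes to write= " + hex);
    }
}
```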

jneira commented 7 years ago

Hi, after tracing the c_write method:

```
HSIConv: Opening iconv from UTF-32BE to windows-1252
HSIConv: iconv with id: 3036687, from: UTF-32BE, to: windows-1252
Init in buffer:
initBuffer: address: 1064988, limit: 4, buffer: java.nio.DirectByteBuffer[pos=16412 lim=16416 cap=1048576]
Init out buffer:
initBuffer: address: 1048700, limit: 8096, buffer: java.nio.DirectByteBuffer[pos=124 lim=8220 cap=1048576]
String decoded: ó
Encoding result: UNDERFLOW
After encoding:
IN: buffer: java.nio.DirectByteBuffer[pos=16416 lim=16416 cap=1048576]
OUT: buffer: java.nio.DirectByteBuffer[pos=125 lim=8220 cap=1048576]
Bytes read: 4
Bytes written: 1
Bytes pre encoding: 000000F3
Bytes recoded: F3
String recoded: ó
HSIConv: iconv with id: 3036687, from: UTF-32BE, to: windows-1252
Init in buffer:
initBuffer: address: 1064988, limit: 4, buffer: java.nio.DirectByteBuffer[pos=16412 lim=16416 cap=1048576]
Init out buffer:
initBuffer: address: 1048701, limit: 8095, buffer: java.nio.DirectByteBuffer[pos=125 lim=8220 cap=1048576]
String decoded: ñ
Encoding result: UNDERFLOW
After encoding:
IN: buffer: java.nio.DirectByteBuffer[pos=16416 lim=16416 cap=1048576]
OUT: buffer: java.nio.DirectByteBuffer[pos=126 lim=8220 cap=1048576]
Bytes read: 4
Bytes written: 1
Bytes pre encoding: 000000F1
Bytes recoded: F1
String recoded: ñ
c_write: address=1048700, count=2
c_write: bytes to write= F3F1
¾±
```
| Bytes | Cp1252 | Cp850 |
| ----- | ------ | ----- |
| F3    | ó      | ¾     |
| F1    | ñ      | ±     |
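The table is easy to double-check with plain Java, independent of Eta, by decoding the same two bytes under each code page:

```java
import java.nio.charset.Charset;

// Decode the two bytes from the trace under both code pages: windows-1252
// yields the intended "óñ", while the OEM code page Cp850 yields "¾±",
// which is exactly what the console printed.
public class CodepageCheck {
    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xF3, (byte) 0xF1};
        System.out.println(new String(bytes, Charset.forName("windows-1252")));
        System.out.println(new String(bytes, Charset.forName("Cp850")));
    }
}
```

So the bytes are produced for windows-1252, but the console decodes them as Cp850 and shows ¾±.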
jneira commented 7 years ago

I've written a script that reads the console's actual output encoding and passes it to the java call:

```bat
@echo off
set DIR=%~dp0
rem Ask PowerShell for the console's current output encoding (e.g. ibm850, utf-8)
for /f "tokens=*" %%i in ('powershell -Command "[console]::OutputEncoding.BodyName"') do set cp=%%i
rem Remove any stray spaces from the captured value
set cp=%cp: =%
echo %cp%
rem Launch the Eta program with file.encoding matching the console encoding
java -Dfile.encoding=%cp% -classpath "%DIR%\eta-test.launcher.jar" eta.main %*
```
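For example, on a console where chcp reports code page 850, [console]::OutputEncoding.BodyName typically returns ibm850, which Java also accepts as a charset name, so the launcher effectively runs java -Dfile.encoding=ibm850 (the exact name reported can vary by Windows/.NET version).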
jneira commented 7 years ago

Some tests:

```
D:\ws\eta\eta-test>java -Dfile.encoding=UTF-8 Test नमस्ते 你好 привет Capital EX 🦎's first module!e!
```

* Eta

```haskell
module Main where
main = putStrLn "नमस्ते" >> putStrLn "你好" >> putStrLn "привет"
```

```
app\Main.hs:6:50: lexical error in string/character literal at character '\129422'
```

* But with stack/ghc we got the same error:

```
D:\ws\eta\eta-test>stack build
......
eta-test-0.1.0.0: build (exe)
Preprocessing executable 'eta-test' for eta-test-0.1.0.0...
[1 of 1] Compiling Main ( app\Main.hs, .stack-work\dist\95439361\build\eta-test\eta-test-tmp\Main.o )

D:\ws\eta\eta-test\app\Main.hs:6:50: lexical error in string/character literal at character '\129422'

--  While building package eta-test-0.1.0.0 using:
      D:\stack\rs\setup-exe-cache\i386-windows\Cabal-simple_Z6RU0evB_1.22.5.0_ghc-7.10.3.exe --builddir=.stack-work\dist\95439361 build exe:eta-test --ghc-options " -ddump-hi -ddump-to-file"
    Process exited with code: ExitFailure 1
```

jneira commented 7 years ago

Another option is to force chcp to UTF-8 (code page 65001) and restore the original code page after executing java, but then the console font must be a TrueType font such as Lucida Console, since the default raster fonts cannot display the UTF-8 output:

```bat
@echo off
set DIR=%~dp0
rem Remember the current code page (chcp prints "Active code page: NNN")
for /f "tokens=2 delims=:" %%i in ('chcp') do set cp= %%i
rem Switch the console to UTF-8
chcp 65001 > nul
java -Dfile.encoding=UTF-8 -classpath "%DIR%\eta-test.launcher.jar" eta.main %*
rem Restore the original code page
chcp %cp% > nul
```
rahulmutt commented 7 years ago

@jneira In that case, you can go ahead and implement your changes in etlas, where the run script is generated.

jneira commented 7 years ago

Thinking more carefully about the possible solutions, I think the above scripts are not general enough to cover most user/corner cases. Maybe we should be less ambitious and:

rahulmutt commented 7 years ago

I am in favour of adding the environment variables to control the execution of etlas run. It is currently a pain to add extra options like stack/heap sizes without manually going into the script and copying the command. I think we should also abstract out java to JAVA_PROG in case the user doesn't want to use the default java on their PATH.

jneira commented 7 years ago

@rahulmutt OK, I'll try to make a PR with:

rahulmutt commented 7 years ago

Looks good. And yes, JAVA_HOME is a better option.

jneira commented 6 years ago

After https://github.com/typelead/etlas/pull/22 we should update the documentation with:

rahulmutt commented 6 years ago

@jneira Any more changes required in the docs?

jneira commented 6 years ago

Maybe it would be useful to show somewhere in the docs how to parameterize eta executions with $ETA_JAVA_PROGRAM, $JAVA_HOME, $JAVA_ARGS and friends. I think a suitable section would be the etlas user guide.