typelead / eta

The Eta Programming Language, a dialect of Haskell on the JVM
https://eta-lang.org
BSD 3-Clause "New" or "Revised" License

Handling of character encodings in output of eta executables #515

Open · jneira opened this issue 7 years ago

jneira commented 7 years ago
rahulmutt commented 7 years ago

Thanks for looking into this. This looks like a recurring problem - what should the final solution be? I'm not sure setting the encoding for loadString is the right fix, since that concerns the encoding at compile time (the source files), while this problem appears to be at runtime. Is it safe to always pass -Dfile.encoding=UTF-8 without setting the codepage?
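As a side note, here is a minimal, Eta-independent illustration of the runtime side of this, assuming a pre-JDK 18 JVM where file.encoding still determines the default charset (the class name is made up for illustration):

```java
import java.nio.charset.Charset;

// On older JVMs System.out encodes with the default charset, which
// -Dfile.encoding controls, while the Windows console decodes bytes with the
// code page reported by chcp. If the two disagree, accented output is garbled.
public class EncodingCheck {
    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println("\u00f1"); // ñ - garbled when the console code page differs
    }
}
```

Running it under different chcp settings, with and without -Dfile.encoding=UTF-8, reproduces the mismatch.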

jneira commented 7 years ago

In case the issue is related to the encoding handling in HSIConv.java in base, I've traced it:

```
HSIConv: iconv with id: 3036687, from: UTF-32BE, to: windows-1252
Init in buffer:
initBuffer: address: 1064988, limit: 4, buffer: java.nio.DirectByteBuffer[pos=16412 lim=16416 cap=1048576]
Init out buffer:
initBuffer: address: 1048701, limit: 8095, buffer: java.nio.DirectByteBuffer[pos=125 lim=8220 cap=1048576]
String decoded: ñ
Encoding result: UNDERFLOW
After encoding:
IN: buffer: java.nio.DirectByteBuffer[pos=16416 lim=16416 cap=1048576]
OUT: buffer: java.nio.DirectByteBuffer[pos=126 lim=8220 cap=1048576]
Bytes read: 4
Bytes written: 1
Bytes pre encoding: 000000F1
Bytes recoded: F1
String recoded: ñ
±
```

So it seems the problem could be in printing out the string. Unfortunately, I am not able to trace its execution.

rahulmutt commented 7 years ago

One thing you can do is add debug output to c_write() in base/java-utils/Utils.java to see which bytes are actually being written to the stdout channel.
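A rough sketch of that kind of debug logging, assuming only that the pending bytes are reachable through a direct ByteBuffer plus a count; the real c_write in base/java-utils/Utils.java may be shaped differently, so treat this as an illustration rather than a drop-in patch:

```java
import java.nio.ByteBuffer;

// Illustration only: hex-dump the bytes that are about to be written to stdout.
// The actual c_write works on Eta's off-heap buffers, so the surrounding
// plumbing differs; this shows just the logging idea.
final class WriteDebug {
    static void logPendingBytes(ByteBuffer buf, int count) {
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < count; i++) {
            hex.append(String.format("%02X", buf.get(buf.position() + i) & 0xFF));
        }
        System.err.println("c_write: bytes to write= " + hex);
    }
}
```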

jneira commented 7 years ago

Hi, after tracing the c_write method:

```
HSIConv: Opening iconv from UTF-32BE to windows-1252
HSIConv: iconv with id: 3036687, from: UTF-32BE, to: windows-1252
Init in buffer:
initBuffer: address: 1064988, limit: 4, buffer: java.nio.DirectByteBuffer[pos=16412 lim=16416 cap=1048576]
Init out buffer:
initBuffer: address: 1048700, limit: 8096, buffer: java.nio.DirectByteBuffer[pos=124 lim=8220 cap=1048576]
String decoded: ó
Encoding result: UNDERFLOW
After encoding:
IN: buffer: java.nio.DirectByteBuffer[pos=16416 lim=16416 cap=1048576]
OUT: buffer: java.nio.DirectByteBuffer[pos=125 lim=8220 cap=1048576]
Bytes read: 4
Bytes written: 1
Bytes pre encoding: 000000F3
Bytes recoded: F3
String recoded: ó
HSIConv: iconv with id: 3036687, from: UTF-32BE, to: windows-1252
Init in buffer:
initBuffer: address: 1064988, limit: 4, buffer: java.nio.DirectByteBuffer[pos=16412 lim=16416 cap=1048576]
Init out buffer:
initBuffer: address: 1048701, limit: 8095, buffer: java.nio.DirectByteBuffer[pos=125 lim=8220 cap=1048576]
String decoded: ñ
Encoding result: UNDERFLOW
After encoding:
IN: buffer: java.nio.DirectByteBuffer[pos=16416 lim=16416 cap=1048576]
OUT: buffer: java.nio.DirectByteBuffer[pos=126 lim=8220 cap=1048576]
Bytes read: 4
Bytes written: 1
Bytes pre encoding: 000000F1
Bytes recoded: F1
String recoded: ñ
c_write: address=1048700, count=2
c_write: bytes to write= F3F1
¾±
```
| Bytes | Cp1252 | Cp850 |
| ----- | ------ | ----- |
| F3    | ó      | ¾     |
| F1    | ñ      | ±     |
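The table is easy to double-check with plain Java, independent of Eta, by decoding the same two bytes under each code page:

```java
import java.nio.charset.Charset;

// Decode the two bytes from the trace under both code pages: windows-1252
// yields the intended "óñ", while the OEM code page Cp850 yields "¾±",
// which is exactly what the console printed.
public class CodepageCheck {
    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xF3, (byte) 0xF1};
        System.out.println(new String(bytes, Charset.forName("windows-1252")));
        System.out.println(new String(bytes, Charset.forName("Cp850")));
    }
}
```

So the bytes are produced for windows-1252, but the console decodes them as Cp850 and shows ¾±.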
jneira commented 7 years ago

I've written a script that reads the console's actual output encoding and passes it to the java call:

```bat
@echo off
set DIR=%~dp0
rem Ask PowerShell for the console's current output encoding (e.g. ibm850, utf-8)
for /f "tokens=*" %%i in ('powershell -Command "[console]::OutputEncoding.BodyName"') do set cp=%%i
rem Remove any stray spaces from the captured value
set cp=%cp: =%
echo %cp%
rem Launch the Eta program with file.encoding matching the console encoding
java -Dfile.encoding=%cp% -classpath "%DIR%\eta-test.launcher.jar" eta.main %*
```
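For example, on a console where chcp reports code page 850, [console]::OutputEncoding.BodyName typically returns ibm850, which Java also accepts as a charset name, so the launcher effectively runs java -Dfile.encoding=ibm850 (the exact name reported can vary by Windows/.NET version).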
jneira commented 7 years ago

Some tests:

```
D:\ws\eta\eta-test>java -Dfile.encoding=UTF-8 Test नमस्ते 你好 привет Capital EX 🦎's first module!e!
```

* Eta

```haskell
module Main where
main = putStrLn "नमस्ते" >> putStrLn "你好" >> putStrLn "привет"
```

```
app\Main.hs:6:50: lexical error in string/character literal at character '\129422'
```

* But with stack/ghc we got the same error:

```
D:\ws\eta\eta-test>stack build
......
eta-test-0.1.0.0: build (exe)
Preprocessing executable 'eta-test' for eta-test-0.1.0.0...
[1 of 1] Compiling Main ( app\Main.hs, .stack-work\dist\95439361\build\eta-test\eta-test-tmp\Main.o )

D:\ws\eta\eta-test\app\Main.hs:6:50: lexical error in string/character literal at character '\129422'

--  While building package eta-test-0.1.0.0 using:
      D:\stack\rs\setup-exe-cache\i386-windows\Cabal-simple_Z6RU0evB_1.22.5.0_ghc-7.10.3.exe --builddir=.stack-work\dist\95439361 build exe:eta-test --ghc-options " -ddump-hi -ddump-to-file"
    Process exited with code: ExitFailure 1
```

jneira commented 7 years ago

Another option is to force chcp to UTF-8 (code page 65001) and restore the original code page after executing java, but then the console font must be a TrueType font such as Lucida Console, since the default raster fonts cannot display the UTF-8 output:

```bat
@echo off
set DIR=%~dp0
rem Remember the current code page (chcp prints "Active code page: NNN")
for /f "tokens=2 delims=:" %%i in ('chcp') do set cp= %%i
rem Switch the console to UTF-8
chcp 65001 > nul
java -Dfile.encoding=UTF-8 -classpath "%DIR%\eta-test.launcher.jar" eta.main %*
rem Restore the original code page
chcp %cp% > nul
```
rahulmutt commented 7 years ago

@jneira In that case, you can go ahead and implement your changes in etlas, where the run script is generated.

jneira commented 7 years ago

Thinking more carefully about the possible solutions, I think the above scripts are not general enough to cover most user/corner cases. Maybe we should be less ambitious and:

rahulmutt commented 7 years ago

I am in favour of adding the environment variables to control the execution of etlas run. It is currently a pain to add extra options like stack/heap sizes without manually going into the script and copying the command. I think we should also abstract out java to JAVA_PROG in case the user doesn't want to use the default java on their PATH.

jneira commented 7 years ago

@rahulmutt OK, I'll try to make a PR with:

rahulmutt commented 7 years ago

Looks good. And yes, JAVA_HOME is a better option.

jneira commented 6 years ago

After https://github.com/typelead/etlas/pull/22 we should update the documentation with:

rahulmutt commented 6 years ago

@jneira Any more changes required in the docs?

jneira commented 6 years ago

Maybe it would be useful to show somewhere in the docs how to parameterize eta executions with $ETA_JAVA_PROGRAM, $JAVA_HOME, $JAVA_ARGS and friends. I think a suitable section would be the etlas user guide.