jneira opened this issue 7 years ago
Thanks for looking into this. This looks like a recurring problem: what should the final solution be? I'm not sure setting the encoding for `loadString` is the right fix, since that concerns the encoding at compile time (the source files), while this problem appears at runtime. Is it safe to always pass `-Dfile.encoding=UTF-8` without setting the codepage?
`-Dfile.encoding=UTF-8` without the other changes (font and console codepage) doesn't throw an error; it simply prints the same wrong char as before (f.e. `±` for `ñ`).
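That symptom is consistent with a charset mismatch between what the JVM writes and what the console decodes. A minimal Java sketch of the effect (charset names taken from the iconv traces in this thread; `Cp850` is the usual default OEM codepage on Western Windows):

```java
import java.nio.charset.Charset;

public class MojibakeDemo {
    public static void main(String[] args) {
        // "ñ" encoded with the JVM-side charset from the traces (windows-1252) is one byte: 0xF1
        byte[] bytes = "ñ".getBytes(Charset.forName("windows-1252"));
        // A console running code page 850 decodes 0xF1 as "±", hence the wrong glyph
        String rendered = new String(bytes, Charset.forName("Cp850"));
        System.out.println(rendered); // ±
    }
}
```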
In case the issue was related to the encoding done in `HSIConv.java` in base, I've traced it:
```
HSIConv: iconv with id: 3036687, from: UTF-32BE, to: windows-1252
Init in buffer:
initBuffer: address: 1064988, limit: 4, buffer: java.nio.DirectByteBuffer[pos=16412 lim=16416 cap=1048576]
Init out buffer:
initBuffer: address: 1048701, limit: 8095, buffer: java.nio.DirectByteBuffer[pos=125 lim=8220 cap=1048576]
String decoded: ñ
Encoding result: UNDERFLOW
After encoding:
IN: buffer: java.nio.DirectByteBuffer[pos=16416 lim=16416 cap=1048576]
OUT: buffer: java.nio.DirectByteBuffer[pos=126 lim=8220 cap=1048576]
Bytes read: 4
Bytes written: 1
Bytes pre encoding: 000000F1
Bytes recoded: F1
String recoded: ñ
±
```
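The trace can be reproduced with plain `java.nio` (this uses `CharsetEncoder` rather than the native iconv path that `HSIConv` actually wraps, so it's only an analogy): encoding `ñ` to windows-1252 consumes all input, reports `UNDERFLOW`, and emits the single byte `F1`, matching the log above.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

public class EncodeTrace {
    public static void main(String[] args) {
        CharsetEncoder enc = Charset.forName("windows-1252").newEncoder();
        CharBuffer in = CharBuffer.wrap("ñ");           // the decoded string from the trace
        ByteBuffer out = ByteBuffer.allocate(16);
        CoderResult result = enc.encode(in, out, true);  // endOfInput = true
        System.out.println(result);                      // UNDERFLOW: all input was consumed
        out.flip();
        System.out.printf("Bytes recoded: %02X%n", out.get()); // F1
    }
}
```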
So it seems the problem could be in printing out the string. Unfortunately, I am not able to trace its execution.
One thing you can do is add a debugging output to `c_write()` in `base/java-utils/Utils.java`, to see what bytes are actually being written to the stdout channel.
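A debug print along these lines would show the outgoing bytes (a sketch only; the actual signature of `c_write()` in `Utils.java` may differ). Note the `& 0xFF` mask: without it, `%02X` sign-extends negative bytes to eight hex digits.

```java
public class WriteTrace {
    // Hex-dump helper for logging the bytes passed to a write call
    static String toHex(byte[] bytes, int off, int len) {
        StringBuilder sb = new StringBuilder();
        for (int i = off; i < off + len; i++) {
            sb.append(String.format("%02X", bytes[i] & 0xFF)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] buf = { (byte) 0xF3, (byte) 0xF1 };
        System.err.println("c_write: bytes to write= " + toHex(buf, 0, buf.length));
    }
}
```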
Hi, after tracing the `c_write` method:
```
HSIConv: Opening iconv from UTF-32BE to windows-1252
HSIConv: iconv with id: 3036687, from: UTF-32BE, to: windows-1252
Init in buffer:
initBuffer: address: 1064988, limit: 4, buffer: java.nio.DirectByteBuffer[pos=16412 lim=16416 cap=1048576]
Init out buffer:
initBuffer: address: 1048700, limit: 8096, buffer: java.nio.DirectByteBuffer[pos=124 lim=8220 cap=1048576]
String decoded: ó
Encoding result: UNDERFLOW
After encoding:
IN: buffer: java.nio.DirectByteBuffer[pos=16416 lim=16416 cap=1048576]
OUT: buffer: java.nio.DirectByteBuffer[pos=125 lim=8220 cap=1048576]
Bytes read: 4
Bytes written: 1
Bytes pre encoding: 000000F3
Bytes recoded: F3
String recoded: ó
HSIConv: iconv with id: 3036687, from: UTF-32BE, to: windows-1252
Init in buffer:
initBuffer: address: 1064988, limit: 4, buffer: java.nio.DirectByteBuffer[pos=16412 lim=16416 cap=1048576]
Init out buffer:
initBuffer: address: 1048701, limit: 8095, buffer: java.nio.DirectByteBuffer[pos=125 lim=8220 cap=1048576]
String decoded: ñ
Encoding result: UNDERFLOW
After encoding:
IN: buffer: java.nio.DirectByteBuffer[pos=16416 lim=16416 cap=1048576]
OUT: buffer: java.nio.DirectByteBuffer[pos=126 lim=8220 cap=1048576]
Bytes read: 4
Bytes written: 1
Bytes pre encoding: 000000F1
Bytes recoded: F1
String recoded: ñ
c_write: address=1048700, count=2
c_write: bytes to write= F3F1
¾±
```
| Bytes | Cp1250 | Cp850 |
|-------|--------|-------|
| F3    | ó      | ¾     |
| F1    | ñ      | ±     |
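The mapping can be checked from Java by decoding the written bytes `F3 F1` with each charset. Note the encoder in the trace actually targets windows-1252 (where `F3` is `ó` and `F1` is `ñ`), so that is what the sketch below uses for the "correct" side:

```java
import java.nio.charset.Charset;

public class CodePageCheck {
    public static void main(String[] args) {
        byte[] written = { (byte) 0xF3, (byte) 0xF1 }; // bytes from the c_write trace
        // Decoded with the encoder's own charset (windows-1252) the bytes are correct...
        System.out.println(new String(written, Charset.forName("windows-1252"))); // óñ
        // ...but decoded with the console's default code page 850 they turn into mojibake
        System.out.println(new String(written, Charset.forName("Cp850")));        // ¾±
    }
}
```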
With `chcp 1250` the output is correct.
Matching the JVM encoding to the active codepage also works:

```
java -jar -Dfile.encoding=Cp850 dist\build\eta-test\eta-test.launcher.jar
```

So another option is handling `-Dfile.encoding` in the executable launcher itself, setting `UTF-8` by default.
I've written a script to set the actual console encoding in the `java` call:

```
@echo off
set DIR=%~dp0
for /f "tokens=*" %%i in ('powershell -Command "[console]::OutputEncoding.BodyName"') do set cp=%%i
set cp=%cp:=%
echo %cp%
java -Dfile.encoding=%cp% -classpath "%DIR%\eta-test.launcher.jar" eta.main %*
```
The line `set cp=%cp:=%` removes a single invalid UTF-8 char that the PowerShell output adds to the `cp` var in my current env.

Some tests:
```java
public class Test {
    public static void main(String args[]) {
        System.out.println("नमस्ते");
        System.out.println("你好");
        System.out.println("привет");
        System.out.println("Capital EX 🦎's first module!");
    }
}
```
```
D:\ws\eta\eta-test>chcp
Active code page: 65001

D:\ws\eta\eta-test>java -Dfile.encoding=UTF-8 Test
नमस्ते
你好
привет
Capital EX 🦎's first module!
```
* Eta
```haskell
module Main where
main = putStrLn "नमस्ते" >> putStrLn "你好" >> putStrLn "привет"
```
```
D:\ws\eta\eta-test>java -jar -Dfile.encoding=UTF-8 dist\build\eta-test\eta-test.launcher.jar
c_write: address=1048576, count=19
c_write: bytes to write= E0A4A8E0A4AEE0A4B8E0A58DE0A4A4E0A5870A
नमस्ते
c_write: address=1048576, count=7
c_write: bytes to write= E4BDA0E5A5BD0A
你好
c_write: address=1048576, count=13
c_write: bytes to write= D0BFD180D0B8D0B2D0B5D1820A
привет
```
However, adding the emoji literal to the source does not compile:

```haskell
main = putStrLn "नमस्ते" >> putStrLn "你好" >>
       putStrLn "привет" >> putStrLn "Capital EX 🦎's first module!"
```
```
D:\ws\eta\eta-test>etlas run
Preprocessing executable 'eta-test' for eta-test-0.1.0.0..
Building executable 'eta-test' for eta-test-0.1.0.0..
[1 of 1] Compiling Main ( app\Main.hs, dist\build\eta-test\eta-test-tmp\Main.jar )

app\Main.hs:6:50: lexical error in string/character literal at character '\129422'
```
* But with stack/ghc we got the same error:

```
D:\ws\eta\eta-test>stack build
......
eta-test-0.1.0.0: build (exe)
Preprocessing executable 'eta-test' for eta-test-0.1.0.0...
[1 of 1] Compiling Main ( app\Main.hs, .stack-work\dist\95439361\build\eta-test\eta-test-tmp\Main.o )

D:\ws\eta\eta-test\app\Main.hs:6:50: lexical error in string/character literal at character '\129422'

--  While building package eta-test-0.1.0.0 using:
      D:\stack\rs\setup-exe-cache\i386-windows\Cabal-simple_Z6RU0evB_1.22.5.0_ghc-7.10.3.exe --builddir=.stack-work\dist\95439361 build exe:eta-test --ghc-options " -ddump-hi -ddump-to-file"
    Process exited with code: ExitFailure 1
```
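For reference, `'\129422'` in the error message is the code point of the 🦎 emoji (U+1F98E), which lies outside the Basic Multilingual Plane, so it needs a surrogate pair in UTF-16. A quick check in plain Java:

```java
public class LizardCodePoint {
    public static void main(String[] args) {
        String lizard = "🦎";
        System.out.println(lizard.codePointAt(0)); // 129422 == 0x1F98E
        System.out.println(lizard.length());       // 2: stored as a UTF-16 surrogate pair
    }
}
```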
Another option is to force chcp to UTF-8 (and restore it after executing java), but the console font must be Lucida:
```
@echo off
set DIR=%~dp0
for /f "tokens=2 delims=:" %%i in ('chcp') do set cp= %%i
chcp 65001 > nul
java -Dfile.encoding=UTF-8 -classpath "%DIR%\eta-test.launcher.jar" eta.main %*
chcp %cp% > nul
```
@jneira In that case, you can go ahead and implement your changes in etlas, where the run script is generated.
Thinking more carefully about the possible solutions, I think the above scripts are not general enough to cover most user/corner cases. Maybe we should be less ambitious and:

```
set JAVA_ARGS="-Dfile.encoding=whatever"
etlas run -prog-arg1   # or my-executable -prog-arg1
# it would call: java "-Dfile.encoding=whatever" -classpath "%DIR%\eta-test.launcher.jar" eta.main -prog-arg1
```
I am in favour of adding the environment variables to control the execution of `etlas run`. It is currently a pain to add extra options like stack/heap without manually going into the script and copying the command. I think we should also abstract out `java` to `JAVA_PROG`, in case the user doesn't want to use the default `java` on their path.
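A sketch of what the generated launcher logic could look like; `JAVA_PROG` and `JAVA_ARGS` are the proposals from this thread, not an existing etlas feature, and the jar name is just an example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LauncherSketch {
    // Build the java command line from the proposed environment variables
    static List<String> buildCommand(String javaProg, String javaArgs, String... progArgs) {
        List<String> cmd = new ArrayList<>();
        // JAVA_PROG overrides the java binary; default to "java" on the PATH
        cmd.add(javaProg == null || javaProg.isEmpty() ? "java" : javaProg);
        // JAVA_ARGS are whitespace-separated extra JVM options
        if (javaArgs != null) {
            for (String a : javaArgs.trim().split("\\s+")) {
                if (!a.isEmpty()) cmd.add(a);
            }
        }
        cmd.addAll(Arrays.asList("-classpath", "eta-test.launcher.jar", "eta.main"));
        cmd.addAll(Arrays.asList(progArgs));
        return cmd;
    }

    public static void main(String[] args) {
        // In the real launcher these would come from System.getenv("JAVA_PROG") etc.
        System.out.println(buildCommand(null, "-Dfile.encoding=UTF-8", "-prog-arg1"));
    }
}
```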
@rahulmutt Ok, I'll try to make a PR with `$JAVA_ARGS` and `$JAVA_OPTS`. `$JAVA_HOME` seems more standard than `JAVA_PROG`, so I'll use it if you don't mind.

Looks good. And yes, `JAVA_HOME` is a better option.
After https://github.com/typelead/etlas/pull/22 we should update the documentation.
@jneira Any more changes required in the docs?
Maybe it would be useful to show somewhere in the docs the way to parameterize Eta executions with `$ETA_JAVA_PROGRAM`, `$JAVA_HOME`, `$JAVA_ARGS` and friends. I think a suitable section could be the etlas user guide.
Summing up: strings are loaded with `eta.runtime.io.MemoryManager.loadStringUTF8(String s)`, so other encodings (f.e. windows cp-1252) are not supported. The current workaround is the Lucida console font plus `chcp 65001` and `-Dfile.encoding=UTF-8`. A possible fix would be to read the runtime encoding (`System.getProperty("file.encoding")`) and use it to load Strings in `MemoryManager`. Maybe it would need more changes (ghc-prim?). However, I think this issue has low priority since Eta already handles UTF-8 Strings correctly.
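The proposed fix could be sketched like this (the method name and placement are illustrative; the real `MemoryManager` API in Eta's runtime may differ):

```java
import java.nio.charset.Charset;

public class LoadStringSketch {
    // The runtime encoding the JVM was started with (-Dfile.encoding=...)
    static final Charset RUNTIME_CHARSET =
        Charset.forName(System.getProperty("file.encoding", "UTF-8"));

    // Hypothetical replacement for a UTF-8-only loadStringUTF8(String s):
    // encode with the runtime charset instead of hard-coding UTF-8
    static byte[] loadStringBytes(String s) {
        return s.getBytes(RUNTIME_CHARSET);
    }

    public static void main(String[] args) {
        byte[] b = loadStringBytes("ñ");
        // With -Dfile.encoding=Cp1252 this is the single byte F1;
        // with -Dfile.encoding=UTF-8 it is the two bytes C3 B1.
        System.out.printf("%d byte(s)%n", b.length);
    }
}
```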