scala-ide / scala-worksheet

A Scala IDE plugin for a multi-line REPL (called worksheet)

UTF-8 output gets mangled in the Scala Worksheet #185

Open Blaisorblade opened 10 years ago

Blaisorblade commented 10 years ago

My Scala code (a lambda-calculus implementation) produces UTF-8 output. The worksheet is exactly what I'd want, except that it doesn't cope with UTF-8 program output. The whole project is using UTF-8 as far as I can tell, as the workspace is.

For instance, compare an output fragment, as seen by running the Scala REPL inside Eclipse: ((ℤ → ℤ) → ℤ → ℤ) → (ℤ → ℤ) → ℤ → ℤ) with what I get in the Worksheet:

((��� ��� ���) ��� ��� ��� ���) ��� (��� ��� ���) ��� ��� ��� ���)

Each Unicode character translates to three question marks because each of these characters takes 3 bytes in UTF-8 (they lie in the range U+0800 to U+FFFF, which is inside the BMP; characters outside the BMP would take 4 bytes).
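The byte counts can be checked with a small Java sketch. For ℤ (U+2124) and → (U+2192) it prints the code point, whether Java classifies it as a BMP code point, its UTF-8 byte count, and what the bytes look like when decoded with the wrong charset. The class and method names are hypothetical, and US-ASCII is chosen only for illustration: the thread does not establish which charset the worksheet actually used.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; names are hypothetical and not part of the worksheet code.
final class Utf8Lengths {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Encode as UTF-8, then decode with the wrong charset.
    // US-ASCII is an assumption made for the demonstration.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u2124 is the double-struck Z, \u2192 is the rightwards arrow.
        for (String s : new String[] {"\u2124", "\u2192"}) {
            int cp = s.codePointAt(0);
            System.out.printf("U+%04X bmp=%b utf8Bytes=%d misdecoded=%s%n",
                cp, Character.isBmpCodePoint(cp), utf8Length(s), misdecode(s));
        }
    }
}
```

Each byte of a multi-byte sequence is invalid on its own in US-ASCII, so every byte becomes one replacement character.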

This is with version 3.0.4 of Scala IDE. More precisely:

Scala Worksheet    0.2.3.v-2_11-201405200954-4f7988d    org.scalaide.worksheet.feature.feature.group    Scala IDE
Scala IDE for Eclipse    3.0.4.v-2_11-201405200946-c46f499    org.scala-ide.sdt.feature.feature.group    scala-ide.org

(Plus Scala Search & ScalaTest plugins, I could provide those version numbers if needed).

I've looked at the current source code (which maybe was a bad idea), and it seems that the conversion should be done purely by Eclipse libraries here, and I can't see anything wrong with that:

https://github.com/scala-ide/scala-worksheet/blob/0281642ce05e2420e72fbf1e7551f945c16b811d/org.scalaide.worksheet/src/org/scalaide/worksheet/runtime/ProgramExecutor.scala#L141

skyluc commented 10 years ago

Are you running on a non-UTF-8 system? From Java's point of view, almost all operating systems, except correctly configured Linux machines, use a default encoding other than UTF-8. This is relevant because the worksheet code is executed in a forked process, and it is likely that the encoding is not forced to UTF-8 there.
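To make the forked-process concern concrete, here is a minimal Java sketch of the two places an encoding can be pinned: the child JVM's command line and the decoding of the child's captured output. All names are hypothetical, and this is not the worksheet's launcher code. Whether the real launcher passes such a flag is exactly what is in question here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Illustrative sketch only; not the actual ProgramExecutor implementation.
final class ForkedJvmEncoding {
    // Command line for a child JVM with file.encoding set explicitly.
    static List<String> childCommand(String javaBinary, String mainClass) {
        return List.of(javaBinary, "-Dfile.encoding=UTF-8", mainClass);
    }

    // Decode captured child output explicitly as UTF-8 instead of
    // relying on the parent JVM's default charset.
    static String decodeOutput(InputStream childStdout) {
        try {
            return new String(childStdout.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(childCommand("java", "charset.TestCharset"));
        // Simulated child output; a real launcher would read Process.getInputStream().
        byte[] simulated = "\u2124 \u2192 \u2124".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeOutput(new ByteArrayInputStream(simulated)));
    }
}
```

If either side falls back to a platform default that is not UTF-8, the text is re-interpreted byte by byte and multi-byte characters are lost.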

Blaisorblade commented 10 years ago

Thanks for the prompt answer! I assumed this would be a problem when decoding from the stream, but you might still be right.

Do you agree that using the host configuration would be a bug?

I investigated a bit, and before answering your question, I'll give my analysis: Eclipse is correctly configured to use UTF-8 (according to this: http://stackoverflow.com/a/9181068/53974), and that should be enough. In practice, however, I also need to set -Dfile.encoding=UTF8 in eclipse.ini, and the worksheet works correctly if and only if that option is active. (After relaunching Eclipse, I also need to modify & save the worksheet to update the output.)
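For reference, the workaround amounts to placing the option among the JVM arguments in eclipse.ini, that is, after the -vmargs line. The fragment below shows only the relevant lines; everything else in the file is omitted and varies by installation.

```text
-vmargs
-Dfile.encoding=UTF8
```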

Analysis: Since the documented setting is inside Eclipse itself, what I'm doing seems to be a hack, needed because some code uses the default encoding instead of passing the Eclipse-configured one. Now, I don't envy the poor soul who's supposed to debug this (you forget to thread the encoding once and you have a bug), even though I suppose threading it through is needed for people configuring multiple encodings. So I'll be OK with any resolution other than "not-a-bug": for instance, I'd be happy with WontFix or a late milestone/low priority, as long as the workaround is documented.

Side note/additional issue: line breaking does not seem to be Unicode-aware at all. In practice:

  val test1T: Term = test1                        //> test1T  : ilc.feature.let.ANormalFormTest.v.Term = App(Abs(Var(id,((ℤ → 
                                                  //| ℤ) → ℤ → ℤ) → (ℤ → ℤ) → ℤ → ℤ),App(Abs(Var(id_i,�
                                                  //| � → ℤ),App(Abs(Var(apply,(ℤ → ℤ) → ℤ → ℤ),App(App(App(Var(

This may happen because the implementation works in terms of bytes: it seems to add newlines after a certain byte count, though I didn't run anything under a debugger to confirm:

https://github.com/scala-ide/scala-worksheet/blob/646a40c38c186f6f3d690a542bb6b9180e601318/org.scalaide.worksheet.runtime.library/src/main/scala/org/scalaide/worksheet/runtime/library/WorksheetSupport.scala#L19
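The difference between breaking on bytes and breaking on characters can be sketched in Java as follows. This is not the WorksheetSupport code, and the byte-counting behaviour is only the hypothesis stated above. The names are hypothetical, and the sketch ignores display width and grapheme clusters.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the worksheet's line-breaking implementation.
final class WrapSketch {
    // Break after at most maxCodePoints code points, never inside a character.
    static List<String> wrapByCodePoints(String text, int maxCodePoints) {
        if (maxCodePoints <= 0) {
            throw new IllegalArgumentException("maxCodePoints must be positive");
        }
        List<String> lines = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int remaining = text.codePointCount(start, text.length());
            int end = text.offsetByCodePoints(start, Math.min(maxCodePoints, remaining));
            lines.add(text.substring(start, end));
            start = end;
        }
        return lines;
    }

    // Break the UTF-8 bytes every maxBytes bytes and decode each chunk on its own.
    // A multi-byte character straddling a boundary is destroyed.
    static List<String> wrapByBytes(String text, int maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < utf8.length; i += maxBytes) {
            int len = Math.min(maxBytes, utf8.length - i);
            lines.add(new String(utf8, i, len, StandardCharsets.UTF_8));
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "\u2124 \u2192 \u2124"; // double-struck Z, arrow, double-struck Z
        System.out.println(wrapByCodePoints(text, 2));
        System.out.println(wrapByBytes(text, 5));
    }
}
```

With the byte-based variant, the first 5-byte chunk ends on the lead byte of the arrow, so that character is replaced on both sides of the break. This resembles the stray replacement characters at the line break in the worksheet output above.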


Quoting skyluc: "Are you running on a non-UTF-8 system?"

As far as I can tell, no. I'd be happy to try a test of your choice.

I'm using OS X 10.9, but almost everything else on my system is handling Unicode correctly. I say "almost" because IIRC some programs (TextEdit) still dare offer me "Mac OS Roman" as default encoding.

Regarding -Dfile.encoding=UTF8, most of my JVMs have that option (according to jvisualvm). Eclipse didn't, but still, both in the Scala REPL and in the worksheet, the property seems correctly set. However, setting -Dfile.encoding made a difference, not sure why.

The property reads the same in the Scala REPL, both inside and outside Eclipse:

scala> sys.props("file.encoding")
res4: String = UTF-8

and in the worksheet:

sys.props("file.encoding")                      //> res0: String = UTF-8

Also, from the prompt:

$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"

Finally, I ran this program:

package charset;

public class TestCharset {
  public static void main(String[] args) {
    System.out.println(System.getProperty("file.encoding"));
  }
}

and got this output:

$ java charset.TestCharset
UTF-8

So the default encoding seems to be the right one. But I must be missing something, since -Dfile.encoding=UTF8 made a difference for Eclipse.
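One way to narrow this down would be a slightly extended check that reports more than the system property. It prints the file.encoding property, the JVM's default charset, and the encoding a writer picks when no charset is passed. The class name is hypothetical, and the output depends on the JVM and how it was launched, so running it both in the Eclipse JVM and in the forked one would be the interesting comparison.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

// Illustrative sketch only; extends the TestCharset idea with two more data points.
final class CharsetReport {
    static String report() {
        String property = System.getProperty("file.encoding");
        String defaultCharset = Charset.defaultCharset().name();
        // Encoding chosen by a writer constructed without an explicit charset.
        String writerEncoding =
            new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding();
        return "file.encoding=" + property + "\n"
            + "defaultCharset=" + defaultCharset + "\n"
            + "defaultWriterEncoding=" + writerEncoding;
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```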

skogler commented 10 years ago

I am also getting this issue on a UTF-8 system. All files are correctly configured to use UTF-8. The line splitting in the worksheet messes up the output.