scalacenter / bloop

Bloop is a build server and CLI tool to compile, test and run Scala fast from any editor or build tool.
https://scalacenter.github.io/bloop/
Apache License 2.0
907 stars 202 forks source link

Test runner that sets System.err causes Bloop server to become unresponsive #1155

Open olafurpg opened 4 years ago

olafurpg commented 4 years ago

Using a native-image of bloopgun, I'm unable to run bloop exit

❯ bloop exit
error: Unexpected error forces client exit!
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:397)
    at bloop.shaded.snailgun.protocol.Protocol.$anonfun$processChunkFromServer$1(Protocol.scala:181)
    at scala.util.Try$.apply(Try.scala:213)
    at bloop.shaded.snailgun.protocol.Protocol.processChunkFromServer(Protocol.scala:180)
    at bloop.shaded.snailgun.protocol.Protocol.sendCommand(Protocol.scala:108)
    at bloop.shaded.snailgun.TcpClient.run(TcpClient.scala:34)
    at bloop.bloopgun.BloopgunCli.executeCmd$1(Bloopgun.scala:255)
    at bloop.bloopgun.BloopgunCli.fireCommand(Bloopgun.scala:261)
    at bloop.bloopgun.BloopgunCli.run(Bloopgun.scala:216)
    at bloop.bloopgun.Bloopgun$.main(Bloopgun.scala:560)
    at bloop.bloopgun.Bloopgun.main(Bloopgun.scala)

After running this command once, I'm unable to communicate with the Bloop server with other commands

❯ bloop help
error: Unexpected error forces client exit!
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:397)
    at bloop.shaded.snailgun.protocol.Protocol.$anonfun$processChunkFromServer$1(Protocol.scala:181)
    at scala.util.Try$.apply(Try.scala:213)
    at bloop.shaded.snailgun.protocol.Protocol.processChunkFromServer(Protocol.scala:180)
    at bloop.shaded.snailgun.protocol.Protocol.sendCommand(Protocol.scala:108)
    at bloop.shaded.snailgun.TcpClient.run(TcpClient.scala:34)
    at bloop.bloopgun.BloopgunCli.executeCmd$1(Bloopgun.scala:255)
    at bloop.bloopgun.BloopgunCli.fireCommand(Bloopgun.scala:261)
    at bloop.bloopgun.BloopgunCli.run(Bloopgun.scala:216)
    at bloop.bloopgun.Bloopgun$.main(Bloopgun.scala:560)
    at bloop.bloopgun.Bloopgun.main(Bloopgun.scala)
olafurpg commented 4 years ago

The issue looks related to native-image, I am unable to reproduce with the JVM version of bloopgun


$ coursier launch ch.epfl.scala:bloopgun_2.12:1.4.0-RC1 --main-class bloop.bloopgun.Bloopgun -- exit
``
jvican commented 4 years ago

Thanks for reporting! I'm not sure why this might be happening. I would expect bloopgun to be able to communicate with the server even if bloop exit fails.

jvican commented 4 years ago

I've just tried to reproduce this issue and couldn't. I'm using GraalVM 19.2.1 (/Library/Java/JavaVirtualMachines/graalvm-ce-19.2.1/Contents/Home/jre).

olafurpg commented 4 years ago

I'm unable to reproduce this with a single bloop exit command but I am still able to consistently reproduce this when running the same failing test. The problem seems unrelated to native-image, since I can reproduce it with both a standalone JVM bloopgun binary as well as bloop.py.

❯ bloop help # starts new server
❯ bloop test <redacted>
...
===============================================
Total duration: 0ms
1 failed

Failed:
- ...
  * null - java.lang.ExceptionInInitializerError: null
===============================================

After the test fails I'm no longer able to run any bloop command

❯ bloop help
error: Unexpected error forces client exit!
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at bloop.shaded.snailgun.protocol.Protocol.$anonfun$processChunkFromServer$1(Protocol.scala:181)
    at scala.util.Try$.apply(Try.scala:213)
    at bloop.shaded.snailgun.protocol.Protocol.processChunkFromServer(Protocol.scala:180)
    at bloop.shaded.snailgun.protocol.Protocol.sendCommand(Protocol.scala:108)
    at bloop.shaded.snailgun.TcpClient.run(TcpClient.scala:34)
    at bloop.bloopgun.BloopgunCli.executeCmd$1(Bloopgun.scala:255)
    at bloop.bloopgun.BloopgunCli.fireCommand(Bloopgun.scala:261)
    at bloop.bloopgun.BloopgunCli.run(Bloopgun.scala:216)
    at bloop.bloopgun.Bloopgun$.main(Bloopgun.scala:560)
    at bloop.bloopgun.Bloopgun.main(Bloopgun.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at coursier.bootstrap.launcher.a.a(Unknown Source)
    at coursier.bootstrap.launcher.ResourcesLauncher.main(Unknown Source)

I'm only able to stop the Bloop server with the following command

❯ kill $(jps -l | grep bloop.Server | awk '{ print $1 }')

When I reproduce with bloop.py I get a different error message:

❯ ~/dev/metals/metals/src/main/resources/bloop.py test ...
...
Failed:
- ...
  * null - java.lang.ExceptionInInitializerError: null
===============================================
❯ ~/dev/metals/metals/src/main/resources/bloop.py help
Server unexpectedly disconnected in recv_into()%

Adding --verbose to the test command doesn't reveal any interesting information from what I can see

[D] Runner for com.geirsson.junit.PantsFramework produced 1 initial tasks for 1 tests.
[D]   Running TaskDef(<redacted>, sbt.ForkMain$SubclassFingerscan@3f7bdc73, false, [SuiteSelector])
2020-01-27 15:43:47,976 INFO                   c.t.u.l.Slf4jBridgeUtility$ org.slf4j.bridge.SLF4JBridgeHandler installed.
...
==> X null  0.0s java.lang.ExceptionInInitializerError: null
Caused by: java.lang.NullPointerException
[D]     Produced 0 nested tasks and 1 events.
Execution took 0ms
1 tests, 1 failed

[D] Test server has been successfully closed.
[D] The process is closed. Emptying buffer...
[D] Forked process exited with code: 0
[D] Process 71263 exited with code: 0
[D] Cleaning up temporary MANIFEST jar: /var/folders/y5/gbz6gd297z57_5bg0nm589nr0000gp/T/jvm-forker-manifest7835550152620902158.jar
===============================================
Total duration: 0ms
1 failed

Failed:
- <redacted>
  * null - java.lang.ExceptionInInitializerError: null
===============================================
[D] Test suites failed: true
[D] Elapsed: 11725.022629 ms
olafurpg commented 4 years ago

I'm able to reproduce with 1.4.0-RC1 as well as 1.4.0-RC1+16-1cd95092

olafurpg commented 4 years ago

I may have tracked down the root cause by running bloop console PROJECT_NAME

bloop console <redacted>
Loading...
Compiling (synthetic)/ammonite/predef/interpBridge.sc
scala.reflect.internal.FatalError: Error accessing /Users/lgeirsson/workspace/source/.pants.d/export-dep-as-jar/export-dep-as-jar/252d64521cf9/tmpn0s2b0vm.jar
  scala.tools.nsc.classpath.AggregateClassPath.$anonfun$list$3(AggregateClassPath.scala:99)

This Bloop JSON file contains jars on the classpath, which don't exist. This JSON file was created with an older version of the Pants exporter. I am unable to reproduce when using the latest version of the Pants exporter (which copies over jars and uses an immutable classpath).

I believe Bloop should still be able to gracefully handle this situation, maybe by reporting a warning under --verbose when the classpath contains non-existing entries.

olafurpg commented 4 years ago

A minimized reproduction can be found here https://github.com/olafurpg/intellij-bsp-pants/issues/11

Following up from an offline investigation, @jvican discovered that the Pants test runner is setting System.err which (to my understanding) caused an exception in Nailgun that shuts down the IO scheduler in Bloop causing the unresponsiveness. I can confirm that I'm unable to reproduce the error by changing the Pants test runner to only set System.err during test execution and setting System.err back to the old value in a finally block.

jvican commented 4 years ago

Thanks for the follow-up and the reproduction, I'm happy we tracked this issue down!

Given that System.setErr could be triggered by classloading the test jars, I believe we should add some logic to bloop to reset back streams right after classloading is done. Nailgun needs the streams to be instances of their concrete thread-local stream classes so that it can redirect in, out and err directly to the socket streams of every client, even when several clients run at the same time.

The implementation of the fix on the bloop side should account for the fact that test classloading can happen concurrently when parallel execution is enabled.