We think this is due to an increase in the memory usage of the dotty build. @adriaanm planned to reduce the job limit on each Jenkins worker (the EC2 instances have 16 GB of RAM).
Partest currently uses _refArrayOps. We should remove the _-prefixed versions and make the original names implicit again with the new return types, as suggested in the comment. Let's do that in a separate PR, though, so we can clearly see why we're doing a bootstrap.
@adriaanm, at least, has still seen PR validation failures in the last few days, even after the number of executors per behemoth was reduced in 2a83d374635951cb3d7c64ee80daf5e44fbb1a53, including a strange EOF error, perhaps from a forked JVM that died without the right error reporting in place to tell us so.
https://scala-ci.typesafe.com/job/scala-2.12.x-integrate-bootstrap/ has failed several nights in a row now with this error. The failures started when we bumped STARR from M5 to RC1 in https://github.com/scala/scala/commit/7507765cfbca595c61f7c850e2a125071020e679, but that's probably a coincidence.
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000a0580000, 233832448, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 233832448 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/jenkins/workspace/scala-2.12.x-integrate-bootstrap/hs_err_pid26131.log
Exception in thread "Thread-33" java.io.EOFException
at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2626)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1321)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at sbt.React.react(ForkTests.scala:114)
at sbt.ForkTests$$anonfun$mainTestTask$1$Acceptor$2$.run(ForkTests.scala:74)
at java.lang.Thread.run(Thread.java:745)
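For what it's worth, the EOFException is consistent with the forked-JVM theory: sbt's Acceptor thread deserializes test events sent by the forked process, and if that process dies (e.g. from the native allocation failure above), the stream ends mid-read. A minimal self-contained sketch of that failure mode (illustrative only, not sbt's actual code):

```scala
import java.io._

// The "forked JVM" writes only the object-stream header, then dies before
// sending any test event; the reader's next readObject hits end-of-stream
// inside peekByte, exactly as in the stack trace above.
object EofSketch extends App {
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.flush() // header only; the fork is "killed" here
  val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
  try in.readObject()
  catch { case e: EOFException => println(s"got: $e") }
}
```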
Could also be because we started building 2.12.0 and 2.12.x simultaneously? I didn't check the parallelism setting, though.
> Could also be because we started building 2.12.0 and 2.12.x simultaneously
I think that's likely.
Lately, we have not seen the memory crash causing spurious PR validation failures.
But most scala-2.12.0-integrate-bootstrap and scala-2.12.x-integrate-bootstrap runs have been failing. The bootstrap jobs run on jenkins-worker-ubuntu-publish, not on the worker behemoths.
Does https://github.com/scala/scala-jenkins-infra/commit/2a83d374635951cb3d7c64ee80daf5e44fbb1a53 only affect the behemoths?
A manual 2.12.0-integrate-bootstrap run has jenkins-worker-ubuntu-publish all to itself: https://scala-ci.typesafe.com/job/scala-2.12.0-integrate-bootstrap/103/consoleFull
> Does 2a83d37 only affect the behemoths?

It does: lightWorker = publisher # TODO: better heuristic...
Assuming that test run passes, I'll try reducing the publisher nodes from 2 concurrent jobs to 1.
The 7.5 GiB RAM numbers listed at https://github.com/scala/scala-jenkins-infra/blob/master/doc/design.md seem low; I have 16 GB in my laptop.
I wonder if we could detect the error in our build scripts and grab some extra diagnostics (e.g. the memory used by each process: http://askubuntu.com/a/62351).
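A minimal sketch of what such a hook might look like (hypothetical MemoryDiagnostics helper; assumes a Linux worker with procps, along the lines of the askubuntu answer):

```scala
import scala.sys.process._

object MemoryDiagnostics {
  // When a build step fails with a memory-related error, log the top
  // memory consumers so we can see what actually ate the RAM.
  // ps reports RSS in KiB; --sort=-rss puts the biggest processes first.
  def dump(top: Int = 15): Unit = {
    val psOutput = Seq("ps", "-eo", "pid,rss,comm", "--sort=-rss").!!
    println(s"##### top $top memory consumers (RSS in KiB) #####")
    psOutput.linesIterator.take(top + 1).foreach(println) // +1 keeps the header row
  }
}
```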
Oh good, my manual run passed. That gives me hope that we can put this to rest (for now) by reducing the parallelism on the publishers too.
Sigh: https://github.com/scala/scala-jenkins-infra/pull/188 failed with some weird Chef error.
I set the executor count to 1 at https://scala-ci.typesafe.com/computer/jenkins-worker-ubuntu-publish/configure; maybe the manual setting will actually stick for a while if Chef is borked.
Several nights of green runs so far. Let's keep monitoring it.
My experience over the last week or so is that reducing the parallelism from 4 to 3 definitely helped, but it didn't get everything running smoothly either: bootstrap jobs failing randomly, community builds failing randomly, etc.
Though it can be difficult to distinguish X failing randomly because X is flaky from X failing randomly because our Jenkins config as a whole is flaky, my intuition is that:
1) at a minimum, it would be worth further reducing the parallel jobs from 3 to 2, waiting a week, and seeing whether overall flakiness levels drop
2) or perhaps we've put up with this long enough, and it's time to either give the nodes more RAM or make more nodes
@adriaanm and I chatted about it just now, and he wants to try option 2 soon.
Instead, I realized it would be easiest to change the EC2 instance type from c4.2xlarge to c4.4xlarge, which doubles the RAM to 30 GB and the cores to 8. Done for behemoth 1; still pending for behemoth 2.
https://scala-ci.typesafe.com/job/scala-2.12.0-validate-test/93/
# starting 63 tests in run
!! 1 - run/SI-4676.scala [compilation failed]
!! 2 - run/SI-4360.scala [compilation failed]
!! 3 - run/SI-4887.scala [compilation failed]
##### Log file '/home/jenkins/workspace/scala-2.12.0-validate-test/test/scaladoc/run/SI-4360-run.log' from failed test #####
error: Java heap space
##### Log file '/home/jenkins/workspace/scala-2.12.0-validate-test/test/scaladoc/run/SI-4887-run.log' from failed test #####
error: Java heap space
##### Log file '/home/jenkins/workspace/scala-2.12.0-validate-test/test/scaladoc/run/SI-4676-run.log' from failed test #####
error: Java heap space
😱
Some notes from https://github.com/scala/scala/pull/5430:
- the heap is set to -Xmx2G
- [ ] DRY up the config of the heap size between javaOptions and testOptions as much as makes sense, as per @adriaanm's suggestion:
testOptions in IntegrationTest += Tests.Argument(s"-Dpartest.java_opts=${(javaOptions in IntegrationTest).value.mkString(" ")}")
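A hedged sketch of how that could look in the sbt build (the javaOptions value here is illustrative; the real one lives in the build):

```scala
// Define the forked-test heap in exactly one place...
javaOptions in IntegrationTest ++= Seq("-Xmx2G")

// ...and forward the same flags to the forked partest JVMs,
// so the two settings can't drift apart.
testOptions in IntegrationTest += Tests.Argument(
  s"-Dpartest.java_opts=${(javaOptions in IntegrationTest).value.mkString(" ")}"
)
```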
Have we seen this lately?
I'll grep the logs tomorrow.
Things had been quiet in recent weeks (that's my informal impression, anyway), but there was a spate of failures today, e.g. https://scala-ci.typesafe.com/job/scala-2.12.x-validate-test/3528/, reported by @som-snytt.
Intermittent? "You keep using that word. I do not think it means what you think it means."
Added swap to the behemoths in https://github.com/scala/scala-jenkins-infra/commit/47a0d79
Quiet on this front lately, especially on the new, larger behemoths; and anyway, most stuff is moving to Travis CI + AppVeyor.
Seen by @soc, @odersky, and me intermittently in the past two days.