scala / scala-jenkins-infra

A Chef cookbook that manages Scala's CI infrastructure.
https://scala-ci.typesafe.com
Apache License 2.0

JVM crash with "There is insufficient memory for the Java Runtime Environment to continue" #181

Closed: retronym closed this issue 6 years ago

retronym commented 8 years ago
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 123207680 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:

Seen intermittently by @soc, @odersky and me in the past two days.

retronym commented 8 years ago

Example: https://scala-ci.typesafe.com/job/dotty-master-validate-partest-bootstrapped/953/

retronym commented 8 years ago

We think this is due to an increase in the memory usage by the dotty build. @adriaanm planned to reduce the job limit on each Jenkins worker (the EC2 instances have 16GB ram).
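
A rough way to reason about the job limit is to divide a worker's RAM by an estimated per-job footprint. The sketch below is illustrative only: the 16 GB total comes from the comment above, while the per-job figures and the 2 GiB reserved for the OS and Jenkins agent are assumptions, not measured values.

```python
def max_executors(total_ram_gib: float, per_job_gib: float, reserved_gib: float = 2.0) -> int:
    """Return how many concurrent jobs fit in RAM.

    `per_job_gib` and `reserved_gib` are hypothetical estimates; measure
    real jobs before relying on the result.
    """
    if per_job_gib <= 0:
        raise ValueError("per_job_gib must be positive")
    usable = total_ram_gib - reserved_gib
    # Floor division: a job that only partly fits does not count.
    return max(0, int(usable // per_job_gib))


if __name__ == "__main__":
    # 16 GiB worker with an assumed 4 GiB per job
    print(max_executors(16, 4))
```

If the per-job estimate grows (for example because a build starts using more memory), the same worker supports fewer executors, which is the adjustment discussed here.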

retronym commented 8 years ago

Partest currently uses _refArrayOps. We should remove the _ prefixed versions and make the original names implicit again with the new return types, as suggested in the comment. Let's do that in a separate PR, though, so we can clearly see why we're doing a bootstrap.

SethTisue commented 8 years ago

@adriaanm, at least, has still seen PR validation failure(s?) in the last few days, even after the number of executors per behemoth was reduced in 2a83d374635951cb3d7c64ee80daf5e44fbb1a53 — including a strange EOF error, perhaps from a forked JVM that died and we don't have the right error reporting in place?

SethTisue commented 8 years ago

https://scala-ci.typesafe.com/job/scala-2.12.x-integrate-bootstrap/ has failed several nights in a row now with this error. the failures started when we bumped STARR from M5 to RC1 in https://github.com/scala/scala/commit/7507765cfbca595c61f7c850e2a125071020e679 but that's probably a coincidence.

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000a0580000, 233832448, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 233832448 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/jenkins/workspace/scala-2.12.x-integrate-bootstrap/hs_err_pid26131.log
Exception in thread "Thread-33" java.io.EOFException
    at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2626)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1321)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
    at sbt.React.react(ForkTests.scala:114)
    at sbt.ForkTests$$anonfun$mainTestTask$1$Acceptor$2$.run(ForkTests.scala:74)
    at java.lang.Thread.run(Thread.java:745)
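
To compare failures across runs, it can help to pull the requested allocation size out of the `os::commit_memory` warning. This is a minimal sketch that assumes the warning always has the shape shown in the log above; the function name and the returned dictionary keys are made up for illustration.

```python
import re

# Matches the HotSpot warning format seen in the log above.
COMMIT_FAILURE = re.compile(
    r"os::commit_memory\((0x[0-9a-fA-F]+), (\d+), \d+\) failed; "
    r"error='([^']*)' \(errno=(\d+)\)"
)


def parse_commit_failure(line: str):
    """Return details of a failed commit_memory call, or None if the line does not match."""
    match = COMMIT_FAILURE.search(line)
    if match is None:
        return None
    size = int(match.group(2))
    return {
        "address": match.group(1),
        "bytes": size,
        "mib": size / (1024 * 1024),
        "error": match.group(3),
        "errno": int(match.group(4)),
    }


if __name__ == "__main__":
    sample = (
        "OpenJDK 64-Bit Server VM warning: INFO: "
        "os::commit_memory(0x00000000a0580000, 233832448, 0) failed; "
        "error='Cannot allocate memory' (errno=12)"
    )
    print(parse_commit_failure(sample))
```

For the line above, the failed request works out to 223 MiB, so the JVM was refused a fairly modest allocation. That points at the machine as a whole being short of memory rather than at one oversized request.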

adriaanm commented 8 years ago

Could also be because we started building 2.12.0 and 2.12.x simultaneously? Didn't check parallelism setting tho

SethTisue commented 8 years ago

> Could also be because we started building 2.12.0 and 2.12.x simultaneously

I think that's likely.

lately, we have not seen the memory crash causing spurious PR validation failures.

but most scala-2.12.0-integrate-bootstrap and scala-2.12.x-integrate-bootstrap runs have been failing. the bootstrap jobs run on jenkins-worker-ubuntu-publish, not on the worker behemoths.

does https://github.com/scala/scala-jenkins-infra/commit/2a83d374635951cb3d7c64ee80daf5e44fbb1a53 only affect the behemoths?

SethTisue commented 8 years ago

manual 2.12.0-integrate-bootstrap run, has jenkins-worker-ubuntu-publish all to itself: https://scala-ci.typesafe.com/job/scala-2.12.0-integrate-bootstrap/103/consoleFull

SethTisue commented 8 years ago

> does 2a83d37 only affect the behemoths?

it does: `lightWorker = publisher # TODO: better heuristic...`

SethTisue commented 8 years ago

assuming that test run passes, I'll try reducing publisher nodes from 2 to 1 concurrent jobs.

SethTisue commented 8 years ago

the 7.5 GiB RAM numbers listed at https://github.com/scala/scala-jenkins-infra/blob/master/doc/design.md seem low. I have 16 GB in my laptop

retronym commented 8 years ago

I wonder if we could detect the error in our build scripts and grab some extra diagnostics (e.g. memory used by each process: http://askubuntu.com/a/62351)
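
One possible shape for that check, as a sketch rather than anything wired into the actual build scripts: scan the build log for the crash banner and, if it is present, capture the processes with the largest resident memory. The function names are hypothetical, and the `ps` invocation assumes the Linux procps version of `ps`.

```python
import subprocess

# Banner the JVM prints when a native allocation fails (copied from the logs in this thread).
OOM_MARKER = "There is insufficient memory for the Java Runtime Environment to continue"


def log_has_jvm_oom(log_text: str) -> bool:
    """Return True if the build log contains the JVM out-of-memory banner."""
    return OOM_MARKER in log_text


def top_memory_processes(limit: int = 10) -> str:
    """Return the header plus the `limit` processes with the largest resident set size."""
    result = subprocess.run(
        ["ps", "-eo", "pid,rss,comm", "--sort=-rss"],
        capture_output=True,
        text=True,
        check=True,
    )
    lines = result.stdout.splitlines()
    return "\n".join(lines[: limit + 1])  # first line is the column header


def diagnose(log_text: str) -> str:
    """Return process diagnostics if the log shows the crash, else an empty string."""
    if not log_has_jvm_oom(log_text):
        return ""
    return top_memory_processes()
```

The diagnostics are only useful if they are collected on the same worker right after the failure, before other jobs start or finish and change the picture.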

SethTisue commented 8 years ago

oh good, my manual run passed, that gives me hope that we can put this to rest (for now) by reducing the parallelism on the publishers too

SethTisue commented 8 years ago

sigh, https://github.com/scala/scala-jenkins-infra/pull/188 failed with some weird Chef error

SethTisue commented 8 years ago

I set the executors count to 1 at https://scala-ci.typesafe.com/computer/jenkins-worker-ubuntu-publish/configure , maybe the manual setting will actually stick for a while if Chef is borked

SethTisue commented 8 years ago

several nights of green runs so far. let's keep monitoring it

SethTisue commented 8 years ago

I would say my experience in the last week or so has been that reducing the parallelism from 4 to 3 definitely helped, but didn't get everything running smoothly, either. bootstrap jobs randomly failing, community builds randomly failing, etc.

though, it can be difficult to distinguish X failing randomly because X is flaky, and X failing randomly because our Jenkins config as a whole is flaky. but my intuition is that

1) at minimum it would be worth further reducing the parallel jobs from 3 to 2, waiting a week, and seeing whether overall flakiness levels drop
2) or perhaps we've put up with this long enough and it's time to either give the nodes more RAM or make more nodes

@adriaanm and I chatted about it just now and he wants to try 2 soon

adriaanm commented 8 years ago

Instead, I realized it would be easiest to change the EC2 instance type from c4.2xlarge to c4.4xlarge, which doubles the ram to 30 GB and cores to 8. Done for behemoth 1, still pending for number 2.

SethTisue commented 8 years ago

https://scala-ci.typesafe.com/job/scala-2.12.0-validate-test/93/

# starting 63 tests in run
!!  1 - run/SI-4676.scala                         [compilation failed]
!!  2 - run/SI-4360.scala                         [compilation failed]
!!  3 - run/SI-4887.scala                         [compilation failed]
##### Log file '/home/jenkins/workspace/scala-2.12.0-validate-test/test/scaladoc/run/SI-4360-run.log' from failed test #####

error: Java heap space

##### Log file '/home/jenkins/workspace/scala-2.12.0-validate-test/test/scaladoc/run/SI-4887-run.log' from failed test #####

error: Java heap space

##### Log file '/home/jenkins/workspace/scala-2.12.0-validate-test/test/scaladoc/run/SI-4676-run.log' from failed test #####

error: Java heap space

😱

retronym commented 8 years ago

Some notes from https://github.com/scala/scala/pull/5430:

SethTisue commented 7 years ago

have we seen this lately?

adriaanm commented 7 years ago

I'll grep the logs tomorrow.

SethTisue commented 7 years ago

things had been quiet in recent weeks (is my informal impression), but there was a spate of failures today, e.g. https://scala-ci.typesafe.com/job/scala-2.12.x-validate-test/3528/, reported by @som-snytt

som-snytt commented 7 years ago

Intermittent? "You keep using that word. I do not think it means what you think it means."

adriaanm commented 7 years ago

added swap to behemoths in https://github.com/scala/scala-jenkins-infra/commit/47a0d79
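
To confirm on a worker that swap is actually configured, one option is to read `/proc/meminfo`. This is a minimal sketch assuming a Linux host; the helper names are hypothetical and the sample values in the usage block are invented for illustration.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into a {field: kibibytes} mapping."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Field: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def has_swap(meminfo: dict) -> bool:
    """Return True if the parsed meminfo reports a non-zero SwapTotal."""
    return meminfo.get("SwapTotal", 0) > 0


if __name__ == "__main__":
    # Invented sample; on a real worker, read the text from /proc/meminfo instead.
    sample = "MemTotal:       31457280 kB\nSwapTotal:       4194304 kB\nSwapFree:        4194304 kB\n"
    print(has_swap(parse_meminfo(sample)))
```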

SethTisue commented 6 years ago

quiet on this front lately, especially on the new larger behemoths, and anyway most stuff is moving to Travis-CI+AppVeyor

sundharsk commented 5 years ago

[image: JRE_error]