Closed: jklmli closed this issue 11 years ago
After running for a period of time (< 10 hours), the dashboard becomes "empty": there are no numbers or graphs displayed anywhere, although the dashboard is still fully responsive.
I recently switched from OpenJDK to Oracle JDK, and I'm not seeing this anymore.
Do you get any exceptions? Could you please attach the startup log?
Re: running the Console for a period of time: the Console's focus has shifted from being a production monitoring tool to being more developer-focused. As such, it is not intended to run for longer periods of time. I will have a look at the issue when there is some time available, but it's not the primary focus at the moment.
Can I ask why the focus has shifted?
As I understood it, the advantage of the paid version was to allow for persisted logs over a long period of time.
Yes, you're correct in that the licensed version includes persistence to MongoDB. This functionality is still available in the 1.3 series.
Being quite a small team, we have to focus on one thing, and the decision has been made to concentrate on the developer experience rather than production monitoring. We will collaborate with partners in the latter area.
More information from typesafe.com about this:
Initially, Console was envisioned as a tool to be used exclusively by Operations Staff for Production Monitoring; however, the overwhelming response from users has been that it actually has more applicability and value during application development, leaving the production monitoring aspects to the vendors already embedded in companies' infrastructures.
There are no exceptions on startup.
I'm using atmos in conjunction with spray. A lot of actors are being created (~1 every 6 seconds - they later die when the connection is closed).
I've noticed that the dashboard becomes less responsive as the actor count increases (this includes dead actors). Maybe the reason the dashboard looks 'empty' is that it's taking a long time to render the UI because it's pulling down so much data?
If I restart, the dashboard suddenly comes up very fast and is responsive.
I should mention: I have another instance with atmos that runs fine; I'm guessing this is because it has a constant number of actors (4 or so).
I took a look at the WebSocket connection atmos is listening on. It seems to do fine until the length of the response hits ~212,000, at which point it starts throwing 500s.
Ah...I think the issue is atmos is running out of memory.
2013-10-18 17:19:37,460 WARN [akka://query/user/IO-HTTP/listener-0/424] [akka://query/user/IO-HTTP/listener-0/424] [query-akka.actor.default-dispatcher-7] : Configured registration timeout of 1 second expired, stopping
2013-10-18 17:19:37,460 WARN [akka://query/user/IO-HTTP/listener-0/426] [akka://query/user/IO-HTTP/listener-0/426] [query-akka.actor.default-dispatcher-7] : Configured registration timeout of 1 second expired, stopping
2013-10-18 17:19:37,461 WARN [akka://query/user/IO-HTTP/listener-0/425] [akka://query/user/IO-HTTP/listener-0/425] [query-akka.actor.default-dispatcher-7] : Configured registration timeout of 1 second expired, stopping
2013-10-18 17:19:37,461 WARN [akka://query/user/IO-HTTP/listener-0/429] [akka://query/user/IO-HTTP/listener-0/429] [query-akka.actor.default-dispatcher-7] : Configured registration timeout of 1 second expired, stopping
2013-10-18 17:19:37,461 WARN [akka://query/user/IO-HTTP/listener-0/427] [akka://query/user/IO-HTTP/listener-0/427] [query-akka.actor.default-dispatcher-7] : Configured registration timeout of 1 second expired, stopping
2013-10-18 17:19:51,112 WARN [akka://query/user/IO-HTTP/listener-0/428] [akka://query/user/IO-HTTP/listener-0/428] [query-akka.actor.default-dispatcher-7] : Configured registration timeout of 1 second expired, stopping
2013-10-18 17:20:04,847 ERROR [Jg] [akka://query/user/gatewayActor/metadataStatsResource] [query-akka.actor.default-dispatcher-8] : Futures timed out after [5 seconds]
java.util.concurrent.TimeoutException: Futures timed out after [5 seconds]
at aKK.a(atmos:96) ~[atmos-dev-1.3.1.jar:na]
at aKK.c(atmos:100) ~[atmos-dev-1.3.1.jar:na]
at aJw.apply(atmos:107) ~[atmos-dev-1.3.1.jar:na]
at gV.a(atmos:173) ~[atmos-dev-1.3.1.jar:na]
at aKo.a(atmos:3640) [atmos-dev-1.3.1.jar:na]
at gU.a(atmos:171) ~[atmos-dev-1.3.1.jar:na]
at aJu.b(atmos:107) ~[atmos-dev-1.3.1.jar:na]
at JC.f(atmos:273) ~[atmos-dev-1.3.1.jar:na]
at JC.g(atmos:296) ~[atmos-dev-1.3.1.jar:na]
at Jg.b(atmos:104) ~[atmos-dev-1.3.1.jar:na]
at Ls.a(atmos:13) ~[atmos-dev-1.3.1.jar:na]
at i.a(atmos:425) [atmos-dev-1.3.1.jar:na]
at i.b(atmos:386) [atmos-dev-1.3.1.jar:na]
at gE.a(atmos:230) [atmos-dev-1.3.1.jar:na]
at gE.run(atmos:212) [atmos-dev-1.3.1.jar:na]
at gC.b(atmos:506) [atmos-dev-1.3.1.jar:na]
at aKv.d(atmos:260) [atmos-dev-1.3.1.jar:na]
at aKu.d(atmos:1339) [atmos-dev-1.3.1.jar:na]
at aKo.b(atmos:1979) [atmos-dev-1.3.1.jar:na]
at aKA.run(atmos:107) [atmos-dev-1.3.1.jar:na]
2013-10-18 17:20:42,221 ERROR [Sh] [akka://query/user/gatewayActor/licenseActor] [query-akka.actor.default-dispatcher-8] : Futures timed out after [5 seconds]
java.util.concurrent.TimeoutException: Futures timed out after [5 seconds]
at aKK.a(atmos:96) ~[atmos-dev-1.3.1.jar:na]
at aKK.c(atmos:100) ~[atmos-dev-1.3.1.jar:na]
at aJw.apply(atmos:107) ~[atmos-dev-1.3.1.jar:na]
at gV.a(atmos:173) ~[atmos-dev-1.3.1.jar:na]
at aKo.a(atmos:3640) [atmos-dev-1.3.1.jar:na]
at gU.a(atmos:171) ~[atmos-dev-1.3.1.jar:na]
at aJu.b(atmos:107) ~[atmos-dev-1.3.1.jar:na]
at JC.f(atmos:273) ~[atmos-dev-1.3.1.jar:na]
at JC.c(atmos:151) ~[atmos-dev-1.3.1.jar:na]
at Sh.p(atmos:59) ~[atmos-dev-1.3.1.jar:na]
at Si.a(atmos:40) ~[atmos-dev-1.3.1.jar:na]
at i.a(atmos:425) [atmos-dev-1.3.1.jar:na]
at i.b(atmos:386) [atmos-dev-1.3.1.jar:na]
at gE.a(atmos:230) [atmos-dev-1.3.1.jar:na]
at gE.run(atmos:212) [atmos-dev-1.3.1.jar:na]
at gC.b(atmos:506) [atmos-dev-1.3.1.jar:na]
at aKv.d(atmos:260) [atmos-dev-1.3.1.jar:na]
at aKu.d(atmos:1339) [atmos-dev-1.3.1.jar:na]
at aKo.b(atmos:1979) [atmos-dev-1.3.1.jar:na]
at aKA.run(atmos:107) [atmos-dev-1.3.1.jar:na]
2013-10-18 17:23:49,735 ERROR [Jg] [akka://query/user/gatewayActor/metadataStatsResource] [query-akka.actor.default-dispatcher-13] : Futures timed out after [5 seconds]
java.util.concurrent.TimeoutException: Futures timed out after [5 seconds]
at aKK.a(atmos:96) ~[atmos-dev-1.3.1.jar:na]
at aKK.c(atmos:100) ~[atmos-dev-1.3.1.jar:na]
at aJw.apply(atmos:107) ~[atmos-dev-1.3.1.jar:na]
at gV.a(atmos:173) ~[atmos-dev-1.3.1.jar:na]
at aKo.a(atmos:3640) [atmos-dev-1.3.1.jar:na]
at gU.a(atmos:171) ~[atmos-dev-1.3.1.jar:na]
at aJu.b(atmos:107) ~[atmos-dev-1.3.1.jar:na]
at JC.f(atmos:273) ~[atmos-dev-1.3.1.jar:na]
at JC.g(atmos:296) ~[atmos-dev-1.3.1.jar:na]
at Jg.b(atmos:104) ~[atmos-dev-1.3.1.jar:na]
at Ls.a(atmos:13) ~[atmos-dev-1.3.1.jar:na]
at i.a(atmos:425) [atmos-dev-1.3.1.jar:na]
at i.b(atmos:386) [atmos-dev-1.3.1.jar:na]
at gE.a(atmos:230) [atmos-dev-1.3.1.jar:na]
at gE.run(atmos:212) [atmos-dev-1.3.1.jar:na]
at gC.b(atmos:506) [atmos-dev-1.3.1.jar:na]
at aKv.d(atmos:260) [atmos-dev-1.3.1.jar:na]
at aKu.d(atmos:1339) [atmos-dev-1.3.1.jar:na]
at aKo.b(atmos:1979) [atmos-dev-1.3.1.jar:na]
at aKA.run(atmos:107) [atmos-dev-1.3.1.jar:na]
2013-10-18 17:23:49,736 ERROR [U] [ActorSystem(query)] [query-akka.actor.default-dispatcher-7] : Uncaught error from thread [query-akka.actor.default-dispatcher-7] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled
java.lang.OutOfMemoryError: GC overhead limit exceeded
at aCr.c(atmos:241) ~[atmos-dev-1.3.1.jar:na]
at aCr.c(atmos:227) ~[atmos-dev-1.3.1.jar:na]
at aCr.c(atmos:227) ~[atmos-dev-1.3.1.jar:na]
at aCm.g(atmos:56) ~[atmos-dev-1.3.1.jar:na]
at aCm.d(atmos:33) ~[atmos-dev-1.3.1.jar:na]
at aIR.c(atmos:24) ~[atmos-dev-1.3.1.jar:na]
at aIR.h_(atmos:22) ~[atmos-dev-1.3.1.jar:na]
at aBm.a(atmos:48) ~[atmos-dev-1.3.1.jar:na]
at aBm.apply(atmos:48) ~[atmos-dev-1.3.1.jar:na]
at aCC.l(atmos:318) ~[atmos-dev-1.3.1.jar:na]
at aBn.a(atmos:48) ~[atmos-dev-1.3.1.jar:na]
at aIR.b(atmos:22) ~[atmos-dev-1.3.1.jar:na]
at ayN.a(atmos:629) ~[atmos-dev-1.3.1.jar:na]
at atd.a(atmos:105) ~[atmos-dev-1.3.1.jar:na]
at azb.h(atmos:267) ~[atmos-dev-1.3.1.jar:na]
at atd.K_(atmos:105) ~[atmos-dev-1.3.1.jar:na]
at Rb.a(atmos:36) ~[atmos-dev-1.3.1.jar:na]
at JV.a(atmos:271) ~[atmos-dev-1.3.1.jar:na]
at JV.apply(atmos:271) ~[atmos-dev-1.3.1.jar:na]
at aKH.b(atmos:24) ~[atmos-dev-1.3.1.jar:na]
at aKH.run(atmos:24) ~[atmos-dev-1.3.1.jar:na]
at hk.run(atmos:137) ~[atmos-dev-1.3.1.jar:na]
at gC.b(atmos:506) [atmos-dev-1.3.1.jar:na]
at aKv.d(atmos:260) [atmos-dev-1.3.1.jar:na]
at aKu.d(atmos:1339) [atmos-dev-1.3.1.jar:na]
at aKo.b(atmos:1979) [atmos-dev-1.3.1.jar:na]
at aKA.run(atmos:107) [atmos-dev-1.3.1.jar:na]
2013-10-18 17:24:02,487 WARN [akka://query/user/IO-HTTP/listener-0/441] [akka://query/user/IO-HTTP/listener-0/441] [query-akka.actor.default-dispatcher-13] : Configured registration timeout of 1 second expired, stopping
2013-10-18 17:24:02,487 WARN [akka://query/user/IO-HTTP/listener-0/440] [akka://query/user/IO-HTTP/listener-0/440] [query-akka.actor.default-dispatcher-13] : Configured registration timeout of 1 second expired, stopping
2013-10-18 17:24:02,488 WARN [akka://query/user/IO-HTTP/listener-0/442] [akka://query/user/IO-HTTP/listener-0/442] [query-akka.actor.default-dispatcher-9] : Configured registration timeout of 1 second expired, stopping
I stopped tracing the spray HTTP actors and everything is working great now!
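In case it helps anyone else, turning off tracing for specific actors can apparently be done with a traceable block alongside the sampling one. A rough sketch (the spray actor path below is illustrative, and the exact key is worth checking against the Console docs):
atmos {
  trace {
    traceable {
      # don't trace the short-lived spray connection actors (illustrative path)
      "/user/IO-HTTP/*" = off
      # keep tracing everything else
      "*" = on
    }
  }
}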
Great to hear.
If you do need to increase the memory for Atmos to handle the amount of data, you can do this with the atmosJvmOptions setting. For example:
AtmosKeys.atmosJvmOptions in Atmos := Seq("-Xms2G", "-Xmx2G")
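For reference, a minimal build.sbt sketch with that setting in place might look like the following (this assumes the sbt-atmos plugin is already added in project/plugins.sbt; the exact import may vary between plugin versions):
import com.typesafe.sbt.SbtAtmos.{Atmos, AtmosKeys, atmosSettings}

// bring in the Atmos/Console settings provided by the plugin
// (on older sbt versions this may need to be written as seq(atmosSettings: _*))
atmosSettings

// give the Atmos process a 2 GB heap so it can keep up with the trace volume
AtmosKeys.atmosJvmOptions in Atmos := Seq("-Xms2G", "-Xmx2G")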
I have a similar issue, but I don't see these memory exceptions (I've set the atmosJvmOptions). I am also running on OpenJDK, but can't easily switch to Oracle.
During system start, the console works fine and monitors things OK for a few minutes, then it starts returning "empty" pages. That is, I can move between pages, but the numbers/graphs are empty, the Actors list is empty, etc.
From looking at my logs, I can see messages are still being passed around, but nothing shows on the console. I am running a system with "many" actors (around 3K reported by the console before it stops); could this be the issue? When the system is quiet, with few actors running and few messages, the console stays up longer.
I have tried to put:
atmos.trace.sampling {
  "*" = 5000
}
in application.conf, but it doesn't make a difference.
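For what it's worth, the fully expanded form of that block in application.conf would be something like the following (the per-path entry is illustrative; as far as I can tell, a sampling value of N records roughly 1 in N message flows, so this only thins the data rather than switching tracing off for an actor):
atmos {
  trace {
    sampling {
      # illustrative: sample a high-churn actor path even more sparsely
      "/user/IO-HTTP/*" = 10000
      # everything else
      "*" = 5000
    }
  }
}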
Have you tried inspecting heap usage with jvisualvm? Atmos is composed of a backend and a Netty server. It's possible the backend has died or effectively run out of memory, which would explain why the dashboard seems to be empty.
I found that if I wait long enough, the console might come back for a short time (less than a minute) and then drop again. What I've found at that point is that I have many deviations, since I simply ignore a great many (~7K) messages and they end up unhandled. Maybe Atmos just fills up with these?
The app has many actors, but not much memory or CPU usage. From visualvm profiling, the app without Atmos doesn't go over 500MB; it can get up to 1GB before GC, but that's after a few hours of running, and Atmos dies after just a few minutes. With Atmos running, it goes up to 5GB before GC, but the JVM is set to max out at 17GB (that's for my app). The AtmosDev process seems to stay constant at just under 70MB of its 1GB JVM limit.
I have tried putting AtmosKeys.atmosJvmOptions in Atmos := Seq("-Xms2G", "-Xmx2G"), but it doesn't seem to be taken into account. Where should I put this exactly?
It could very well be that Atmos is using that amount of memory. You seem to have quite a lot going on in your app, so the number of trace events generated during tracing will be large, hence the heavy GC-ing you see in the heap image above.
Since the Console (UI) seems to hang, can you also check whether the REST API is responsive? You can do so by pointing your browser to: http://localhost:8660/monitoring/metadata?rolling=20minutes Also, could you perhaps post the result of the above call so I get a sense of the load on your traced application?
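If it's easier to capture that from the sbt console or a Scala REPL than from a browser, a quick sketch using only the standard library (same URL as above):
// check that the Console REST API still responds and print the metadata JSON
val metadata = scala.io.Source.fromURL(
  "http://localhost:8660/monitoring/metadata?rolling=20minutes").mkString
println(metadata)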
Cheers