ow2-proactive / scheduling

Multi-platform Scheduling and Workflows Engine
http://www.activeeon.com/workflows-scheduling
GNU Affero General Public License v3.0
62 stars 54 forks source link

OOM under no load (dump added) #3282

Open febael opened 6 years ago

febael commented 6 years ago

Bonjour Mesdames et Messieurs,

We get OOM errors at server. dump file

"New I/O worker #2" prio=5 tid=16 RUNNABLE at java.lang.OutOfMemoryError.(OutOfMemoryError.java:48) at org.jboss.netty.buffer.HeapChannelBuffer.(HeapChannelBuffer.java:42) at org.jboss.netty.buffer.BigEndianHeapChannelBuffer.(BigEndianHeapChannelBuffer.java:34) at org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134) at org.jboss.netty.buffer.HeapChannelBufferFactory.getBuffer(HeapChannelBufferFactory.java:68) at org.jboss.netty.buffer.AbstractChannelBufferFactory.getBuffer(AbstractChannelBufferFactory.java:48) at org.objectweb.proactive.extensions.pnp.PNPServerFrameDecoder.getBuffer(PNPServerFrameDecoder.java:194) at org.objectweb.proactive.extensions.pnp.PNPServerFrameDecoder.decode(PNPServerFrameDecoder.java:138) at org.objectweb.proactive.extensions.pnp.PNPServerFrameDecoder.callDecode(PNPServerFrameDecoder.java:104) Local Variable: java.net.InetSocketAddress#157 at org.objectweb.proactive.extensions.pnp.PNPServerFrameDecoder.handleUpstream(PNPServerFrameDecoder.java:86) Local Variable: org.objectweb.proactive.extensions.pnp.PNPServerFrameDecoder#3 Local Variable: org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext#20 Local Variable: org.jboss.netty.buffer.BigEndianHeapChannelBuffer#6 at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) Local Variable: org.jboss.netty.channel.DefaultChannelPipeline#6 Local Variable: org.jboss.netty.channel.UpstreamMessageEvent#1 at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559) at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) Local Variable: org.jboss.netty.channel.socket.nio.NioAcceptedSocketChannel#3 at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) Local Variable: sun.nio.ch.SelectionKeyImpl#40 Local Variable: java.util.HashMap$KeyIterator#1 at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337) Local Variable: sun.nio.ch.EPollSelectorImpl#4 at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) Local Variable: org.jboss.netty.channel.socket.nio.NioWorker#3 at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) Local Variable: java.lang.String#8438 Local Variable: java.lang.String#8439 Local Variable: org.jboss.netty.util.ThreadRenamingRunnable#5 at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) Local Variable: org.jboss.netty.util.internal.DeadLockProofWorker$1#4 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) Local Variable: java.util.concurrent.ThreadPoolExecutor$Worker#5 at java.lang.Thread.run(Thread.java:748)

tobwiens commented 6 years ago

@febael Thank you very much for reporting this issue.

Your finding is very interesting. That message received on the network looks faulty, containing basically 0.

How much maximum heap memory do you give the java virtual machine when you start?

@fviale @luispinedamx What do you think about this memory issue?

tobwiens commented 6 years ago

@febael Sadly I cannot analyze your dump file. I downloaded it twice but I get following error: java.io.IOException: Unrecognized magic number: 529205248 at com.sun.tools.hat.internal.parser.Reader.readFile(Reader.java:94) at com.sun.tools.hat.Main.main(Main.java:159)

febael commented 6 years ago

@tobwiens hi,

There's no setting regarding the min/max memory.

Original huge dump file has the expected magic number : xxd java_pid6.hprof | head -1 00000000: 4a41 5641 2050 524f 4649 4c45 2031 2e30 JAVA PROFILE 1.0

Acting on the tar.gz file I get : xxd dump.tar.gz | head -1 00000000: 1f8b 0800 37e4 625b 0003 ec7c 7974 1377 ....7.b[...|yt.w echo $(( 16#1f8b0800 )) 529205248

So I assume you tried to directly open the tar.gz file :)

fviale commented 6 years ago

Hello @febael,

I analysed the heap dump you provided using Eclipse Memory Analyser and it does not show anything special. Like you said, the scheduler seems to be quite empty. No object occupying abnormally the memory (biggest retainer being the JVM class loader, but nothing alarming).

What you describe regarding the PNP message of 1GB is quite surprising though. A few questions below:

febael commented 6 years ago

Thank you @fviale for your analysis. I used the jvisualvm tool (should be shipped with every jdk installation), and I could get all thread dumps and objects in the heap with that.

Do you have any idea on how to monitor the case for more clues?

fviale commented 6 years ago

Thank you for your detailed explanation, and I indeed observes what you describe in JVisualVM.

Apparently, one frame decoded by PNPServerFrameDecoder (part of the low-level proactive network protocol) seems excessively big. The maximum frame size set in the code is simply MAXINT bytes (which is over 2GB).

A low-level PNP message can be many many things, and based on the content of the frame bytes only, it is nearly impossible to find out the origin of the message.

What I suggest is that you change the log level of the PNP protocol to TRACE: scheduler_server/config/log/server.properties

log4j.logger.proactive.pnp = TRACE log4j.logger.proactive.remoteobject = DEBUG

This will be very verbose, but whenever the problem happens again, we might be able to find the provenance of the last message.

febael commented 6 years ago

Thank you @fviale, done that and will inform you with findings.