orientechnologies / orientdb

OrientDB is the most versatile DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models in one Multi-Model product. OrientDB can run distributed (Multi-Master), supports SQL, ACID Transactions, Full-Text indexing and Reactive Queries.
https://orientdb.dev
Apache License 2.0

(Customer 39) How to Fully Stop Enterprise Agent and its Profiler from consuming Java heap memory? #2661

Closed: lloydchang closed this issue 10 years ago

lloydchang commented 10 years ago

To @lvca @enisher @laa Cc @henryzhao81 @mattaylor @pmoorhead @hcmwork

(Also sent this to Luca by e-mail.)

High-priority questions (this issue is severely impacting us):

  1. How do we fully stop the Enterprise Agent and its Profiler from consuming Java heap memory?
  2. How do we fully disable the Enterprise Agent and its Profiler while the Java VM is running and performing garbage collection?
  3. How do we fully disable the Enterprise Agent and its Profiler so that they no longer consume Java heap memory?
  4. Do we have to change the server configuration XML file from `<entry name="profiler.enabled" value="true"/>` to `<entry name="profiler.enabled" value="false"/>`, then stop the Java VM and OrientDB server and start them again?
  5. The Enterprise Agent jar/zip file was already removed from the plugins directory, and the OrientDB server log reported that the agent was unloaded. However, we are seeing excessive Java heap usage and garbage collection activity. After taking Java VM thread and heap dumps and analyzing them, we see that the Enterprise Agent is still consuming heap memory and is plausibly causing frequent garbage collection (details below). Why is this happening?
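Question 4 hinges on which value of `profiler.enabled` the server actually reads from its configuration file. As a minimal, JDK-only Java sketch (the class name `ProfilerConfigCheck` and the `<orient-server>`/`<properties>` layout in the sample are assumptions to check against your own file), the program below extracts that entry's value. It only inspects the XML and says nothing about whether a restart is required for a change to take effect.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: looks up <entry name="..." value="..."/> elements in a
// server configuration document and returns the value for a given name.
class ProfilerConfigCheck {

    /** Returns the value attribute of the first matching entry, or null if none. */
    static String entryValue(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList entries = doc.getElementsByTagName("entry");
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                if (name.equals(entry.getAttribute("name"))) {
                    return entry.getAttribute("value");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed shape of the relevant fragment; compare with your own config file.
        String xml = "<orient-server><properties>"
                + "<entry name=\"profiler.enabled\" value=\"false\"/>"
                + "</properties></orient-server>";
        System.out.println("profiler.enabled = " + entryValue(xml, "profiler.enabled"));
        // prints: profiler.enabled = false
    }
}
```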

Details: Today we are seeing a problem on our OrientDB server: its load average is over 6 and the server is extremely slow. Even a single record insertion takes 1 to 2 seconds or more, which is totally abnormal. Even after database requests stop coming in, the load average stays at about 5. Linux top -H shows 8 Java threads that have been running for 2 days without terminating and are consuming 60% of total CPU. A Java thread dump showed that those 8 threads are the garbage collection threads. We then took a Java heap dump to find out what kind of objects are in the heap. The Memory Analyzer Tool (MAT) results follow: about 1.4 GB of objects are held in the Java heap, and almost 94% of the heap is retained by a single com.orientechnologies.agent.profiler.OEnterpriseProfiler instance. The issue looks clear: somehow the Enterprise Agent is not fully disabled, and we need to find out why and stop it from generating objects that consume Java heap memory.
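The diagnosis above used top -H and a thread dump to find busy GC threads. A complementary in-process check, sketched here with the standard `java.lang.management` API (the class name `GcSampler` and the one-second sampling loop are illustrative assumptions, not something used in this thread), samples heap usage together with cumulative GC counts and time. Collection time climbing steadily while used heap stays close to the maximum matches the symptoms described.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Illustrative in-process sampler (hypothetical class name): reports heap
// usage and cumulative garbage-collection activity of the current JVM.
class GcSampler {

    /** Total number of collections across all collectors (undefined values of -1 are skipped). */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Total accumulated collection time in milliseconds across all collectors. */
    static long totalCollectionTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime();
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    /** One-line summary of heap usage and GC activity so far. */
    static String snapshot() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return String.format("heap used=%d MB, max=%d MB, collections=%d, gc time=%d ms",
                heap.getUsed() >> 20, heap.getMax() >> 20,
                totalCollections(), totalCollectionTimeMs());
    }

    public static void main(String[] args) throws InterruptedException {
        // Take a few samples one second apart; GC time growing fast while used
        // heap stays near max after collections suggests something retains the heap.
        for (int i = 0; i < 3; i++) {
            System.out.println(snapshot());
            Thread.sleep(1000);
        }
    }
}
```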

[Pie chart: Problem Suspect 1 and remainder]

Problem Suspect 1

One instance of "com.orientechnologies.agent.profiler.OEnterpriseProfiler" loaded by "java.net.URLClassLoader @ 0x9973fd28" occupies 1,550,022,376 (93.92%) bytes. The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Segment[]" loaded by "<system class loader>".

Keywords

java.net.URLClassLoader @ 0x9973fd28

com.orientechnologies.agent.profiler.OEnterpriseProfiler

java.util.concurrent.ConcurrentHashMap$Segment[]


Shortest Paths To the Accumulation Point

| Class Name | Shallow Heap (bytes) | Retained Heap (bytes) |
|---|---:|---:|
| java.util.concurrent.ConcurrentHashMap$Segment[16] @ 0x9993ba18 | 80 | 1,546,864,440 |
| segments java.util.concurrent.ConcurrentHashMap @ 0x9993b9e8 | 48 | 1,546,864,488 |
| chronos com.orientechnologies.agent.profiler.OProfilerData @ 0x9993b4c0 | 48 | 1,546,868,384 |
| realTime com.orientechnologies.agent.profiler.OEnterpriseProfiler @ 0x99743f20 | 72 | 1,550,022,376 |
| this$0 com.orientechnologies.agent.profiler.OEnterpriseProfiler$1 @ 0x99743ee8 | 40 | 56 |
| [1] java.util.TimerTask[128] @ 0x99743cd8 | 528 | 528 |
| queue java.util.TaskQueue @ 0x99743cc0 | 24 | 552 |
| queue java.util.TimerThread @ 0x9996d430 Timer-1 Thread | 112 | 624 |
| PROFILER class com.orientechnologies.orient.enterprise.channel.OChannel @ 0x999d3710 | 16 | 88 |

Total: 2 entries

Accumulated Objects

| Class Name | Shallow Heap (bytes) | Retained Heap (bytes) | Percentage |
|---|---:|---:|---:|
| com.orientechnologies.agent.profiler.OEnterpriseProfiler @ 0x99743f20 | 72 | 1,550,022,376 | 93.92% |
| com.orientechnologies.agent.profiler.OProfilerData @ 0x9993b4c0 | 48 | 1,546,868,384 | 93.73% |
| java.util.concurrent.ConcurrentHashMap @ 0x9993b9e8 | 48 | 1,546,864,488 | 93.73% |
| java.util.concurrent.ConcurrentHashMap$Segment[16] @ 0x9993ba18 | 80 | 1,546,864,440 | 93.73% |
| java.util.concurrent.ConcurrentHashMap$Segment @ 0x999530f0 | 40 | 96,876,504 | 5.87% |
| java.util.concurrent.ConcurrentHashMap$Segment @ 0x999487a8 | 40 | 96,840,640 | 5.87% |
| java.util.concurrent.ConcurrentHashMap$Segment @ 0x99953a28 | 40 | 96,803,192 | 5.87% |
| java.util.concurrent.ConcurrentHashMap$Segment @ 0x9993ba68 | 40 | 96,798,064 | 5.87% |
| java.util.concurrent.ConcurrentHashMap$Segment @ 0x99941ee8 | 40 | 96,781,272 | 5.86% |
| java.util.concurrent.ConcurrentHashMap$Segment @ 0x9993c718 | 40 | 96,774,072 | 5.86% |
| java.util.concurrent.ConcurrentHashMap$Segment @ 0x9995eca8 | 40 | 96,769,872 | 5.86% |
| java.util.concurrent.ConcurrentHashMap$Segment @ 0x9994a9b0 | 40 | 96,727,416 | 5.86% |
| java.util.concurrent.ConcurrentHashMap$Segment @ 0x99949920 | 40 | 96,722,536 | 5.86% |
| java.util.concurrent.ConcurrentHashMap$Segment @ 0x9995f9d8 | 40 | 96,709,320 | 5.86% |
| java.util.concurrent.ConcurrentHashMap$Segment @ 0x99945088 | 40 | 96,688,936 | 5.86% |
| java.util.concurrent.ConcurrentHashMap$Segment @ 0x9993f4b8 | 40 | 96,652,560 | 5.86% |
| java.util.concurrent.ConcurrentHashMap$Segment @ 0x99952b50 | 40 | 96,628,864 | 5.86% |
| java.util.concurrent.ConcurrentHashMap$Segment @ 0x999623c0 | 40 | 96,452,520 | 5.84% |
| java.util.concurrent.ConcurrentHashMap$Segment @ 0x9994bc68 | 40 | 96,398,048 | 5.84% |
| java.util.concurrent.ConcurrentHashMap$Segment @ 0x999407b8 | 40 | 96,240,544 | 5.83% |
| Total (16 Segment entries) | 640 | 1,546,864,360 | 93.7% |

Accumulated Objects by Class

| Label | Number of Objects | Used Heap Size (bytes) | Retained Heap Size (bytes) |
|---|---:|---:|---:|
| java.util.concurrent.ConcurrentHashMap$Segment | 16 | 640 | 1,546,864,360 |
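The "Shortest Paths To the Accumulation Point" table bears on question 5. According to the dump, the profiler instance is still reachable from two GC roots: the live Timer-1 thread (through its `TaskQueue` and an anonymous `TimerTask`, `OEnterpriseProfiler$1`, whose `this$0` field points back at the profiler) and the static `PROFILER` field of `OChannel`. As long as either reference exists, the profiler and the `chronos` map it reaches through `realTime` cannot be collected, regardless of what happened to the plugin file. The Java sketch below is a hypothetical stand-in, not the agent's code (`SampledProfiler`, its `chronos` field, and the per-tick key scheme are illustrative): it reproduces the Timer-held pattern and adds a `shutdown()` that cancels the timer and drops buffered data. Clearing any static field that points at such an object would be a separate, additional step.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical stand-in for a Timer-driven profiler; NOT the agent's code.
class SampledProfiler {
    // The Timer's thread is a live thread, hence a GC root: it keeps every
    // scheduled task, and through each task this profiler, reachable.
    private final Timer timer = new Timer("profiler-timer", true);
    private final Map<String, Long> chronos = new HashMap<>();
    private boolean stopped; // guarded by "this"

    void start(long periodMs) {
        // The anonymous TimerTask holds an implicit reference to
        // SampledProfiler.this (the this$0 field seen in a heap dump).
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                // A fresh key per tick: without draining, the map only grows.
                record("sample." + System.nanoTime(), 1L);
            }
        }, 0, periodMs);
    }

    synchronized void record(String key, long value) {
        if (!stopped) {
            chronos.merge(key, value, Long::sum);
        }
    }

    synchronized int size() {
        return chronos.size();
    }

    /** Stops the timer thread and releases all buffered metrics. */
    void shutdown() {
        timer.cancel(); // discards scheduled tasks; the timer thread then exits
        synchronized (this) {
            stopped = true;
            chronos.clear();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SampledProfiler profiler = new SampledProfiler();
        profiler.start(10);
        Thread.sleep(100);
        System.out.println("buffered before shutdown: " + profiler.size());
        profiler.shutdown();
        System.out.println("buffered after shutdown: " + profiler.size()); // 0
    }
}
```

Calling `record` after `shutdown()` is a no-op here, so a task that was already running when `cancel()` was called cannot repopulate the map.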

lvca commented 10 years ago

@lloydchang, the Enterprise Agent collects information until the Workbench fetches it. Do you have the Workbench open? If it is open, this is a memory leak; otherwise, keep a Workbench open with your nodes configured.

In the meantime, we're working on a couple of additional settings to limit the maximum memory used by the profiler. Once that limit is reached, old metrics are simply discarded.

WDYT?
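lvca's description (metrics accumulate until the Workbench fetches them, with a proposed limit beyond which old metrics are dropped) can be illustrated with a small bounded buffer in Java. This is a hypothetical sketch, not the agent's implementation: the class and method names are made up, and it caps the number of entries rather than bytes, because the comment speaks of a memory limit without saying how it is measured. It relies on `LinkedHashMap.removeEldestEntry` to evict the oldest metric once the cap is exceeded, and `fetchAndClear()` models a Workbench fetch draining the buffer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded metric buffer: evicts the oldest entries once
// maxEntries is exceeded, and hands everything over (then clears) on fetch.
class BoundedMetricBuffer {
    private final int maxEntries;
    private final LinkedHashMap<String, Long> metrics;
    private long dropped; // metrics evicted because of the cap

    BoundedMetricBuffer(int maxEntries) {
        if (maxEntries <= 0) {
            throw new IllegalArgumentException("maxEntries must be positive");
        }
        this.maxEntries = maxEntries;
        this.metrics = new LinkedHashMap<String, Long>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                // Called after a new key is inserted; evict the oldest if over the cap.
                boolean evict = size() > BoundedMetricBuffer.this.maxEntries;
                if (evict) {
                    dropped++;
                }
                return evict;
            }
        };
    }

    /** Records a metric, summing values recorded under the same key. */
    synchronized void record(String key, long value) {
        metrics.merge(key, value, Long::sum);
    }

    /** Returns all buffered metrics in insertion order and empties the buffer. */
    synchronized Map<String, Long> fetchAndClear() {
        Map<String, Long> snapshot = new LinkedHashMap<>(metrics);
        metrics.clear();
        return snapshot;
    }

    synchronized int size() {
        return metrics.size();
    }

    synchronized long dropped() {
        return dropped;
    }

    public static void main(String[] args) {
        BoundedMetricBuffer buf = new BoundedMetricBuffer(3);
        for (int i = 1; i <= 5; i++) {
            buf.record("metric" + i, i);
        }
        // metric1 and metric2 were evicted to respect the cap of 3.
        System.out.println(buf.fetchAndClear()); // {metric3=3, metric4=4, metric5=5}
        System.out.println("dropped=" + buf.dropped() + ", size=" + buf.size()); // dropped=2, size=0
    }
}
```

Capping by entry count is only a proxy for a memory limit; a byte-based cap would need an estimate of each entry's size.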

lvca commented 10 years ago

Fixed in the 1.7.9 Agent.

lloydchang commented 10 years ago

We have permanently disabled both the Enterprise Agent and the Profiler: they caused instability in our systems, with warnings, errors, and excessive memory usage. While I understand you are fixing the Enterprise Agent issues in OrientDB 1.7.9, we won't use it; we will keep both the Enterprise Agent and the Profiler disabled, because they cause more problems than they're worth. I hope this direct feedback helps; thanks.

lvca commented 10 years ago

I wrote to you 18 days ago and got no answer on this. We fixed the issue; I can send you the new version of the agent within the next few hours.

lloydchang commented 10 years ago

We used those 18 days to qualify and quantify the stability of our systems with both the Enterprise Agent and the Profiler permanently disabled.