Worker process cause metaspace OOM after the cluster running for a long time

rice668 commented 2 years ago

We run a Trino cluster with 40 nodes, and all of them hung suddenly. The logs show that the cause was a metaspace OOM. I then checked a heap dump and found many ClassLoader objects on the heap whose loaded classes have 0 instances. Note that these classes have not been unloaded. In other words, if we could unload some of these classes, we could reduce the metaspace pressure that leads to the OOM.

I went on to check why the classes had not been unloaded, which can be seen from the path to the GC root (as shown in the figure below). The Guava cache uses StrongValueReference for the map entry value, and I think this is why the classes are not unloaded. (Many Compiler classes in Trino use StrongValueReference this way; I show just one here.)

The root cause is that the Guava caches we use hold their values through StrongValueReference. Trino does limit the size of these caches, but I think this is not enough, because a fixed size is a fairly arbitrary choice. To take an extreme example, if only 10MB of memory were available, a default cache size of 1000 or 10000 would obviously be wrong. Real deployments do not look like this, of course; my point is only that bounding the cache by entry count alone seems inappropriate.
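To make the difference concrete, here is a minimal sketch of a size-bounded cache whose values are held through soft references. It uses only the JDK so that it runs without dependencies, and it is not Trino's code: `SoftValueCache` and its methods are hypothetical names chosen for illustration. With Guava itself, the corresponding switch would be calling `softValues()` on the `CacheBuilder`.

```java
import java.lang.ref.SoftReference;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Illustrative size-bounded LRU cache whose values are only softly reachable. */
final class SoftValueCache<K, V> {
    private final Map<K, SoftReference<V>> entries;

    SoftValueCache(int maximumSize) {
        // accessOrder=true gives LRU ordering; removeEldestEntry enforces the entry bound.
        this.entries = new LinkedHashMap<K, SoftReference<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, SoftReference<V>> eldest) {
                return size() > maximumSize;
            }
        };
    }

    /** Returns the cached value, reloading it if it is absent or was cleared by the GC. */
    synchronized V get(K key, Function<K, V> loader) {
        SoftReference<V> ref = entries.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = loader.apply(key);
            entries.put(key, new SoftReference<>(value));
        }
        return value;
    }

    synchronized int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        SoftValueCache<String, byte[]> cache = new SoftValueCache<>(10_000);
        // The loader stands in for an expensive compilation step.
        byte[] compiled = cache.get("expr-1", key -> new byte[13_240]);
        System.out.println(compiled.length);
    }
}
```

The entry bound still applies, but the garbage collector is also free to clear values under memory pressure. The JDK documents that all softly reachable objects are cleared before the virtual machine throws an OutOfMemoryError. A cleared value is simply reloaded on the next `get`.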

Based on this, we used our own test framework (which records the running time of each query) to do some performance and stability verification on our test cluster. The test method was as follows.

Our test cluster has 5 nodes: 4 workers and a coordinator. We save every query executed in production in a TiDB database, so I picked a fixed set of 20,000 of them and ran it with a concurrency of 50. Because these are actual queries from production, the selected query set is representative.

We compared 2 builds of the code. One uses StrongValueReference (group A), and the other uses SoftValue (group B). The cluster running group A crashed in less than an hour, whereas the cluster running group B ran for 5 hours, and we could no longer see these classes.

For reference, here is the important JVM configuration used in the test:

Coordinator
-server -Xms160G -Xmx160G
-XX:MetaspaceSize=1500M -XX:MaxMetaspaceSize=1500M

Worker
-server -Xms160G -Xmx160G
-XX:MetaspaceSize=300M -XX:MaxMetaspaceSize=300M
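
To watch how close a node gets to the MaxMetaspaceSize limits above, metaspace usage and class unloading counts can be read from the standard JMX beans. This is a minimal sketch, assuming a HotSpot JVM where the memory pool is named "Metaspace"; `MetaspaceUsage` is a hypothetical helper name and not part of Trino.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.Optional;

/** Illustrative helper that reports metaspace usage of the current JVM. */
final class MetaspaceUsage {
    /** Returns usage of the pool named "Metaspace" (HotSpot naming, an assumption), if present. */
    static Optional<MemoryUsage> metaspace() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                // getUsage() may return null if the pool is no longer valid.
                return Optional.ofNullable(pool.getUsage());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        metaspace().ifPresent(usage -> {
            // getMax() is -1 when no MaxMetaspaceSize limit is defined.
            System.out.println("metaspace used bytes: " + usage.getUsed());
            System.out.println("metaspace max bytes:  " + usage.getMax());
        });
        System.out.println("currently loaded classes: " + classes.getLoadedClassCount());
        System.out.println("unloaded classes so far:  " + classes.getUnloadedClassCount());
    }
}
```

If the unloaded class count stays flat while used metaspace keeps growing, that would be consistent with class loaders being kept alive by strong references, as described above.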

Will changing the cache to soft references affect query performance? I am currently testing this. When the data is in, I will post the distribution of query times here. Since the test contains nearly 20,000 queries, it will take some time.

Finally, I would like to mention that the cache used by the PageFunctionCompiler class currently has a size of 10,000. From the figure above, the retained size of one value entry is about 13K (13,240 bytes), compared with a shallow size of 72 bytes. With SoftValue, the virtual machine could therefore reclaim roughly 130MB of cached heap memory before throwing an OutOfMemoryError. This has nothing to do with metaspace, of course, but I still want to mention this optimization.
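
The 130MB estimate follows from multiplying the cache size by the retained size per value. Here is a small sketch of that arithmetic, assuming the cache is full and every value retains the same 13,240 bytes seen in the heap dump; `CacheFootprint` is a hypothetical name used only for this illustration.

```java
/** Illustrative arithmetic for the worst-case retained size of a full cache. */
final class CacheFootprint {
    /** Retained bytes if every one of maxEntries values retains retainedBytesPerValue. */
    static long retainedBytes(long maxEntries, long retainedBytesPerValue) {
        // multiplyExact throws instead of silently overflowing.
        return Math.multiplyExact(maxEntries, retainedBytesPerValue);
    }

    public static void main(String[] args) {
        long bytes = retainedBytes(10_000, 13_240);
        System.out.println(bytes + " bytes, about " + bytes / 1_000_000 + " MB");
    }
}
```

This is an upper bound for one such cache. The actual amount reclaimed depends on how full the cache is and on how much each compiled value really retains.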

I've had some discussions with @findepi in Slack before, and now I'm putting it on GitHub in the hope that more people will engage in this thread.

rice668 commented 2 years ago

Because of the large number of data records (18,110 queries), Excel could not draw them all in one chart, so I split the data into two parts. The horizontal axis shows the query number (e.g. 1 means the first query, 176 means the 176th query). The vertical axis is the query time in milliseconds. Across the execution time distribution of the 18,110 queries, the cache modified to use soft references shows no performance regression.

findepi commented 2 years ago

I've had some discussions with @findepi in slack before

reference: https://trinodb.slack.com/archives/CP1MUNEUX/p1669468559120469

and now I'm putting it on github in the hope that more people engaged in this thread.

thank you!

trinodb / trino

Worker process cause metaspace OOM after the cluster running for a long time #15232