pegasystems / pega-helm-charts

Orchestrate a Pega Platform™ deployment by using Docker, Kubernetes, and Helm to take advantage of Pega Platform Cloud Choice flexibility.
https://community.pega.com/knowledgebase/articles/cloud-choice
Apache License 2.0

Persistence of GC logs and Heap dumps in case of system failures. #231

Closed yashwanth-pega closed 4 months ago

yashwanth-pega commented 3 years ago

Requirement 1: Currently, garbage collection (GC) logs are written continuously to a configured location (say, /usr/local/tomcat/logs/) on the local file system. After a system failure or JVM crash, these logs help us diagnose the issue from a garbage collection viewpoint. However, when the pod crashes because of such a failure, the logs are lost.

Requirement 2: When dealing with system failures related to OutOfMemoryError (OOME), a heap dump gives us valuable insight into the issue. Collecting heap dumps at frequent intervals isn't practical. Fortunately, the JVM collects a heap dump automatically on an OOME if we use the setting -XX:+HeapDumpOnOutOfMemoryError. However, when the pod crashes, the collected dumps are lost, just like the GC logs.

Possible solutions:

  1. A mechanism to persist this data frequently (apparently overkill).
  2. Write the data to persistent storage (say, S3 buckets) before the pod crashes.

The JVM provides a way to execute a command or script before it goes down with an OOME, via the setting -XX:OnOutOfMemoryError=. This gives us a chance to run a script that persists the required data (GC logs and heap dump) before the pod crashes.
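For illustration, here is a minimal sketch of such a script. It assumes a persistent volume is already mounted in the pod and that python3 is available in the image. The script name, the mount path, and the file patterns are hypothetical and are not something the charts or the image provide.

```python
#!/usr/bin/env python3
"""Copy GC logs and heap dumps to a persistent directory.

Hypothetical hook meant to be launched through -XX:OnOutOfMemoryError.
"""
import shutil
import sys
import time
from pathlib import Path

# File patterns are assumptions; adjust them to the names your JVM settings produce.
PATTERNS = ("*.hprof", "gc*.log*")


def persist_diagnostics(source_dir, target_root, patterns=PATTERNS):
    """Copy matching files into a timestamped folder and return the copied paths."""
    source = Path(source_dir)
    target = Path(target_root) / time.strftime("%Y%m%dT%H%M%S")
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        for path in sorted(source.glob(pattern)):
            if path.is_file():
                # copy2 keeps timestamps, which helps when correlating with other logs.
                copied.append(Path(shutil.copy2(path, target / path.name)))
    return copied


# Usage: persist_diagnostics.py <log dir> <persistent dir>
if __name__ == "__main__" and len(sys.argv) == 3:
    for copied_path in persist_diagnostics(sys.argv[1], sys.argv[2]):
        print(copied_path)
```

It could then be wired in with something like -XX:OnOutOfMemoryError="python3 /opt/scripts/persist_diagnostics.py /usr/local/tomcat/logs /mnt/diagnostics", where both paths after the script name are placeholders. Copying to a mounted volume keeps the sketch simple; uploading to an S3 bucket would need credentials and a client inside the container.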

Note: Also consider how this problem is dealt with in Pega cloud systems.

yashwanth-pega commented 3 years ago

Refer to the respective JVM parameters in the following document: https://pegasystems.sharepoint.com/sites/ScalableExecutionEngineEaaS/SitePages/JVM-Flags-Analysis.aspx?web=1

yashwanth-pega commented 3 years ago

Note that because the SDEA images depend on JDK 8, the JVM argument for GC log collection, which requires JDK 9+, was reverted. Refer to the following: https://github.com/pegasystems/docker-pega-web-ready/pull/96

This (the addition of the GC logs flag) needs to be addressed as part of this issue or a separate one, as the flag is a prerequisite for one of the stated requirements.

yashwanth-pega commented 3 years ago

Related issue: https://github.com/pegasystems/pega-helm-charts/issues/125

APegaDavis commented 1 year ago

@yashwanth-pega could you override the heap dump location via a custom env variable (part of the tier definition and used here) to write to a persistent volume?
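A rough sketch of that idea, assuming the dump directory arrives through an environment variable: HEAP_DUMP_DIR is a hypothetical name, not a variable defined by the charts, and the helper below is illustrative only. The JVM flags themselves are standard HotSpot options, with the GC log flag differing between JDK 8 and JDK 9+.

```python
import os


def diagnostic_jvm_options(dump_dir, jdk_major=8):
    """Return JVM flags that write the heap dump and GC log under dump_dir."""
    options = [
        "-XX:+HeapDumpOnOutOfMemoryError",
        f"-XX:HeapDumpPath={dump_dir}",
    ]
    if jdk_major >= 9:
        # Unified logging syntax, available from JDK 9 onward.
        options.append(f"-Xlog:gc*:file={dump_dir}/gc.log")
    else:
        # JDK 8 style GC log flag.
        options.append(f"-Xloggc:{dump_dir}/gc.log")
    return options


# HEAP_DUMP_DIR is a hypothetical variable name; the fallback is the local log directory.
dump_dir = os.environ.get("HEAP_DUMP_DIR", "/usr/local/tomcat/logs")
print(" ".join(diagnostic_jvm_options(dump_dir)))
```

If the variable points at a mounted persistent volume, both the dump and the GC log would survive a pod restart without a separate copy step.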

kishorv10 commented 4 months ago

Fixed in #726