thegraphnetwork-literev / es-journals

Elasticsearch Exited with Error Code 137 - Potential Out of Memory Issue #16

Open esloch opened 4 months ago

esloch commented 4 months ago

Description:

Create a script to ensure the Elasticsearch service (Es-Journals) on the Staging server remains operational and data integrity is maintained. This script should address the following issues:

  1. Healthcheck Monitoring:

    • Continuously monitor the health of the Elasticsearch container.
    • Send an alert when the container is offline or encounters an error.
  2. Service Restoration:

    • Attempt to automatically restore the Elasticsearch service when the container is detected as offline.
    • Log the actions taken to restore the service and notify the appropriate personnel.
  3. Scheduler Integration:

    • Before executing the download scheduler, check the status of the Elasticsearch container.
    • If the service is offline, attempt to restore it before proceeding with any database operations.
  4. Data Integrity:

    • Prevent the scheduler from downloading and overwriting previous databases if the Elasticsearch service is not operational.
    • Ensure that only the latest successfully downloaded database is indexed when the service is restored.
    • Implement a mechanism to collect and verify all files from a database checkpoint up to the current date, ensuring the indexed data matches the latest data collected from medRxiv and bioRxiv.
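
As a starting point, items 1 to 3 could be sketched roughly as below. This is only a sketch under assumptions that are not confirmed anywhere in this issue: Elasticsearch is reachable at `http://localhost:9200` without authentication, the container is named `es`, and the host can run `docker restart`. The names `es_is_healthy`, `restart_container` and `run_if_es_available` are hypothetical. The only Elasticsearch API used is `GET /_cluster/health`.

```python
import http.client
import json
import logging
import subprocess
import time
import urllib.request

log = logging.getLogger("es_guard")

ES_URL = "http://localhost:9200"  # assumption: default HTTP port, no auth
CONTAINER = "es"                  # hypothetical container name


def es_is_healthy(base_url=ES_URL, timeout=5.0):
    """Return True if the cluster health endpoint reports green or yellow."""
    try:
        url = f"{base_url}/_cluster/health"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = json.load(resp).get("status")
    except (OSError, ValueError, AttributeError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or an unexpected body.
        return False
    return status in ("green", "yellow")


def restart_container(name=CONTAINER):
    """Ask Docker to restart the container and log the action taken."""
    log.warning("Restarting container %s", name)
    subprocess.run(["docker", "restart", name], check=False)


def run_if_es_available(job, check=es_is_healthy, restore=restart_container,
                        attempts=3, wait_seconds=10.0):
    """Run job() only if Elasticsearch is up, trying to restore it first.

    Raises RuntimeError without calling job() when the service stays down,
    so the previously downloaded database is not overwritten.
    """
    for _ in range(attempts):
        if check():
            return job()
        restore()
        time.sleep(wait_seconds)
    if check():
        return job()
    raise RuntimeError("Elasticsearch is still unavailable; download skipped")
```

The scheduler would then wrap its download step, for example `run_if_es_available(download_latest)`, where `download_latest` stands in for whatever the existing download entry point is. The checkpoint verification from item 4 is not covered here.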

Problem Background:

The Elasticsearch service has been encountering issues where it exits with code 137, likely due to running out of memory. This results in the container crashing and the service becoming unavailable. Despite the service being down, the scheduler continues to download and overwrite previous databases, leading to incomplete reindexing when the service is restored.

Logs:

{"@timestamp":"2024-07-08T01:38:00.026Z", "log.level": "INFO", "message":"Successfully completed [ML] maintenance task: triggerDeleteExpiredDataTask", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es][ml_utility][T#1]","log.logger":"org.elasticsearch.xpack.ml.MlDailyMaintenanceService","elasticsearch.cluster.uuid":"XN2GC8otQa2ohR4EGWjZYw","elasticsearch.node.id":"ikL1Yy1lRSGdl1uNYkMlAg","elasticsearch.node.name":"es","elasticsearch.cluster.name":"docker-cluster"}

ERROR: Elasticsearch exited unexpectedly, with exit code 137

Potential Cause:

Exit code 137 (128 + 9) indicates the process was killed by SIGKILL, most likely because the container ran out of memory.
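
For reference, the exit code can be decoded with the usual shell convention that a status above 128 means "terminated by signal (status - 128)". A small sketch, where `describe_exit_code` is a hypothetical helper name:

```python
import signal


def describe_exit_code(code):
    """Translate a container/shell exit status into a readable description."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = "unknown signal"
        return f"killed by {name} (signal {signum})"
    return f"exited with status {code}"


print(describe_exit_code(137))  # killed by SIGKILL (signal 9)
```

SIGKILL by itself does not prove an out-of-memory kill. Assuming Docker is the runtime, `docker inspect --format '{{.State.OOMKilled}}' <container>` reports whether the container was stopped by the OOM killer.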

xmnlab commented 4 months ago

When the service is not reachable, it could raise an issue on Sentry, so the failure is logged on a platform that everyone has access to.
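
A minimal sketch of what that could look like, assuming the `sentry_sdk` package is installed and `sentry_sdk.init(dsn=...)` has already been called at startup with the project's DSN. `report_es_unreachable` is a hypothetical helper name; the `capture` parameter exists only so the function can be exercised without Sentry.

```python
def report_es_unreachable(detail, capture=None):
    """Send an error-level event describing the outage and return the message.

    `capture` defaults to sentry_sdk.capture_message; pass another callable
    with the same (message, level=...) shape to test without Sentry.
    """
    if capture is None:
        import sentry_sdk  # third-party; assumes sentry_sdk.init() ran earlier
        capture = sentry_sdk.capture_message
    message = f"Elasticsearch (es-journals) is unreachable: {detail}"
    capture(message, level="error")
    return message
```

The healthcheck script could call this with a short detail string, such as the container's exit code, whenever the service is found offline.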