opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] OpenSearch temporarily freezes during long import since version 2.14 #14622

Closed HoffmannTom closed 2 months ago

HoffmannTom commented 2 months ago

Describe the bug

We have an import job that indexes around 200 000 documents. A Java client uses the bulk API. After 3 to 4 minutes (around 150 000 - 170 000 documents), the OpenSearch server freezes for 20 - 30 seconds and then continues normal operation.

My observations so far:

  1. The import starts via the Java client bulk API.
  2. After 3-4 minutes, or 150 000 - 170 000 indexed documents, OpenSearch temporarily freezes and CPU usage drops to around 0.
  3. Memory almost stops growing.
  4. The Java client gets a java.net.SocketTimeoutException.
  5. On the server side, I see a corresponding error message: java.net.SocketException: Connection reset
  6. Sometimes I see warnings like "health check of [/var/lib/opensearch/nodes/0] took [46246ms] which is above the warn threshold of [5s]" and "Received response for a request that has timed out, sent [17817ms] ago, timed out [2803ms] ago ...."
  7. After 20 - 30 seconds, the OpenSearch server continues normal operation.
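
One way to check whether the data path itself stalls during the freeze, independently of OpenSearch, is to time small synchronous writes into it. The sketch below is only an approximation of that idea: the function names, the payload size, and the assumption that the health check warning reflects slow writes to the data directory are illustrative and not taken from the OpenSearch source. The 5000 ms default mirrors the warn threshold of [5s] in the log message above.

```python
import os
import tempfile
import time


def probe_fsync_latency(directory, rounds=5, payload=b"x" * 4096):
    """Create, write, fsync and delete a small temp file `rounds` times.

    Returns the wall-clock duration of each round in milliseconds.
    """
    latencies_ms = []
    for _ in range(rounds):
        start = time.monotonic()
        fd, path = tempfile.mkstemp(dir=directory, prefix=".probe-")
        try:
            os.write(fd, payload)
            os.fsync(fd)  # force the data down to the block layer
        finally:
            os.close(fd)
            os.remove(path)
        latencies_ms.append((time.monotonic() - start) * 1000.0)
    return latencies_ms


def slow_rounds(latencies_ms, warn_threshold_ms=5000.0):
    """Return the rounds that took longer than the threshold."""
    return [ms for ms in latencies_ms if ms > warn_threshold_ms]
```

Running `probe_fsync_latency("/var/lib/opensearch/nodes/0", rounds=60)` in a loop while the import is active would show whether plain file writes stall at the same moment the server does, assuming the calling user may write to that directory.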


I checked the /proc/pid/fd entries, whose count stays almost constant at around 1300. The syslog doesn't show any errors. The node error log only shows the entries mentioned above (Connection reset, timeout warnings).
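
For reference, the file descriptor count can be sampled with a few lines of Python. This is a minimal sketch for Linux; the function name and the `proc_root` parameter are illustrative and exist only so the path can be overridden.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Count the entries under <proc_root>/<pid>/fd.

    Each entry corresponds to one open file descriptor of the process.
    Reading another user's process usually requires matching privileges.
    """
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))
```

Calling `count_open_fds(os.getpid())` returns the count for the current process; passing the OpenSearch PID and sampling once per second during the import shows whether the number climbs before a freeze.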

Upgrading to 2.15 didn't solve the issue. The issue didn't show up with version 2.13. OS: Ubuntu 22 LTS. OpenJDK 64-Bit Server VM Temurin-21.0.3+9

Any hints about how to narrow down the issue are welcome.

Related component

Indexing

To Reproduce

There is currently no sample project to reproduce the issue.

Expected behavior

No freezes during (bulk) import.

Additional Details

Plugins: opensearch-alerting, opensearch-anomaly-detection, opensearch-asynchronous-search, opensearch-cross-cluster-replication, opensearch-custom-codecs, opensearch-flow-framework, opensearch-geospatial, opensearch-index-management, opensearch-job-scheduler, opensearch-knn, opensearch-ml, opensearch-neural-search, opensearch-notifications, opensearch-notifications-core, opensearch-observability, opensearch-performance-analyzer, opensearch-reports-scheduler, opensearch-security, opensearch-security-analytics, opensearch-skills, opensearch-sql

Additional context

Attached stack dumps: os-stack-1.txt, os-stack-2.txt, os-stack-3.txt

HoffmannTom commented 2 months ago

Seems to be a hardware issue. Sorry.

HoffmannTom commented 2 months ago

If somebody has a similar issue: ours was caused by the dm-crypt layer. Its read/write workqueues caused the blocking and had to be disabled: https://unix.stackexchange.com/questions/724104/disable-read-write-workqueue-for-ubuntu-full-disk-encryption
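
To verify whether the workqueues are actually disabled on a mapped device, the dm-crypt table line (as printed by `dmsetup table`) can be inspected for the `no_read_workqueue` and `no_write_workqueue` optional parameters. The parser below is a rough sketch: the function names are illustrative, the field positions follow the dm-crypt table layout as I understand it from the kernel documentation, and the sample lines in the usage note are made up.

```python
def crypt_optional_params(table_line):
    """Return the optional parameters of one dm-crypt table line.

    Expected layout (an assumption based on the kernel dm-crypt docs):
    <start> <length> crypt <cipher> <key> <iv_offset> <device> <offset>
    [<#opt_params> <opt_params>...]
    """
    tokens = table_line.split()
    # When listing all devices, dmsetup prefixes each line with "name:".
    if tokens and tokens[0].endswith(":"):
        tokens = tokens[1:]
    if len(tokens) < 8 or tokens[2] != "crypt":
        raise ValueError("not a dm-crypt table line")
    if len(tokens) == 8:
        return []  # no optional parameters present
    count = int(tokens[8])
    return tokens[9:9 + count]


def workqueues_disabled(table_line):
    """True if both the read and the write workqueue are bypassed."""
    params = crypt_optional_params(table_line)
    return "no_read_workqueue" in params and "no_write_workqueue" in params
```

With a made-up line such as `0 417792 crypt aes-xts-plain64 0000 0 8:3 4096 2 no_read_workqueue no_write_workqueue`, `workqueues_disabled` returns `True`; a line that only carries `1 allow_discards` returns `False`.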