opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] Heap space goes out of memory and the node crashes when taking snapshots #14666

Open nilushancosta opened 2 months ago

nilushancosta commented 2 months ago

Describe the bug

When I try to take a snapshot to an Azure Storage account (using the repository-azure plugin), the data node carrying out the snapshot process crashes, causing the snapshot to fail.
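
For context, the repository was registered and the snapshot triggered through the standard snapshot APIs, roughly along these lines (the repository, container, and snapshot names here are placeholders, not the actual values used):

PUT _snapshot/azure_repo
{
  "type": "azure",
  "settings": {
    "container": "opensearch-snapshots",
    "base_path": "snapshots"
  }
}

PUT _snapshot/azure_repo/snapshot-2024-07-05
{
  "indices": "logs-2024-06-20"
}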

A short while after the snapshot starts, it fails. The snapshot details API shows node shutdown as the reason for the failure, e.g.

"failures": [
        {
          "index": "logs-2024-06-20",
          "index_uuid": "logs-2024-06-20",
          "shard_id": 0,
          "reason": "node shutdown",
          "node_id": "7WaZsHPmQ4apTHxhR4qgGQ",
          "status": "INTERNAL_SERVER_ERROR"
        }
]
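
The failure details above come from the snapshot details API, e.g. (placeholder names as before):

GET _snapshot/azure_repo/snapshot-2024-07-05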

The data node taking the snapshot runs out of heap memory, causing the pod to crash and restart. The following log was printed on a data node before it crashed:

[2024-07-05T06:13:21,807][INFO ][o.o.j.s.JobSweeper       ] [opensearch-data-0] Running full sweep
[2024-07-05T06:13:22,520][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-data-0] Detected cluster change event for destination migration
[2024-07-05T06:13:25,039][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [844187136]
[2024-07-05T06:13:25,059][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did bring memory usage down, before [844187136], after [714688000], allocations [186], duration [20]
[2024-07-05T06:13:30,133][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [1004094976]
[2024-07-05T06:13:30,146][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did bring memory usage down, before [1004094976], after [903587328], allocations [33], duration [13]
[2024-07-05T06:13:35,159][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [898535424]
[2024-07-05T06:13:35,175][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did bring memory usage down, before [898535424], after [662671376], allocations [136], duration [16]
[2024-07-05T06:13:43,346][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [839589376]
[2024-07-05T06:13:43,364][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did bring memory usage down, before [839589376], after [814423552], allocations [191], duration [18]
[2024-07-05T06:13:46,713][WARN ][o.o.m.j.JvmGcMonitorService] [opensearch-data-0] [gc][67511] overhead, spent [985ms] collecting in the last [1.2s]
java.lang.OutOfMemoryError: Java heap space
Dumping heap to data/java_pid30.hprof ...
[2024-07-05T06:13:48,764][WARN ][o.o.m.j.JvmGcMonitorService] [opensearch-data-0] [gc][67512] overhead, spent [1.9s] collecting in the last [2s]
[2024-07-05T06:13:48,764][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [1015699088]
Heap dump file created [1188532983 bytes in 7.458 secs]
[2024-07-05T06:13:56,221][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did not bring memory usage down, before [1015699088], after [1016929696], allocations [1], duration [7457]
[2024-07-05T06:13:57,069][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [opensearch-data-0] fatal error in thread [opensearch[opensearch-data-0][snapshot][T#1]], exiting
java.lang.OutOfMemoryError: Java heap space
    at io.netty.util.internal.PlatformDependent.allocateUninitializedArray(PlatformDependent.java:323) ~[?:?]
    at io.netty.buffer.PoolArena$HeapArena.newByteArray(PoolArena.java:635) ~[?:?]
    at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:646) ~[?:?]
    at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:215) ~[?:?]
    at io.netty.buffer.PoolArena.tcacheAllocateSmall(PoolArena.java:180) ~[?:?]
    at io.netty.buffer.PoolArena.allocate(PoolArena.java:137) ~[?:?]
    at io.netty.buffer.PoolArena.allocate(PoolArena.java:129) ~[?:?]
    at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:378) ~[?:?]
    at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:169) ~[?:?]
    at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:160) ~[?:?]
    at io.netty.handler.ssl.SslHandler$SslEngineType$3.allocateWrapBuffer(SslHandler.java:335) ~[?:?]
    at io.netty.handler.ssl.SslHandler.allocateOutNetBuf(SslHandler.java:2364) ~[?:?]
    at io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:866) ~[?:?]
    at io.netty.handler.ssl.SslHandler.wrapAndFlush(SslHandler.java:821) ~[?:?]
    at io.netty.handler.ssl.SslHandler.flush(SslHandler.java:802) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:925) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:907) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:893) ~[?:?]
    at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.flush(CombinedChannelDuplexHandler.java:531) ~[?:?]
    at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:125) ~[?:?]
    at io.netty.channel.CombinedChannelDuplexHandler.flush(CombinedChannelDuplexHandler.java:356) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:923) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:907) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:893) ~[?:?]
    at reactor.netty.channel.MonoSendMany$SendManyInner.run(MonoSendMany.java:325) ~[?:?]
    at reactor.netty.channel.MonoSendMany$SendManyInner.trySchedule(MonoSendMany.java:434) ~[?:?]
    at reactor.netty.channel.MonoSendMany$SendManyInner.onNext(MonoSendMany.java:223) ~[?:?]
    at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:122) ~[?:?]
    at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:122) ~[?:?]
    at reactor.core.publisher.FluxHandle$HandleSubscriber.onNext(FluxHandle.java:128) ~[?:?]
    at reactor.core.publisher.FluxConcatArray$ConcatArraySubscriber.onNext(FluxConcatArray.java:201) ~[?:?]
    at reactor.core.publisher.FluxIterable$IterableSubscription.slowPath(FluxIterable.java:335) ~[?:?]

Initially the OpenSearch cluster had the following resources allocated for each data node.

requests:
    cpu: "500m"
    memory: "1400Mi"
limits:
    cpu: "800m"
    memory: "1700Mi"

So these data nodes had a heap of 700Mi, as the heap is set to 50% of the requested memory by default. When I tried to take a snapshot of one index (which had a 1.1GB primary shard and 2 replica shards), the data node crashed with the above error.
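
As a sanity check, the effective heap per node can be confirmed with the _cat nodes API, for example:

GET _cat/nodes?v&h=name,node.role,heap.max,heap.current,heap.percent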

When I increased the resources to the following (which results in a heap size of 1Gi), I was able to take the snapshot of that one index.

requests:
    cpu: "500m"
    memory: "2000Mi"
limits:
    cpu: "800m"
    memory: "2100Mi"

But when I tried to take a snapshot of more indexes with the above resources, the data node's Java heap went out of memory again. I tried several times, but every attempt resulted in the snapshot failing due to the same heap memory issue.

Related component

Storage:Snapshots

To Reproduce

Set the following resources on the data nodes:

requests:
    cpu: "500m"
    memory: "1400Mi"
limits:
    cpu: "800m"
    memory: "1700Mi"

Try to take a snapshot of an index whose shard size is more than 1.1GB.
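
The shard sizes can be checked beforehand with, for example (the logs-* pattern is only an illustration matching the daily indexes described above):

GET _cat/shards/logs-*?v&h=index,shard,prirep,store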

Expected behavior

Expected the snapshot to complete

Additional Details

Host/Environment (please complete the following information):
- Kubernetes 1.28.5
- OpenSearch 2.11.1
- OpenSearch Operator 2.6.0

Additional context The OpenSearch cluster runs with 3 dedicated master nodes and 3 dedicated data nodes. Container logs are collected by Fluent Bit and sent to OpenSearch, where they are written to a daily index. Each index has 1 primary shard and 2 replica shards. An Index State Management policy deletes indexes older than 30 days.

nilushancosta commented 2 months ago

I did another snapshot test today.

For this test I set the resources of the data node pods to the following.

Limits:
  cpu: 1500m
  memory: 2100Mi
Requests:
  cpu: 1200m
  memory: 2000Mi

Then I chose an index whose primary store size is 11.9GB.

First I triggered a snapshot of that index to an Azure storage account repository. The snapshot started, stayed in the IN_PROGRESS state for about one minute, and then failed with the same node_shutdown error. The data node hit the same heap out-of-memory error, leading to a pod restart.

Next, I triggered a snapshot of the same index to an S3 repository. This ran for about 6 minutes and the snapshot succeeded. There were no node crashes.
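
The S3 snapshot went through a repository-s3 repository registered in the usual way, roughly (bucket and repository names are placeholders):

PUT _snapshot/s3_repo
{
  "type": "s3",
  "settings": {
    "bucket": "opensearch-snapshots"
  }
}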

So the problem appears to be with snapshots to Azure storage accounts. Could it be a memory leak in the repository-azure plugin?

Furthermore, from what we understand, circuit breakers should prevent the heap from going out of memory, but in this case that did not happen either.
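
For reference, the configured circuit breaker limits and their trip counts on each node can be inspected with:

GET _nodes/stats/breaker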

gulgulni commented 2 months ago

For the setup above, can you share more details on the number of snapshot threads configured per container?

nilushancosta commented 2 months ago

@gulgulni, are you looking for this value? I got it from the cluster settings using the /_cluster/settings API:

"defaults": {
   "snapshot.max_concurrent_operations": "1000"
}

If not, could you let me know where I can get that detail from?
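
If it is the per-node snapshot thread pool you are after, I can pull that from the _cat thread pool API, for example:

GET _cat/thread_pool/snapshot?v&h=node_name,name,size,active,queue,rejected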

nilushancosta commented 1 month ago

I tested this with several OpenSearch versions and these are the results:
- 2.15.0 - issue exists
- 2.11.1 - issue exists
- 2.9.0 - issue exists
- 2.8.0 - issue does not exist; the snapshot works

linuxpi commented 1 month ago

[Storage Triage]

@nilushancosta could you share a histogram of the heap dump? It would help analyze what's causing this.

nilushancosta commented 1 month ago

@linuxpi please find the histogram of the heap dump below. This particular data node pod had 2000Mi as requests and 2100Mi as limits, with -Xmx1000M -Xms1000M.

[Screenshot: heap dump histogram]

Please also find the leak suspect report below.

[Screenshots: leak suspect report]

linuxpi commented 1 month ago

@nilushancosta looks like 'Problem suspect 1' is worth diving into. Can you share the stack trace and the stack trace with involved local variables?

nilushancosta commented 1 month ago

@linuxpi please find the stack trace below

opensearch[opensearch-data-0][snapshot][T#1]
  at java.lang.OutOfMemoryError.<init>()V (OutOfMemoryError.java:48)
  at com.azure.storage.common.Utility.lambda$convertStreamToByteBuffer$1([JIJLjava/io/InputStream;Ljava/io/InputStream;Lreactor/core/publisher/SynchronousSink;)Ljava/io/InputStream; (Utility.java:264)
  at com.azure.storage.common.Utility$$Lambda$6369+0x00007f4b211d95a8.apply(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; ()
  at reactor.core.publisher.FluxGenerate$GenerateSubscription.slowPath(J)V (FluxGenerate.java:271)
  at reactor.core.publisher.FluxGenerate$GenerateSubscription.request(J)V (FluxGenerate.java:213)
  at reactor.core.publisher.FluxFilter$FilterSubscriber.request(J)V (FluxFilter.java:186)
  at reactor.core.publisher.FluxConcatMapNoPrefetch$FluxConcatMapNoPrefetchSubscriber.request(J)V (FluxConcatMapNoPrefetch.java:336)
  at reactor.core.publisher.FluxSwitchOnFirst$AbstractSwitchOnFirstMain.onSubscribe(Lorg/reactivestreams/Subscription;)V (FluxSwitchOnFirst.java:499)
  at reactor.core.publisher.FluxConcatMapNoPrefetch$FluxConcatMapNoPrefetchSubscriber.onSubscribe(Lorg/reactivestreams/Subscription;)V (FluxConcatMapNoPrefetch.java:164)
  at reactor.core.publisher.FluxFilter$FilterSubscriber.onSubscribe(Lorg/reactivestreams/Subscription;)V (FluxFilter.java:85)
  at reactor.core.publisher.FluxGenerate.subscribe(Lreactor/core/CoreSubscriber;)V (FluxGenerate.java:85)
  at reactor.core.publisher.FluxDefer.subscribe(Lreactor/core/CoreSubscriber;)V (FluxDefer.java:54)
  at reactor.core.publisher.Mono.subscribe(Lorg/reactivestreams/Subscriber;)V (Mono.java:4485)
  at reactor.core.publisher.Mono.block()Ljava/lang/Object; (Mono.java:1709)
  at com.azure.storage.common.implementation.StorageImplUtils.blockWithOptionalTimeout(Lreactor/core/publisher/Mono;Ljava/time/Duration;)Ljava/lang/Object; (StorageImplUtils.java:133)
  at com.azure.storage.blob.BlobClient.uploadWithResponse(Lcom/azure/storage/blob/options/BlobParallelUploadOptions;Ljava/time/Duration;Lcom/azure/core/util/Context;)Lcom/azure/core/http/rest/Response; (BlobClient.java:337)
  at org.opensearch.repositories.azure.AzureBlobStore.lambda$writeBlob$18(Lcom/azure/storage/blob/BlobClient;Ljava/io/InputStream;JLcom/azure/storage/blob/models/BlobRequestConditions;Lorg/opensearch/common/collect/Tuple;Ljava/lang/String;)V (AzureBlobStore.java:365)
  at org.opensearch.repositories.azure.AzureBlobStore$$Lambda$6356+0x00007f4b211b87b8.executeCouldThrow()V ()
  at org.opensearch.repositories.azure.SocketAccess.lambda$doPrivilegedVoidException$0(Lorg/opensearch/repositories/azure/SocketAccess$StorageRunnable;)Ljava/lang/Void; (SocketAccess.java:82)
  at org.opensearch.repositories.azure.SocketAccess$$Lambda$6297+0x00007f4b210ea090.run()Ljava/lang/Object; ()
  at java.security.AccessController.executePrivileged(Ljava/security/PrivilegedExceptionAction;Ljava/security/AccessControlContext;Ljava/lang/Class;)Ljava/lang/Object; (AccessController.java:807)
  at java.security.AccessController.doPrivileged(Ljava/security/PrivilegedExceptionAction;)Ljava/lang/Object; (AccessController.java:569)
  at org.opensearch.repositories.azure.SocketAccess.doPrivilegedVoidException(Lorg/opensearch/repositories/azure/SocketAccess$StorageRunnable;)V (SocketAccess.java:81)
  at org.opensearch.repositories.azure.AzureBlobStore.writeBlob(Ljava/lang/String;Ljava/io/InputStream;JZ)V (AzureBlobStore.java:364)
  at org.opensearch.repositories.azure.AzureBlobContainer.writeBlob(Ljava/lang/String;Ljava/io/InputStream;JZ)V (AzureBlobContainer.java:133)
  at org.opensearch.repositories.blobstore.BlobStoreRepository.snapshotFile(Lorg/opensearch/index/snapshots/blobstore/BlobStoreIndexShardSnapshot$FileInfo;Lorg/opensearch/repositories/IndexId;Lorg/opensearch/core/index/shard/ShardId;Lorg/opensearch/snapshots/SnapshotId;Lorg/opensearch/index/snapshots/IndexShardSnapshotStatus;Lorg/opensearch/index/store/Store;)V (BlobStoreRepository.java:3622)
  at org.opensearch.repositories.blobstore.BlobStoreRepository.lambda$executeOneFileSnapshot$70(Lorg/opensearch/index/store/Store;Lorg/opensearch/index/snapshots/IndexShardSnapshotStatus;Lorg/opensearch/core/index/shard/ShardId;Lorg/opensearch/index/snapshots/blobstore/BlobStoreIndexShardSnapshot$FileInfo;Lorg/opensearch/repositories/IndexId;Lorg/opensearch/snapshots/SnapshotId;Ljava/util/concurrent/BlockingQueue;Ljava/util/concurrent/Executor;Lorg/opensearch/core/action/ActionListener;)V (BlobStoreRepository.java:3051)
  at org.opensearch.repositories.blobstore.BlobStoreRepository$$Lambda$6534+0x00007f4b21174918.accept(Ljava/lang/Object;)V ()
  at org.opensearch.action.ActionRunnable$2.doRun()V (ActionRunnable.java:89)
  at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun()V (ThreadContext.java:908)
  at org.opensearch.common.util.concurrent.AbstractRunnable.run()V (AbstractRunnable.java:52)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (ThreadPoolExecutor.java:1136)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run()V (ThreadPoolExecutor.java:635)
  at java.lang.Thread.run()V (Thread.java:833)

The stack trace with involved local variables is expandable as shown below. Could you please let me know which parts you need?

[Screenshots: expandable stack trace with involved local variables]

nilushancosta commented 1 week ago

@linuxpi , did you get a chance to look into this?