opensearch-project / data-prepper

Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0

[BUG] Jackson 2.17.0 LockFreePool causes memory issues #4729

Closed JannikBrand closed 1 month ago

JannikBrand commented 1 month ago

Describe the bug

Data Prepper runs into heap OOM issues. This was observed when ingesting OTel metrics via Data Prepper into OpenSearch (~400 metric data points per second).

image

(The picture shows the summed heap memory of 2 Data Prepper instances. The instances do not crash because circuit breakers are configured and constantly open.) The memory is consumed by objects sitting in the Old Gen space.
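For readers who want to watch the same signal from inside a JVM, here is a minimal sketch that reads Old Gen usage through the standard `java.lang.management` API. The class name `OldGenUsage` and the pool-name matching are assumptions for illustration. Pool names differ between garbage collectors, so the substring check may need adjusting.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class OldGenUsage {
    // Pool names vary by collector ("G1 Old Gen", "PS Old Gen", "Tenured Gen").
    // Matching on these substrings is an assumption and may miss other collectors.
    static boolean isOldGen(String poolName) {
        return poolName.contains("Old Gen") || poolName.contains("Tenured");
    }

    // Sums the used bytes of all heap pools that look like an old generation.
    static long oldGenUsedBytes() {
        long used = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP && isOldGen(pool.getName())) {
                MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
                if (usage != null) {
                    used += usage.getUsed();
                }
            }
        }
        return used;
    }

    public static void main(String[] args) {
        System.out.printf("Old Gen used: %d MiB%n", oldGenUsedBytes() / (1024 * 1024));
    }
}
```

A value that keeps climbing across full GC cycles under a steady workload would match the pattern described above.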

Possible trigger: The issue started to occur when updating from DP version 2.7.0 to 2.8.0.

I created a heap dump:

image

The org.opensearch.dataprepper.pipeline.Pipeline object retains almost all of the memory. Within the dominator tree, I can trace the memory consumption back to the Jackson LockFreePool:

image

There are some known issues with the LockFreePool, e.g. see

if you search around, you'll see that this pool is not working well. Stick with Jackson 2.16 or override the recycler pool to use the thread local one. 2.17.1 goes back to that pool as the default.

I am not sure exactly which Jackson version is used within the OpenSearch sink, but the heap dump at least shows that the LockFreePool is in use.

To Reproduce

Steps to reproduce the behavior:

  1. Set up Data Prepper with the OTel metrics source and processor and the OpenSearch sink.
  2. Ingest OTel metrics.
  3. Wait. (I could not reproduce it reliably in my dev setup; however, it happens frequently for some Data Prepper instances in our environment.)

Expected behavior

For comparison, this is how the heap utilization looks without this issue (same ingestion workload):

image


JannikBrand commented 1 month ago

I think I found the reason why it started to occur for version 2.8.0: The LockFreePool has the parent node OpenSearchClientRefresher (see dominator tree above). The client refresher was added with #4283.

KarstenSchnitter commented 1 month ago

I analysed the issue together with @JannikBrand. We also pulled a thread dump at a time when the circuit breaker was active and no data was being ingested. We could verify that all threads were waiting for data, either from the network or from a queue within Data Prepper. This supports the conclusion that the memory is retained by Jackson's _recyclerPool rather than by in-flight work.
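As an illustration of this kind of check, the following sketch counts live threads by state using the standard `ThreadMXBean`. It is an in-process stand-in, not what was done here: the thread dump mentioned above was pulled from a running instance. The class name `ThreadStateSummary` is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateSummary {
    // Counts live threads of the current JVM, grouped by Thread.State.
    static Map<Thread.State, Integer> countByState() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip lock details, only the thread state is needed here.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        countByState().forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```

If nearly all worker threads show up as WAITING or TIMED_WAITING while the heap stays full, the memory is unlikely to be held by in-flight work.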

KarstenSchnitter commented 1 month ago

This bug might have been introduced by a transitive dependency of armeria v1.28.2, which depends on Jackson 2.17.0. This would explain why the issue with the LockFreePool for the _recyclerPool arises even though the explicit Jackson dependency is specified as Jackson 2.16.2. I am going to verify which version is actually bundled into Data Prepper 2.8.0. The upgrade of armeria to v1.29.0 upgrades Jackson to the fixed version 2.17.1. Therefore, the problem should not be reproducible on the main branch.

KarstenSchnitter commented 1 month ago

I downloaded the Linux distribution of Data Prepper 2.8.0 and found the affected Jackson version 2.17.0 in the libs folder:

image: dataprepper_2.8.0_jackson-dependencies

This indicates a conflict with the explicit jackson-bom 2.16.1 in https://github.com/opensearch-project/data-prepper/blob/2406edc768f369e344e572dea4e0f36aace843c9/build.gradle#L72

As a fix, the armeria version needs to be upgraded to at least 1.29.0. This has already been done by @dlvenable for the main branch. I suggest backporting https://github.com/opensearch-project/data-prepper/pull/4629 to the 2.8 release. Furthermore, the mismatch between the version declared in build.gradle and the Jackson version actually bundled should be addressed.
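The manual check of the libs folder could be automated along these lines. This is a minimal sketch: the class name `JacksonJarCheck` and the jar naming pattern (`jackson-<module>-<version>.jar`) are assumptions, and the directory has to be passed in by the caller.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class JacksonJarCheck {
    // Matches names such as "jackson-core-2.17.0.jar" (assumed naming convention).
    private static final Pattern JACKSON_JAR =
            Pattern.compile("^(jackson-[a-z0-9-]+?)-(\\d+\\.\\d+\\.\\d+)\\.jar$");

    // Extracts the version from a Jackson jar file name, if the name matches.
    static Optional<String> versionOf(String fileName) {
        Matcher m = JACKSON_JAR.matcher(fileName);
        return m.matches() ? Optional.of(m.group(2)) : Optional.empty();
    }

    // Returns the Jackson jars whose version is 2.17.0, the release named in this issue.
    static List<String> affectedJars(List<String> fileNames) {
        List<String> hits = new ArrayList<>();
        for (String name : fileNames) {
            if (versionOf(name).map("2.17.0"::equals).orElse(false)) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: JacksonJarCheck <path-to-libs-folder>");
            return;
        }
        Path libs = Paths.get(args[0]);
        try (Stream<Path> files = Files.list(libs)) {
            List<String> names = files.map(p -> p.getFileName().toString())
                                      .collect(Collectors.toList());
            affectedJars(names).forEach(n -> System.out.println("affected: " + n));
        }
    }
}
```

Running it against an unpacked distribution prints one line per matching jar and nothing if no 2.17.0 artifacts are present.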

dlvenable commented 1 month ago

@KarstenSchnitter, @JannikBrand, thank you for reporting this issue and for the fantastic analysis!

It does appear that Jackson 2.17.1 fixes this. I'm putting together some backport PRs to support a 2.8.1 release to fix this.

Would you be able to test this using a locally-built Data Prepper on the 2.8 branch to see if it resolves the issue?

KarstenSchnitter commented 1 month ago

@dlvenable: I talked to @JannikBrand about testing your change. In principle, we are able to verify whether the upgrade is effective. However, we both have a few days off, so we can only look into it next week.

JannikBrand commented 1 month ago

I checked out the backport/backport-4744-to-2.8 branch from this PR in order to verify the change. I built a Docker image on that branch and let it run. Then I created a heap dump and could confirm, that there is no LockFreePool reference anymore within the path openSearchClientRefresher > currentClient > transport > mapper > innerMapper > jsonProvider > jsonFactory:

image

So, I think the change took effect. I did not perform an actual performance test, since I first would have to reproduce this locally with 2.8.0 and afterwards again with the patched 2.8.1 version. I could still do it next week, or instead I could also just confirm that the memory issues do not reoccur in our environment after upgrading to the patched 2.8.1 version.

dlvenable commented 1 month ago

@JannikBrand, we just released Data Prepper 2.8.1 if you'd like to try to verify that the issue is resolved.

JannikBrand commented 1 month ago

@dlvenable I verified that the same aspect from my last comment is true for the released version:

could confirm, that there is no LockFreePool reference anymore

From our side the issue can be closed. Thanks for processing and fixing it so quickly!

dlvenable commented 1 month ago

You're welcome @JannikBrand. And thank you for the great analysis that helped us resolve this so quickly.