opensearch-project / security

🔐 Secure your cluster with TLS, numerous authentication backends, data masking, audit logging as well as role-based access control on indices, documents, and fields
https://opensearch.org/docs/latest/security-plugin/index/
Apache License 2.0

[BUG] High load when updating internal users using PATCH #4008

Open Jakob3xD opened 9 months ago

Jakob3xD commented 9 months ago

What is the bug? We are currently experiencing the following behavior: one of our clusters goes to 100% CPU usage on all nodes when we send a PATCH request changing internal_users. It does not matter whether we add new users, delete them, or delete non-existing users. After the PATCH request is answered/returned, the load goes up for a few minutes, causing occasional node failures or rejected requests.

In other clusters we can also see a load increase after the PATCH request returns, but for a much shorter time frame. It seems like this is caused by BCrypt; see the jstack output below.
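For context on why BCrypt dominates the jstack output: a single BCrypt verification is deliberately CPU-expensive, so any event that forces many clients to re-verify their password hashes at the same time (for example, when cached credentials are invalidated after a configuration change) multiplies that cost across all concurrent connections. The following is only an illustrative sketch using Go's golang.org/x/crypto/bcrypt package, not the Java implementation the Security plugin uses:

```golang
package main

import (
	"fmt"
	"time"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	// Hash a password at the library's default cost; the cost factor used by
	// the Security plugin may differ.
	hash, err := bcrypt.GenerateFromPassword([]byte("some-password"), bcrypt.DefaultCost)
	if err != nil {
		panic(err)
	}

	// A single verification typically takes on the order of tens of
	// milliseconds of pure CPU time.
	start := time.Now()
	if err := bcrypt.CompareHashAndPassword(hash, []byte("some-password")); err != nil {
		panic(err)
	}
	fmt.Printf("one bcrypt verification took %s\n", time.Since(start))
}
```

With roughly 300 clients authenticating against every node, even a short window in which each request pays that cost could be enough to saturate 8 vCPUs.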

How can one reproduce the bug? I could not reproduce the long high-load time frame, but I could reproduce the shorter load spike (~20s). Steps to reproduce the behavior:

  1. Set up a 3-node OpenSearch cluster: 3x 8 vCPU, 30 GB RAM (OpenSearch has 16 GB RAM).
  2. Create some users that are allowed to index and request info (~300 users).
  3. Create load on the cluster with a lot of bulk and info requests; see the Go program below.
  4. Send a PATCH request that, for example, deletes 40 users that do not exist (a sketch of such a request follows the base-load program below).
  5. Watch the load go up on all nodes.

Go program to create the base load:

```golang
package main

import (
	"context"
	"fmt"
	"math/rand"
	"os"
	"strings"
	"time"

	"github.com/opensearch-project/opensearch-go/v3"
	"github.com/opensearch-project/opensearch-go/v3/opensearchapi"
	"github.com/opensearch-project/opensearch-go/v3/opensearchutil"
)

func main() {
	if err := example(); err != nil {
		fmt.Printf("Error: %s\n", err)
		os.Exit(1)
	}
}

func example() error {
	index := "some-index"
	c := make(chan int)
	// Start 300 pairs of goroutines: one bulk indexer and one info poller per user.
	for i := 1; i <= 300; i++ {
		go indexer(initClient(i), index, c)
		go info(initClient(i), c)
	}
	<-c
	return nil
}

func initClient(loop int) *opensearchapi.Client {
	client, err := opensearchapi.NewClient(
		opensearchapi.Config{
			Client: opensearch.Config{
				Addresses: []string{
					"https://loc1-test-opensearch1-n1.example.com",
					"https://loc1-test-opensearch1-n2.example.com",
					"https://loc1-test-opensearch1-n3.example.com",
				},
				Username: fmt.Sprintf("node-index-%d", loop),
				Password: "<password>", // value omitted in the original report
			},
		},
	)
	if err != nil {
		fmt.Printf("Unexpected error: %s\n", err)
	}
	return client
}

// info sends an info request every 500-1000 ms.
func info(client *opensearchapi.Client, c chan int) {
	ticker := time.NewTicker(time.Duration(500+rand.Intn(500)) * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		client.Info(nil, nil)
	}
	c <- 0
}

// indexer adds a small document to a bulk indexer every 100-200 ms.
func indexer(client *opensearchapi.Client, index string, c chan int) {
	bi, err := opensearchutil.NewBulkIndexer(opensearchutil.BulkIndexerConfig{
		Index:         index,
		Client:        client,
		FlushInterval: time.Duration(rand.Intn(500)+500) * time.Millisecond,
	})
	if err != nil {
		fmt.Printf("Unexpected error: %s\n", err)
	}
	ticker := time.NewTicker(time.Duration(rand.Intn(100)+100) * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		err := bi.Add(context.Background(), opensearchutil.BulkIndexerItem{
			Action: "index",
			Body:   strings.NewReader(`{"title":"test", "doc": "90"}`),
		})
		if err != nil {
			fmt.Printf("Unexpected error: %s\n", err)
		}
	}
	if err := bi.Close(context.Background()); err != nil {
		fmt.Printf("Unexpected error: %s\n", err)
	}
	c <- 0
}
```
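For reference, the PATCH request in step 4 is a JSON Patch against the Security plugin's internal users API (`/_plugins/_security/api/internalusers`). Below is a minimal sketch using Go's standard `net/http` package; the node URL, admin credentials, and user names are placeholders, and certificate verification is disabled only to keep the example short.

```golang
package main

import (
	"bytes"
	"crypto/tls"
	"fmt"
	"net/http"
)

func main() {
	// JSON Patch body: remove a few (possibly non-existing) internal users.
	body := []byte(`[
		{"op": "remove", "path": "/node-index-1"},
		{"op": "remove", "path": "/node-index-2"}
	]`)

	req, err := http.NewRequest(http.MethodPatch,
		"https://loc1-test-opensearch1-n1.example.com/_plugins/_security/api/internalusers",
		bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("admin", "<admin-password>")
	req.Header.Set("Content-Type", "application/json")

	// Skip certificate verification for brevity only; do not do this in production.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```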

What is the expected behavior? I would expect a lower load increase on all nodes when users change.

What is your host/environment?

Do you have any screenshots? (Graphs attached to the original issue.)

Do you have any additional context? jstack output captured during the high-load time frame: opensearch-high-load-jstack.txt

peternied commented 9 months ago

@Jakob3xD Thanks for filing this issue - really appreciate the test configuration and the graphs on the issue.

peternied commented 9 months ago

Maybe this is related https://stackoverflow.com/questions/36471723/bcrypt-performance-deterioration

Jakob3xD commented 9 months ago

> Maybe this is related https://stackoverflow.com/questions/36471723/bcrypt-performance-deterioration

I have looked into it, but the entropy_avail of /dev/random does not decrease. Also, the short bash script that cats /dev/random does not have any effect on the system's load. Running pv /dev/random > /dev/null shows a constant output of ~180 MiB/s, unaffected by the bash script from the answer. Therefore I assume this is not related to my issue?

stephen-crawford commented 9 months ago

[Triage] Hi @Jakob3xD, thanks for filing this issue. We can go ahead and mark this as triaged, with a closure criterion of either correcting the CPU usage, if it is directly in our control, or at least pointing to the dependency or issue that is causing this behavior.

Jakob3xD commented 9 months ago

> We can go ahead and mark this as triaged, with a closure criterion of either correcting the CPU usage, if it is directly in our control, or at least pointing to the dependency or issue that is causing this behavior.

I am not sure I understand your sentence correctly. Can you please rephrase it for me?

I opened the issue because I am not familiar with Java and have no further ideas on how to debug this or add more useful information.

stephen-crawford commented 9 months ago

Hi @Jakob3xD, the comment I left was part of the weekly triage process the contributors go through. You can read more about it in the TRIAGING markdown file.

It is not clear whether OpenSearch itself or one of its dependencies is causing the CPU usage. We may not be able to correct the issue if it comes from a dependency, but in that case we can at least point to the root cause.

willyborankin commented 8 months ago

I reproduced it on a 1-node cluster, and the issue is, as @peternied mentioned, related to BCrypt. Here is the result without password validation: (screenshot attached to the original issue)

Here is the result with password validation: (screenshot attached to the original issue)
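To put rough numbers on the difference the two screenshots illustrate, one can compare an authentication path that runs the full BCrypt check against one that skips it. This is only an illustrative Go sketch, not the plugin's actual code path:

```golang
package main

import (
	"crypto/subtle"
	"fmt"
	"testing"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	hash, err := bcrypt.GenerateFromPassword([]byte("some-password"), bcrypt.DefaultCost)
	if err != nil {
		panic(err)
	}

	// "With password validation": every check runs the full BCrypt function.
	withBcrypt := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = bcrypt.CompareHashAndPassword(hash, []byte("some-password"))
		}
	})

	// "Without password validation": a plain constant-time byte comparison as a
	// stand-in for skipping the BCrypt work entirely.
	withoutBcrypt := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = subtle.ConstantTimeCompare(hash, hash)
		}
	})

	fmt.Println("with bcrypt:   ", withBcrypt)
	fmt.Println("without bcrypt:", withoutBcrypt)
}
```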