Jakob3xD opened this issue 9 months ago (status: Open)
@Jakob3xD Thanks for filing this issue - really appreciate the test configuration and the graphs on the issue.
Maybe this is related: https://stackoverflow.com/questions/36471723/bcrypt-performance-deterioration
I have looked into it, but the `entropy_avail` of `/dev/random` does not decrease. Also, the short bash script that `cat`s `/dev/random` does not have any effect on the system's load. Running `pv /dev/random > /dev/null` shows a constant output of ~180 MiB/s, unaffected by the bash script from the answer. Therefore I assume this is not related to my issue?
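For reference, a minimal sketch of how the entropy estimate can be polled while the PATCH request is running (assuming a typical Linux host that exposes `/proc/sys/kernel/random/entropy_avail`; the polling interval and iteration count are arbitrary choices, not part of the original test setup):

```golang
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

func main() {
	// Poll the kernel's available-entropy estimate once per second
	// and print it, so a drop during the PATCH request would be visible.
	for i := 0; i < 30; i++ {
		data, err := os.ReadFile("/proc/sys/kernel/random/entropy_avail")
		if err != nil {
			fmt.Println("read error:", err)
			return
		}
		fmt.Printf("%s entropy_avail=%s\n",
			time.Now().Format(time.RFC3339), strings.TrimSpace(string(data)))
		time.Sleep(time.Second)
	}
}
```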
[Triage] Hi @Jakob3xD, thanks for filing this issue. We can go ahead and mark this as triaged, with a closure criterion of either correcting the CPU usage if it is directly in our control, or at least pointing to the correct dependency or issue that is causing this behavior.
I am not sure I understand your sentence correctly. Can you please rephrase it for me?
I opened the issue because I am not familiar with Java and have no further ideas on how to debug this or add more useful information.
Hi @Jakob3xD, the comment I left was part of the weekly triaging process the contributors go through. You can read more about it on the TRIAGING markdown file.
It is not clear whether OpenSearch or something we are dependent on is causing the CPU use. We may not be able to correct the issue if it is from a dependency, but we can at least point to the root issue if that is the case.
I reproduced it on a single-node cluster, and the issue, as @peternied mentioned, is related to BCrypt.
Here is the result without password validation:
Here is the result with password validation:
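For context, BCrypt is deliberately CPU-expensive, so every password validation against the stored internal_users hashes burns a noticeable amount of CPU, and a burst of them across many clients adds up quickly. The following standalone sketch uses `golang.org/x/crypto/bcrypt` purely for illustration (it is not the security plugin's Java code path) to show how the per-comparison cost grows with the cost factor:

```golang
package main

import (
	"fmt"
	"time"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	password := []byte("node-index-password") // hypothetical test password

	// Hash the password at increasing cost factors and time a single
	// comparison at each; every step roughly doubles the CPU spent.
	for cost := 10; cost <= 14; cost++ {
		hash, err := bcrypt.GenerateFromPassword(password, cost)
		if err != nil {
			panic(err)
		}
		start := time.Now()
		if err := bcrypt.CompareHashAndPassword(hash, password); err != nil {
			panic(err)
		}
		fmt.Printf("cost=%d compare took %v\n", cost, time.Since(start))
	}
}
```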
What is the bug? We are currently experiencing the following behavior: one of our clusters goes to 100% CPU usage on all nodes when we send a PATCH request changing internal_users. It does not matter whether we add new users, delete them, or delete non-existing users. After the PATCH request is answered/returned, the load goes up for a few minutes, causing node failures from time to time or rejected requests.
In other clusters we can also see a load increase after the PATCH request returns, but for a much shorter time frame. It seems like this is caused by BCrypt; see the jstack below.
How can one reproduce the bug? I could not reproduce the large time frame, but I could reproduce the shorter load spike (~20s). Steps to reproduce the behavior (the Golang base-load file is below; a sketch of the PATCH step follows it):
Golang file to create the base load:
```golang
package main

import (
	"context"
	"fmt"
	"math/rand"
	"os"
	"strings"
	"time"

	"github.com/opensearch-project/opensearch-go/v3"
	"github.com/opensearch-project/opensearch-go/v3/opensearchapi"
	"github.com/opensearch-project/opensearch-go/v3/opensearchutil"
)

func main() {
	if err := example(); err != nil {
		fmt.Println(fmt.Sprintf("Error: %s", err))
		os.Exit(1)
	}
}

func example() error {
	index := "some-index"
	c := make(chan int)
	// Spawn 300 indexer/info goroutine pairs, each with its own client and user.
	for i := 1; i <= 300; i++ {
		go indexer(initClient(i), index, c)
		go info(initClient(i), c)
	}
	<-c
	return nil
}

func initClient(loop int) *opensearchapi.Client {
	client, err := opensearchapi.NewClient(
		opensearchapi.Config{
			Client: opensearch.Config{
				Addresses: []string{
					"https://loc1-test-opensearch1-n1.example.com",
					"https://loc1-test-opensearch1-n2.example.com",
					"https://loc1-test-opensearch1-n3.example.com",
				},
				Username: fmt.Sprintf("node-index-%d", loop),
				Password: // value truncated in the original report
			},
		},
	)
	if err != nil {
		fmt.Printf("Unexpected error: %s\n", err)
	}
	return client
}

func info(client *opensearchapi.Client, c chan int) {
	ticker := time.NewTicker(time.Duration(500+rand.Intn(500)) * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		client.Info(nil, nil)
	}
	c <- 0
}

func indexer(client *opensearchapi.Client, index string, c chan int) {
	bi, err := opensearchutil.NewBulkIndexer(opensearchutil.BulkIndexerConfig{
		Index:         index,
		Client:        client,
		FlushInterval: time.Duration(rand.Intn(500)+500) * time.Millisecond,
	})
	if err != nil {
		fmt.Printf("Unexpected error: %s\n", err)
	}
	ticker := time.NewTicker(time.Duration(rand.Intn(100)+100) * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		err := bi.Add(context.Background(), opensearchutil.BulkIndexerItem{
			Action: "index",
			Body:   strings.NewReader(`{"title":"test", "doc": "90"}`),
		})
		if err != nil {
			fmt.Printf("Unexpected error: %s\n", err)
		}
	}
	if err := bi.Close(context.Background()); err != nil {
		fmt.Printf("Unexpected error: %s\n", err)
	}
	c <- 0
}
```
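With the base-load script running, the load spike can then be triggered by patching the internal users. A hedged sketch of that step follows; the endpoint path and JSON Patch body reflect the security plugin's REST API as I understand it, and the host, credentials, and user name are placeholders:

```golang
package main

import (
	"bytes"
	"crypto/tls"
	"fmt"
	"net/http"
)

func main() {
	// JSON Patch removing a (possibly non-existing) internal user, which is
	// enough to trigger the load spike described above.
	body := []byte(`[{"op": "remove", "path": "/user-that-may-not-exist"}]`)

	req, err := http.NewRequest(http.MethodPatch,
		"https://loc1-test-opensearch1-n1.example.com/_plugins/_security/api/internalusers",
		bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.SetBasicAuth("admin", "admin-password") // placeholder admin credentials

	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // test cluster only
	}}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```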
What is the expected behavior? I would expect a lower load increase on all nodes when users change.
What is your host/environment?
Do you have any screenshots?
Do you have any additional context? Jstack during the high load time frame: opensearch-high-load-jstack.txt