Open pmarjou opened 1 year ago
In addition, I checked as described in @systemosyn's comment https://github.com/opensearch-project/performance-analyzer/issues/77#issuecomment-1310019450 and I only see start events in the /dev/shm/performanceanalyzer temp files:
Hey @pmarjou thank you for submitting your findings.
The erroneous response is definitely related to the missing finish events you pointed out, because without them, there's no trigger for the following line and nothing gets persisted inside the SQLite db for later querying.
I reproduced the error using the opensearchproject/opensearch:2.6.0 image and went back to my locally deployed cluster to debug. But there, finish events were present and the request was working as expected.
I built PA and PA-RCA from the 2.6 branch and installed them with our standard build process, which starts from a clean image and the OpenSearch minimal build and then installs PA and PA-RCA. Again, everything worked well. I first suspected a build problem with PA or PA-RCA in the official image, but found nothing suspicious.
Finally, the only obvious remaining difference between these two setups is the set of plugins installed by default on the standard opensearch image that we pull.
By uninstalling them in batches, I found a somewhat convoluted root cause: the presence of the Security plugin.
[opensearch@f440328382f9 ~]$ ls plugins/
opensearch-alerting opensearch-neural-search
opensearch-anomaly-detection opensearch-notifications
opensearch-asynchronous-search opensearch-notifications-core
opensearch-cross-cluster-replication opensearch-observability
opensearch-geospatial opensearch-performance-analyzer
opensearch-index-management opensearch-reports-scheduler
opensearch-job-scheduler opensearch-security
opensearch-knn opensearch-security-analytics
opensearch-ml opensearch-sql
[opensearch@f440328382f9 ~]$ cd /dev/shm/performanceanalyzer/
[opensearch@f440328382f9 performanceanalyzer]$ cat * | grep shardbulk
^threads/-1/shardbulk/0/start
^threads/-1/shardbulk/1/start
^threads/-1/shardbulk/2/start
^threads/-1/shardbulk/3/start
[opensearch@0f18089877e6 performanceanalyzer]$ ls /usr/share/opensearch/plugins/
opensearch-alerting opensearch-neural-search
opensearch-anomaly-detection opensearch-notifications
opensearch-asynchronous-search opensearch-notifications-core
opensearch-cross-cluster-replication opensearch-observability
opensearch-geospatial opensearch-performance-analyzer
opensearch-index-management opensearch-reports-scheduler
opensearch-job-scheduler opensearch-security-analytics
opensearch-knn opensearch-sql
opensearch-ml
[opensearch@0f18089877e6 performanceanalyzer]$ cat * | grep shardbulk
^threads/-1/shardbulk/1/start
^threads/-1/shardbulk/1/finish
^threads/-1/shardbulk/2/start
^threads/-1/shardbulk/2/finish
Above are the recorded shardbulk events of the OpenSearch setups with and without the Security plugin installed, respectively. All expected events are present in the latter. This behavior was consistent across multiple tests.
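To make the comparison above repeatable, here is a minimal Python sketch that pairs start and finish events and lists the shardbulk ids left without a finish. The function names and the regular expression are my own assumptions based only on the line format shown in the `grep` output; the real temp files may contain other content around these lines.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed line format, taken from the grep output above,
# e.g. "^threads/-1/shardbulk/1/start".
EVENT_RE = re.compile(r"^\^?threads/[^/]+/shardbulk/(\d+)/(start|finish)$")


def unmatched_starts(lines):
    """Return sorted shardbulk ids that have a start event but no finish event."""
    seen = defaultdict(set)
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match:
            seen[int(match.group(1))].add(match.group(2))
    return sorted(
        op_id for op_id, kinds in seen.items()
        if "start" in kinds and "finish" not in kinds
    )


def read_event_lines(directory):
    """Yield every line of every regular file in a directory (hypothetical helper)."""
    for path in Path(directory).iterdir():
        if path.is_file():
            yield from path.read_text(errors="replace").splitlines()


# Samples copied from the two terminal sessions above.
with_security = [
    "^threads/-1/shardbulk/0/start",
    "^threads/-1/shardbulk/1/start",
    "^threads/-1/shardbulk/2/start",
    "^threads/-1/shardbulk/3/start",
]
without_security = [
    "^threads/-1/shardbulk/1/start",
    "^threads/-1/shardbulk/1/finish",
    "^threads/-1/shardbulk/2/start",
    "^threads/-1/shardbulk/2/finish",
]

print(unmatched_starts(with_security))     # [0, 1, 2, 3]
print(unmatched_starts(without_security))  # []
```

On a running node, `unmatched_starts(read_event_lines("/dev/shm/performanceanalyzer"))` would run the same check against the live temp files, assuming they are readable as text.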
And these are the logs of the respective setups with the DEBUG option enabled, with and without the Security plugin. Nothing obvious stands out at first glance, so I'll have to dig into the details. Comments and suggestions are welcome.
Context:
ShardBulk start and finish events are both delivered to the PA plugin through TransportChannel, but in different ways, which would explain why the former works and the latter does not. As usual, channels are reachable through TransportRequestHandlers, which are supplied by TransportInterceptors; these are registered in OpenSearch core during initialization of NetworkModule.
Without the Security plugin installed, the interceptor chain does not include interceptors from Security, channels from PerformanceAnalyzer are successfully reached, and finish events are emitted. With Security interceptors registered, my assumption is that the chain somehow gets broken and handlers from PerformanceAnalyzer are never reached. These are assumptions based on some findings and may not be true. Feedback appreciated.
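To illustrate the kind of breakage hypothesized above, here is a toy Python model of an interceptor chain. Every class and function in it is hypothetical and is not the OpenSearch API. It only shows one conceivable way a second interceptor could leave start events intact while finish events disappear; it is not a diagnosis of what the Security plugin actually does.

```python
# Toy model only: hypothetical names, NOT the OpenSearch transport API.

class BaseChannel:
    """Innermost channel that actually delivers the response."""
    def __init__(self):
        self.sent = []

    def send_response(self, response):
        self.sent.append(response)


class RecordingChannel:
    """Wrapper that records a finish event before delegating the response."""
    def __init__(self, inner, events, op_id):
        self.inner = inner
        self.events = events
        self.op_id = op_id

    def send_response(self, response):
        self.events.append(f"shardbulk/{self.op_id}/finish")
        self.inner.send_response(response)


def recording_interceptor(events):
    """Records a start event and wraps the channel so a finish can be recorded."""
    def intercept(handler):
        def wrapped(request, channel):
            events.append(f"shardbulk/{request['id']}/start")
            handler(request, RecordingChannel(channel, events, request["id"]))
        return wrapped
    return intercept


def unwrapping_interceptor(handler):
    """Assumed failure mode: passes the innermost channel on, dropping wrappers."""
    def wrapped(request, channel):
        while hasattr(channel, "inner"):
            channel = channel.inner
        handler(request, channel)
    return wrapped


def bulk_handler(request, channel):
    channel.send_response({"id": request["id"], "ok": True})


def build(handler, interceptors):
    """Wrap the handler so the first interceptor in the list is outermost."""
    for intercept in reversed(interceptors):
        handler = intercept(handler)
    return handler


events_plain, events_broken = [], []
plain_channel, broken_channel = BaseChannel(), BaseChannel()

build(bulk_handler, [recording_interceptor(events_plain)])({"id": 1}, plain_channel)
build(bulk_handler, [recording_interceptor(events_broken), unwrapping_interceptor])(
    {"id": 1}, broken_channel
)

print(events_plain)   # ['shardbulk/1/start', 'shardbulk/1/finish']
print(events_broken)  # ['shardbulk/1/start']
```

In both runs the response is still delivered on the base channel, which matches the observation that bulk requests succeed while only the finish bookkeeping goes missing.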
Thank you @Tjofil for looking into the issue, please keep the thread updated
@peternied, @scrawfor99 Can you help here? The Security Plugin is somehow interfering with the Performance Analyzer Listener for shardbulk close events.
When was the last version where this was working successfully? It seems like this has happened before and we are now reopening it as still an issue. Did it get fixed back when it was first reported and is now failing again? Or was it never fully fixed?
@davidlago It was actually never fully fixed. There were other bugs, like 283, on the Reader side that caused the same effect as this one. We fixed them and tested the fix, unfortunately without the Security plugin, so we didn't catch this one.
@Tjofil I've added a diagram of how access is managed with the Security Plugin. We might need more context to know for sure. When PA is attempting to write metric information, the context of the request does not have permission to invoke the transport action to save the metric data. There are three ways this can be resolved
ThreadContext.stashContext() to allow access
sequenceDiagram
participant Client
participant OpenSearch
participant SecurityPlugin
participant Cluster as Plugin
Client->>OpenSearch: Request
OpenSearch->>SecurityPlugin: Request with no Auth info
SecurityPlugin->>SecurityPlugin: Add Auth information to request context
OpenSearch->>Cluster: Client Request
Cluster->>SecurityPlugin: Execute transport layer action
SecurityPlugin->>SecurityPlugin: Check if action is allowed
alt Allowed
SecurityPlugin->>OpenSearch: Continue request
OpenSearch-->>Cluster: Transport layer action result
else Denied
SecurityPlugin-->>OpenSearch: Return 403 Forbidden
OpenSearch-->>Client: 403 Forbidden
end
alt Plugin run outside user context
Cluster->>Cluster: Stash context
Cluster->>SecurityPlugin: Execute transport layer action outside user context
Cluster-->>SecurityPlugin: Check if action is allowed
SecurityPlugin->>OpenSearch: Continue request
OpenSearch-->>Cluster: Transport layer action result
Cluster->>Cluster: Restore user context
end
Cluster-->>SecurityPlugin: Result
SecurityPlugin-->>OpenSearch: Result
OpenSearch-->>Client: Result
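The "Plugin run outside user context" branch of the diagram can be sketched with a toy model. The Python below is not OpenSearch code: `ToyThreadContext`, `PERMISSIONS`, `is_allowed`, and the action name are all hypothetical, and the rule that a call without user information is treated as internal is an assumption made for illustration. In OpenSearch itself, `ThreadContext.stashContext()` returns a stored context that is typically closed in a try-with-resources block to restore the original one.

```python
from contextlib import contextmanager

# Hypothetical permission table: which user may invoke which transport action.
PERMISSIONS = {"alice": {"indices:data/write/bulk"}}


class ToyThreadContext:
    """Hypothetical stand-in for a per-thread request context."""
    def __init__(self):
        self.headers = {}

    @contextmanager
    def stash_context(self):
        """Temporarily replace the headers with an empty set, then restore them."""
        saved = self.headers
        self.headers = {}
        try:
            yield
        finally:
            self.headers = saved


def is_allowed(ctx, action):
    """Toy rule (assumption): calls without user info count as internal."""
    user = ctx.headers.get("user")
    if user is None:
        return True
    return action in PERMISSIONS.get(user, set())


ctx = ToyThreadContext()
ctx.headers = {"user": "alice"}
action = "toy:metrics/write"  # hypothetical action name

in_user_context = is_allowed(ctx, action)
with ctx.stash_context():
    in_stashed_context = is_allowed(ctx, action)
restored_headers = dict(ctx.headers)

print(in_user_context)     # False
print(in_stashed_context)  # True
print(restored_headers)    # {'user': 'alice'}
```

The `finally` clause matters: the user context is restored even if the stashed call raises, mirroring the "Restore user context" step in the diagram.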
When was the last version where this was working successfully? It seems like this has happened before and we are now reopening it as still an issue. Did it get fixed back when it was first reported and is now failing again? Or was it never fully fixed?
The last version where it worked was "Open Distro 1.13.1" (see #77). It stopped working after moving to the OpenSearch distribution.
Hello all, is there any update on this issue? Thanks
What is the bug? The ShardBulkDocs metric is always empty when calling the Performance Analyzer API. Other metrics are working. I am reopening this because the problem is still present in 2.6.0: https://github.com/opensearch-project/performance-analyzer/issues/77
How can one reproduce the bug? Steps to reproduce the behavior:
1. Start the cluster with docker compose (see enclosed docker-compose.txt).
2. Follow the steps to activate Performance Analyzer: https://opensearch.org/docs/latest/opensearch/install/docker/#optional-set-up-performance-analyzer
3. Create a test index (PUT test) and POST documents into it (see enclosed postdata.txt).
4. While pushing documents, call from your browser: http://localhost:9600/_plugins/_performanceanalyzer/metrics?metrics=ShardBulkDocs,ShardEvents&agg=sum,sum You'll get only ShardEvents numbers and no data for ShardBulkDocs:
{"CkXsbxtMTDisNx2ahHqrIw": {"timestamp": 1635425175000, "data": {"fields":[{"name":"ShardBulkDocs","type":"DOUBLE"},{"name":"ShardEvents","type":"DOUBLE"}],"records":[[0.0,27.0]]}}}
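For anyone re-running these steps, here is a small Python sketch that sums each metric column in a response like the one above, so an always-zero ShardBulkDocs is easy to spot. The function name is my own, and the response layout is assumed from this single sample; other responses may differ.

```python
import json

# Sample body copied from the response above.
SAMPLE = (
    '{"CkXsbxtMTDisNx2ahHqrIw": {"timestamp": 1635425175000, "data": '
    '{"fields":[{"name":"ShardBulkDocs","type":"DOUBLE"},'
    '{"name":"ShardEvents","type":"DOUBLE"}],"records":[[0.0,27.0]]}}}'
)


def metric_totals(body):
    """Sum each metric column over all nodes and records in a response body."""
    totals = {}
    for node in json.loads(body).values():
        data = node["data"]
        names = [field["name"] for field in data["fields"]]
        for record in data["records"]:
            for name, value in zip(names, record):
                # Skip null or non-numeric cells rather than failing.
                if isinstance(value, (int, float)):
                    totals[name] = totals.get(name, 0.0) + value
    return totals


totals = metric_totals(SAMPLE)
print(totals)  # {'ShardBulkDocs': 0.0, 'ShardEvents': 27.0}
```

To check a live node, the body could be fetched from the URL in step 4 with `urllib.request.urlopen(url).read()` and passed to `metric_totals`.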
What is the expected behavior? On the previous Open Distro 1.13.1, this call worked.
What is your host/environment?
Opensearch 2.6.0 Docker