Event Queue and Memcached unhealthy

greycel commented 7 months ago

Hi Team,

1) Event Queue Issue retrieving a large dataset We've been receiving about 2,25,000 alerts per day roughly from different log sources and last checked there were 7,45,000 alerts for the last 7 days. For the last 10-15day the event queue page has been taking time to load and sometimes not loading eventually throwing opensearchpy.exceptions.TransportError: TransportError(503, 'search_phase_execution_exception') below is the error log from the API service.

Current setup - AWS Managed Open-Search, Running all reflex services (API, UI, Memcached, Agents) on a single host, would you suggest running agents on a different host..?
Any guidance on optimizing the performance of reflex-soar and fine-tuning the handling of such event flow and load time would be very appreciated.

reflex-api  | 2024-03-19 15:10:46,704 - apscheduler.scheduler - WARNING - Execution of job "DetectionState.check_state (trigger: interval[0:00:10], next run at: 2024-03-19 15:10:46 UTC)" skipped: maximum number of running instances reached (1)
reflex-api  | 2024-03-19 15:11:06,703 - apscheduler.scheduler - WARNING - Execution of job "DetectionState.check_state (trigger: interval[0:00:10], next run at: 2024-03-19 15:11:06 UTC)" skipped: maximum number of running instances reached (1)
reflex-api  | 2024-03-19 15:11:09,692 - opensearch - WARNING - POST https://vpc-reflexsoar-2ilwlvjslfbp4il4dnta5b7twe.us-east-1.es.amazonaws.com:443/reflex-events/_search [status:503 request:13.398s]
reflex-api  | 2024-03-19 15:11:21,807 - opensearch - WARNING - POST https://vpc-reflexsoar-2ilwlvjslfbp4il4dnta5b7twe.us-east-1.es.amazonaws.com:443/reflex-events/_search [status:503 request:12.115s]
reflex-api  | 2024-03-19 15:11:26,703 - apscheduler.scheduler - WARNING - Execution of job "DetectionState.check_state (trigger: interval[0:00:10], next run at: 2024-03-19 15:11:26 UTC)" skipped: maximum number of running instances reached (1)
reflex-api  | 2024-03-19 15:11:35,095 - opensearch - WARNING - POST https://vpc-reflexsoar-2ilwlvjslfbp4il4dnta5b7twe.us-east-1.es.amazonaws.com:443/reflex-events/_search [status:503 request:13.287s]
reflex-api  | 2024-03-19 15:11:46,574 - opensearch - WARNING - POST https://vpc-reflexsoar-2ilwlvjslfbp4il4dnta5b7twe.us-east-1.es.amazonaws.com:443/reflex-events/_search [status:503 request:11.479s]
reflex-api  | [2024-03-19 15:11:46 +0000] [142] [ERROR] Exception on /api/v2.0/event [GET]
reflex-api  | Traceback (most recent call last):
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/flask/app.py", line 1523, in full_dispatch_request
reflex-api  |     rv = self.dispatch_request()
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/flask/app.py", line 1509, in dispatch_request
reflex-api  |     return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/flask_restx/api.py", line 404, in wrapper
reflex-api  |     resp = resource(*args, **kwargs)
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/flask/views.py", line 84, in view
reflex-api  |     return current_app.ensure_sync(self.dispatch_request)(*args, **kwargs)
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/flask_restx/resource.py", line 46, in dispatch_request
reflex-api  |     resp = meth(*args, **kwargs)
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/flask_restx/marshalling.py", line 244, in wrapper
reflex-api  |     resp = f(*args, **kwargs)
reflex-api  |   File "/app/api_v2/utils.py", line 223, in wrapper
reflex-api  |     return f(*args, **kwargs, current_user=current_user)
reflex-api  |   File "/app/api_v2/utils.py", line 354, in wrapper
reflex-api  |     return f(*args, **kwargs)
reflex-api  |   File "/app/api_v2/resource/event.py", line 366, in get
reflex-api  |     events = search.execute()
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/opensearch_dsl/search.py", line 721, in execute
reflex-api  |     opensearch.search(
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/opensearchpy/client/utils.py", line 177, in _wrapped
reflex-api  |     return func(*args, params=params, headers=headers, **kwargs)
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/opensearchpy/client/__init__.py", line 1593, in search
reflex-api  |     return self.transport.perform_request(
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/opensearchpy/transport.py", line 405, in perform_request
reflex-api  |     raise e
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/opensearchpy/transport.py", line 368, in perform_request
reflex-api  |     status, headers_response, data = connection.perform_request(
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/opensearchpy/connection/http_urllib3.py", line 275, in perform_request
reflex-api  |     self._raise_error(
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/opensearchpy/connection/base.py", line 300, in _raise_error
reflex-api  |     raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
reflex-api  | opensearchpy.exceptions.TransportError: TransportError(503, 'search_phase_execution_exception')
reflex-api  | 2024-03-19 15:20:16,703 - apscheduler.scheduler - WARNING - Execution of job "DetectionState.check_state (trigger: interval[0:00:10], next run at: 2024-03-19 15:20:16 UTC)" skipped: maximum number of running instances reached (1)
reflex-api  | 2024-03-19 15:27:07,039 - opensearch - WARNING - DELETE https://vpc-reflexsoar-2ilwlvjslfbp4il4dnta5b7twe.us-east-1.es.amazonaws.com:443/reflex-threat-values-0.1.4/_doc/nvv7Vo4BUN9Zj4dfz3MO [status:404 request:0.044s]
reflex-api  | 2024-03-19 15:27:07,040 - ThreatPoller - ERROR - An error occurred while trying to purge expired values. NotFoundError(404, '{"_index":"reflex-threat-values-0.1.4","_id":"nvv7Vo4BUN9Zj4dfz3MO","_version":3,"result":"not_found","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4885,"_primary_term":1}')
reflex-api  | 2024-03-19 15:27:07,040 - ThreatPoller - ERROR - An error occurred while trying to purge expired values. NotFoundError(404, '{"_index":"reflex-threat-values-0.1.4","_id":"nvv7Vo4BUN9Zj4dfz3MO","_version":3,"result":"not_found","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":4885,"_primary_term":1}')

2) After restarting all the reflex docker services, the Memcached docker container state is displaying as unhealthy and I've been receiving Memcached problems in the "reflex-api" service logs. service is reachable and connected when checked with "nc localhost 11211". I'm not sure which reflex components will be affected by this, need help.

reflex-api  | 2024-03-19 12:22:11,237 - opensearch - WARNING - POST https://vpc-reflexsoar-2ilwlvjslfbp4il4dnta5b7twe.us-east-1.es.amazonaws.com:443/reflex-expired-tokens/_search [status:N/A request:0.003s]
reflex-api  | Traceback (most recent call last):
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/urllib3/connectionpool.py", line 714, in urlopen
reflex-api  |     httplib_response = self._make_request(
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/urllib3/connectionpool.py", line 466, in _make_request
reflex-api  |     six.raise_from(e, None)
reflex-api  |   File "<string>", line 3, in raise_from
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/urllib3/connectionpool.py", line 461, in _make_request
reflex-api  |     httplib_response = conn.getresponse()
reflex-api  |   File "/usr/local/lib/python3.8/http/client.py", line 1322, in getresponse
reflex-api  |     response.begin()
reflex-api  |   File "/usr/local/lib/python3.8/http/client.py", line 303, in begin
reflex-api  |     version, status, reason = self._read_status()
reflex-api  |   File "/usr/local/lib/python3.8/http/client.py", line 272, in _read_status
reflex-api  |     raise RemoteDisconnected("Remote end closed connection without"
reflex-api  | http.client.RemoteDisconnected: Remote end closed connection without response
reflex-api  |
reflex-api  | During handling of the above exception, another exception occurred:
reflex-api  |
reflex-api  | Traceback (most recent call last):
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/opensearchpy/connection/http_urllib3.py", line 249, in perform_request
reflex-api  |     response = self.pool.urlopen(
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/urllib3/connectionpool.py", line 798, in urlopen
reflex-api  |     retries = retries.increment(
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/urllib3/util/retry.py", line 525, in increment
reflex-api  |     raise six.reraise(type(error), error, _stacktrace)
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/urllib3/packages/six.py", line 769, in reraise
reflex-api  |     raise value.with_traceback(tb)
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/urllib3/connectionpool.py", line 714, in urlopen
reflex-api  |     httplib_response = self._make_request(
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/urllib3/connectionpool.py", line 466, in _make_request
reflex-api  |     six.raise_from(e, None)
reflex-api  |   File "<string>", line 3, in raise_from
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/urllib3/connectionpool.py", line 461, in _make_request
reflex-api  |     httplib_response = conn.getresponse()
reflex-api  |   File "/usr/local/lib/python3.8/http/client.py", line 1322, in getresponse
reflex-api  |     response.begin()
reflex-api  |   File "/usr/local/lib/python3.8/http/client.py", line 303, in begin
reflex-api  |     version, status, reason = self._read_status()
reflex-api  |   File "/usr/local/lib/python3.8/http/client.py", line 272, in _read_status
reflex-api  |     raise RemoteDisconnected("Remote end closed connection without"
reflex-api  | urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
reflex-api  | 2024-03-19 12:22:11,304 - opensearch - WARNING - POST https://vpc-reflexsoar-2ilwlvjslfbp4il4dnta5b7twe.us-east-1.es.amazonaws.com:443/reflex-expired-tokens/_search [status:N/A request:0.003s]
reflex-api  | Traceback (most recent call last):
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/urllib3/connectionpool.py", line 714, in urlopen
reflex-api  |     httplib_response = self._make_request(
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/urllib3/connectionpool.py", line 466, in _make_request
reflex-api  |     six.raise_from(e, None)
reflex-api  |   File "<string>", line 3, in raise_from
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/urllib3/connectionpool.py", line 461, in _make_request
reflex-api  |     httplib_response = conn.getresponse()
reflex-api  |   File "/usr/local/lib/python3.8/http/client.py", line 1322, in getresponse
reflex-api  |     response.begin()
reflex-api  |   File "/usr/local/lib/python3.8/http/client.py", line 303, in begin
reflex-api  |     version, status, reason = self._read_status()
reflex-api  |   File "/usr/local/lib/python3.8/http/client.py", line 272, in _read_status
reflex-api  |     raise RemoteDisconnected("Remote end closed connection without"
reflex-api  | http.client.RemoteDisconnected: Remote end closed connection without response
reflex-api  |
reflex-api  | During handling of the above exception, another exception occurred:
reflex-api  |
reflex-api  | Traceback (most recent call last):
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/opensearchpy/connection/http_urllib3.py", line 249, in perform_request
reflex-api  |     response = self.pool.urlopen(
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/urllib3/connectionpool.py", line 798, in urlopen
reflex-api  |     retries = retries.increment(
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/urllib3/util/retry.py", line 525, in increment
reflex-api  |     raise six.reraise(type(error), error, _stacktrace)
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/urllib3/packages/six.py", line 769, in reraise
reflex-api  |     raise value.with_traceback(tb)
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/urllib3/connectionpool.py", line 714, in urlopen
reflex-api  |     httplib_response = self._make_request(
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/urllib3/connectionpool.py", line 466, in _make_request
reflex-api  |     six.raise_from(e, None)
reflex-api  |   File "<string>", line 3, in raise_from
reflex-api  |   File "/root/.local/share/virtualenvs/-x-v5uFv0/lib/python3.8/site-packages/urllib3/connectionpool.py", line 461, in _make_request
reflex-api  |     httplib_response = conn.getresponse()
reflex-api  |   File "/usr/local/lib/python3.8/http/client.py", line 1322, in getresponse
reflex-api  |     response.begin()
reflex-api  |   File "/usr/local/lib/python3.8/http/client.py", line 303, in begin
reflex-api  |     version, status, reason = self._read_status()
reflex-api  |   File "/usr/local/lib/python3.8/http/client.py", line 272, in _read_status
reflex-api  |     raise RemoteDisconnected("Remote end closed connection without"
reflex-api  | urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
reflex-api  | [2024-03-19 12:22:24 +0000] [138] [ERROR] Error checking memcached for event-processing-9d5d9c70-0261-4453-bc87-b078baf39a5c: timed out
reflex-api  | 2024-03-19 12:22:24,710 - app - ERROR - Error checking memcached for event-processing-9d5d9c70-0261-4453-bc87-b078baf39a5c: timed out
reflex-api  | [2024-03-19 12:22:24 +0000] [142] [ERROR] Error checking memcached for event-processing-e3840800-73b4-49c2-94ac-2e4d590c8f3d: timed out
reflex-api  | 2024-03-19 12:22:24,799 - app - ERROR - Error checking memcached for event-processing-e3840800-73b4-49c2-94ac-2e4d590c8f3d: timed out
reflex-api  | [2024-03-19 12:22:44 +0000] [139] [ERROR] Error checking memcached for event-processing-76099e0b-446b-451f-b614-36e0280c03c7: timed out
reflex-api  | 2024-03-19 12:22:44,582 - app - ERROR - Error checking memcached for event-processing-76099e0b-446b-451f-b614-36e0280c03c7: timed out
reflex-api  | [2024-03-19 12:22:44 +0000] [140] [ERROR] Error checking memcached for event-processing-dd4c4f65-3ed8-4342-9677-008cf156d215: timed out
reflex-api  | 2024-03-19 12:22:44,647 - app - ERROR - Error checking memcached for event-processing-dd4c4f65-3ed8-4342-9677-008cf156d215: timed out
reflex-api  | [2024-03-19 12:22:45 +0000] [143] [ERROR] Error checking memcached for event-processing-64dfb364-edb8-4aef-9fde-89395b06a536: timed out
reflex-api  | 2024-03-19 12:22:45,108 - app - ERROR - Error checking memcached for event-processing-64dfb364-edb8-4aef-9fde-89395b06a536: timed out
reflex-api  | [2024-03-19 12:22:45 +0000] [144] [ERROR] Error checking memcached for event-processing-fd9bc87f-97a6-4cb1-b03a-98197b7c921b: timed out
reflex-api  | 2024-03-19 12:22:45,243 - app - ERROR - Error checking memcached for event-processing-fd9bc87f-97a6-4cb1-b03a-98197b7c921b: timed out
reflex-api  | [2024-03-19 12:23:05 +0000] [137] [ERROR] Error checking memcached for event-processing-afca4f5b-c429-4172-bb03-405dbd7feb6a: timed out
reflex-api  | 2024-03-19 12:23:05,799 - app - ERROR - Error checking memcached for event-processing-afca4f5b-c429-4172-bb03-405dbd7feb6a: timed out
reflex-api  | 2024-03-19 12:23:06,704 - apscheduler.scheduler - WARNING - Execution of job "DetectionState.check_state (trigger: interval[0:00:10], next run at: 2024-03-19 12:23:06 UTC)" skipped: maximum number of running instances reached (1)
reflex-api  | 2024-03-19 12:36:46,703 - apscheduler.scheduler - WARNING - Execution of job "DetectionState.check_state (trigger: interval[0:00:10], next run at: 2024-03-19 12:36:46 UTC)" skipped: maximum number of running instances reached (1)
reflex-api  | 2024-03-19 12:54:16,703 - apscheduler.scheduler - WARNING - Execution of job "DetectionState.check_state (trigger: interval[0:00:10], next run at: 2024-03-19 12:54:16 UTC)" skipped: maximum number of running instances reached (1)
reflex-api  | 2024-03-19 12:57:06,705 - apscheduler.scheduler - WARNING - Execution of job "DetectionState.check_state (trigger: interval[0:00:10], next run at: 2024-03-19 12:57:06 UTC)" skipped: maximum number of running instances reached (1)
reflex-api  | 2024-03-19 13:15:16,704 - apscheduler.scheduler - WARNING - Execution of job "DetectionState.check_state (trigger: interval[0:00:10], next run at: 2024-03-19 13:15:16 UTC)" skipped: maximum number of running instances reached (1)
reflex-api  | 2024-03-19 13:17:06,953 - opensearch - WARNING - DELETE https://vpc-reflexsoar-2ilwlvjslfbp4il4dnta5b7twe.us-east-1.es.amazonaws.com:443/reflex-threat-values-0.1.4/_doc/V_CHVo4BUN9Zj4df375B [status:404 request:0.030s]
reflex-api  | 2024-03-19 13:17:06,953 - ThreatPoller - ERROR - An error occurred while trying to purge expired values. NotFoundError(404, '{"_index":"reflex-threat-values-0.1.4","_id":"V_CHVo4BUN9Zj4df375B","_version":3,"result":"not_found","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":5144,"_primary_term":1}')
reflex-api  | 2024-03-19 13:17:06,953 - ThreatPoller - ERROR - An error occurred while trying to purge expired values. NotFoundError(404, '{"_index":"reflex-threat-values-0.1.4","_id":"V_CHVo4BUN9Zj4df375B","_version":3,"result":"not_found","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":5144,"_primary_term":1}')
reflex-api  | 2024-03-19 13:19:06,703 - apscheduler.scheduler - WARNING - Execution of job "DetectionState.check_state (trigger: interval[0:00:10], next run at: 2024-03-19 13:19:06 UTC)" skipped: maximum number of running instances reached (1)
reflex-api  | 2024-03-19 13:25:16,703 - apscheduler.scheduler - WARNING - Execution of job "DetectionState.check_state (trigger: interval[0:00:10], next run at: 2024-03-19 13:25:16 UTC)" skipped: maximum number of running instances reached (1)
reflex-api  | 2024-03-19 13:25:36,703 - apscheduler.scheduler - WARNING - Execution of job "DetectionState.check_state (trigger: interval[0:00:10], next run at: 2024-03-19 13:25:36 UTC)" skipped: maximum number of running instances reached (1)
reflex-api  | 2024-03-19 13:35:36,703 - apscheduler.scheduler - WARNING - Execution of job "DetectionState.check_state (trigger: interval[0:00:10], next run at: 2024-03-19 13:35:36 UTC)" skipped: maximum number of running instances reached (1)
reflex-api  | 2024-03-19 13:51:56,704 - apscheduler.scheduler - WARNING - Execution of job "DetectionState.check_state (trigger: interval[0:00:10], next run at: 2024-03-19 13:51:56 UTC)" skipped: maximum number of running instances reached (1)
reflex-api  | 2024-03-19 13:57:06,703 - apscheduler.scheduler - WARNING - Execution of job "DetectionState.check_state (trigger: interval[0:00:10], next run at: 2024-03-19 13:57:06 UTC)" skipped: maximum number of running instances reached (1)
reflex-api  | 2024-03-19 14:10:36,704 - apscheduler.scheduler - WARNING - Execution of job "DetectionState.check_state (trigger: interval[0:00:10], next run at: 2024-03-19 14:10:36 UTC)" skipped: maximum number of running instances reached (1)

n3tsurge commented 7 months ago

Is that 225,000 events per day? Thats not overly large but what I think is happening is you aren't signaturing your events correctly resulting in a large number of signatures over your search window. Can you run the following via dev tools? If the total number of unique signatures is above 10000 you're not signaturing correctly and deduplicating your events. Non signaturing correctly will also allow duplicates of the same event to come in which can result in Event explosion. I would also recommend getting to the latest release.

GET reflex-events/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "status.name: New"
          }
        },
        {
          "range": {
            "created_at": {
              "gte": "now-7d"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "unique_sigs": {
      "cardinality": {
        "field": "signature"
      }
    }
  },
  "size": 0
}

greycel commented 7 months ago

Output for given query - unique signatures count "144521",

Signatures are used for deduplicating events and events get loaded properly if the number of signatures in a search window is lower than 10000. Is my understanding correct..

Signature fields configured for one of the organization's input (Windows) "host_name, user_domain, user_name, process_name, process_executable, event_id, event_channel, event_category"

Since Default Org's admin account can view and load events from multiple organizations, is there any change where the number of signatures can go above 10000 and fail to load when quering events from default Org's admin login (even after its correctly singnatured)..?

n3tsurge commented 7 months ago

There is no way to change this in Reflex, it is an opensearch/elasticsearch default max bucket count setting but you could change it in the index settings manually. Depending on the version you are on, we did rework some of the code related to this that might solve the issue, I recommend upgrading and seeing if the issue persists. As a note, we have yet to experience this issue internally and we have 40+ sub-tenants and default org can see them all just fine.

JustinHendersonSMAPPER commented 7 months ago

I'm going to second @n3tsurge comment. You are likely encountering a bug that is fixed in a newer version. For example, we have 301,328 unique signatures in the last 7 days and we can view events across all clients from the Default Organization page.

reflexsoar / reflex-api

Event Queue and Memcached unhealthy #480