splunk / splunk-operator

Splunk Operator for Kubernetes

Splunk Operator: Cluster Manager dns resolution issue after restart #1168

Closed logsecvuln closed 1 year ago

logsecvuln commented 1 year ago

Please select the type of request

Bug

Tell us more

Describe the request

The indexers appear to struggle to find the manager node after each daemon restart. The expected behavior is that the configuration is tuned so Kubernetes dynamically resolves and redirects connections to the right node (in this case, from the peers to the manager node), so that fixup activities are not delayed. A quick resolution check from a peer pod is sketched below.
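As a minimal sketch (the `splunk` namespace is an assumption; the pod and service names are taken from the error logs further down), this is roughly how to confirm whether the cluster manager service name resolves from one of the indexer pods after a restart:

```sh
# Namespace "splunk" is an assumption; pod and service names come from the error logs below.
# getent is used because nslookup/dig may not be present in the Splunk container image.
kubectl -n splunk exec -it splunk-site4-prod-indexer-2 -- \
  getent hosts splunk-prod-cluster-manager-service

# Confirm the cluster manager Service and its endpoints still exist after the restart
kubectl -n splunk get svc splunk-prod-cluster-manager-service
kubectl -n splunk get endpoints splunk-prod-cluster-manager-service
```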

Splunk setup on K8S

Reproduction/Testing steps

K8s environment

Proposed changes(optional)

K8s collector data(optional)

Additional context(optional)

Some of the relevant internal logs:

05-25-2023 15:59:23.309 +0000 ERROR CMSearchHead [5476 GenerationGrabberThread] - 'failed method=POST path=/services/cluster/master/generation/16922772-EBA3-444C-BA7D-7FABEDE80417/?output_mode=json manager=splunk-prod-cluster-manager-service:8089 rv=0 gotConnectionError=1 gotUnexpectedStatusCode=0 actual_response_code=502 expected_response_code=2xx status_line="Error resolving: Name or service not known" socket_error="Cannot resolve hostname" remote_error=' for manager=https:XXXX:8089

05-26-2023 09:37:07.453 +0000 ERROR CMSlave [229233 CMNotifyThread] - sendQueuedRemoveSummaryS2ToMaster err=failed method=POST path=/services/cluster/master/control/control//remove_summary_s2/?output_mode=json manager=manager:8089 rv=0 gotConnectionError=1 gotUnexpectedStatusCode=0 actual_response_code=502 expected_response_code=2xx status_line="Error resolving: Name or service not known" socket_error="Cannot resolve hostname" remote_error=

05-26-2023 09:32:39.405 +0000 ERROR CMSearchHead [4631 GenerationGrabberThread] - 'failed method=POST path=/services/cluster/master/generation/112A4004-C61B-4EEC-94B1-E9F50454CD5D/?output_mode=json manager=manager:8089 rv=0 gotConnectionError=1 gotUnexpectedStatusCode=0 actual_response_code=502 expected_response_code=2xx status_line="Error resolving: Name or service not known" socket_error="Cannot resolve hostname" remote_error=' for manager=https://manager:8089

Failed to add peer 'guid=D0B608AA-AEB5-4DB6-A0C5-716F636E4ECE ip=ipaddress:8089 server name=indexer01' to the master. Error=non-zero transient-jobs=1, guid=D0B608AA-AEB5-4DB6-A0C5-716F636E4ECE, pending-jobs=1.

Failed to add peer 'guid=5C267A6C-41F2-416C-9DB7-63D8368597F4 ip=ip:8089 server name=ndexer-2' to the master. Error=non-zero transient-jobs=1, guid=5C267A6C-41F2-416C-9DB7-63D8368597F4, pending-jobs=1.

Search peer splunk-site4-prod-indexer-2 has the following message: Failed to register with cluster master reason: failed method=POST path=/services/cluster/master/peers/?output_mode=json manager=cluster-manager-service:8089 rv=0 gotConnectionError=0 gotUnexpectedStatusCode=1 actual_response_code=500 expected_response_code=2xx status_line="Internal Server Error" socket_error="No error" remote_error=Cannot add peer=ip mgmtport=8089 (reason: non-zero transient-jobs=1, guid=5C267A6C-41F2-416C-9DB7-63D8368597F4, pending-jobs=1). [ event=addPeer status=retrying AddPeerRequest: { active_bundle_id=70328DAA0F7A6437DC9D9A3488C87F5F add_type=ReAdd-As-Is base_generation_id=4780620 batch_serialno=1 batch_size=259 forwarderdata_rcv_port=9997 forwarderdata_use_ssl=0 guid=5C267A6C-41F2-416C-9DB7-63D8368597F4 last_complete_generation_id=5556767 latest_bundle_id=70328DAA0F7A6437DC9D9A3488C87F5F mgmt_port=8089 register_forwarder_address= register_replication_address= register_search_address= replication_port=9887 replication_use_ssl=0 replications= server_name=indexeexer-2 site=site4 splunk_version=9.0.4.1 splunkd_build_number=419ad9369127 status=Up } Batch 1/259 ].
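Purely as an illustrative sketch (not part of the original report; the namespace, the default /opt/splunk install path, and the availability of curl inside the image are assumptions), the same failure can be reproduced or ruled out from inside a peer, and the configured manager URI verified:

```sh
# The curl should print an HTTP status (401 without credentials is fine); a
# "Could not resolve host" error matches the failures in the logs above.
# Assumes curl is present in the Splunk container image.
kubectl -n splunk exec -it splunk-site4-prod-indexer-2 -- \
  curl -sk -o /dev/null -w '%{http_code}\n' \
  https://splunk-prod-cluster-manager-service:8089/services/server/info

# Show which manager URI the peer is actually configured with (default /opt/splunk path assumed)
kubectl -n splunk exec -it splunk-site4-prod-indexer-2 -- \
  /opt/splunk/bin/splunk btool server list clustering --debug | grep -iE 'manager_uri|master_uri'
```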

marcispauls commented 1 year ago

The issue might be that you do not have enough CoreDNS pods for a cluster this size, or the EC2 instance type hosting CoreDNS is hitting its network packets-per-second limit (see https://repost.aws/knowledge-center/ec2-instance-network-pps-limit). Check those limits; you may need to change the instance type and increase the CoreDNS pod count.
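For reference, a rough sketch of the checks that implies (the `coredns` Deployment name and the `k8s-app=kube-dns` label match stock EKS/kubeadm clusters; treat them as assumptions on other distributions):

```sh
# See how many CoreDNS replicas exist and which nodes they are scheduled on
kubectl -n kube-system get deployment coredns
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# Scale CoreDNS up if the replica count is small for the cluster size
kubectl -n kube-system scale deployment coredns --replicas=4

# On the EC2 node(s) hosting CoreDNS, ENA driver statistics show whether the
# instance is exceeding its packets-per-second allowance:
#   ethtool -S eth0 | grep allowance_exceeded
```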

logsecvuln commented 1 year ago

Thanks for the reply. That has resolved the issue.

logsecvuln commented 1 year ago

We still couldn't get the issue fixed, as Splunk is still struggling with the same errors.