splunk / splunk-operator

Splunk Operator for Kubernetes

Splunk Operator: Cluster Manager dns resolution issue after restart #1168

Closed logsecvuln closed 1 year ago

logsecvuln commented 1 year ago

Please select the type of request

Bug

Tell us more

Describe the request

The indexers appear to struggle to find the manager node after each daemon restart. The expected behavior is that the configuration is tuned so Kubernetes dynamically resolves and redirects connections to the right node (in this case, from the peers to the manager node), so that fixup activities are not delayed. A quick resolution check from a peer pod is sketched below.
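As a minimal sketch (the `splunk` namespace is an assumption; the pod and service names are taken from the error logs further down), this is roughly how to confirm whether the cluster manager service name resolves from one of the indexer pods after a restart:

```sh
# Namespace "splunk" is an assumption; pod and service names come from the error logs below.
# getent is used because nslookup/dig may not be present in the Splunk container image.
kubectl -n splunk exec -it splunk-site4-prod-indexer-2 -- \
  getent hosts splunk-prod-cluster-manager-service

# Confirm the cluster manager Service and its endpoints still exist after the restart
kubectl -n splunk get svc splunk-prod-cluster-manager-service
kubectl -n splunk get endpoints splunk-prod-cluster-manager-service
```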

Splunk setup on K8S

Reproduction/Testing steps

K8s environment

Proposed changes(optional)

K8s collector data(optional)

Additional context(optional)

Some of the relevant internal logs:

05-25-2023 15:59:23.309 +0000 ERROR CMSearchHead [5476 GenerationGrabberThread] - 'failed method=POST path=/services/cluster/master/generation/16922772-EBA3-444C-BA7D-7FABEDE80417/?output_mode=json manager=splunk-prod-cluster-manager-service:8089 rv=0 gotConnectionError=1 gotUnexpectedStatusCode=0 actual_response_code=502 expected_response_code=2xx status_line="Error resolving: Name or service not known" socket_error="Cannot resolve hostname" remote_error=' for manager=https:XXXX:8089

05-26-2023 09:37:07.453 +0000 ERROR CMSlave [229233 CMNotifyThread] - sendQueuedRemoveSummaryS2ToMaster err=failed method=POST path=/services/cluster/master/control/control//remove_summary_s2/?output_mode=json manager=manager:8089 rv=0 gotConnectionError=1 gotUnexpectedStatusCode=0 actual_response_code=502 expected_response_code=2xx status_line="Error resolving: Name or service not known" socket_error="Cannot resolve hostname" remote_error=

05-26-2023 09:32:39.405 +0000 ERROR CMSearchHead [4631 GenerationGrabberThread] - 'failed method=POST path=/services/cluster/master/generation/112A4004-C61B-4EEC-94B1-E9F50454CD5D/?output_mode=json manager=manager:8089 rv=0 gotConnectionError=1 gotUnexpectedStatusCode=0 actual_response_code=502 expected_response_code=2xx status_line="Error resolving: Name or service not known" socket_error="Cannot resolve hostname" remote_error=' for manager=https://manager:8089

Failed to add peer 'guid=D0B608AA-AEB5-4DB6-A0C5-716F636E4ECE ip=ipaddress:8089 server name=indexer01' to the master. Error=non-zero transient-jobs=1, guid=D0B608AA-AEB5-4DB6-A0C5-716F636E4ECE, pending-jobs=1.

Failed to add peer 'guid=5C267A6C-41F2-416C-9DB7-63D8368597F4 ip=ip:8089 server name=ndexer-2' to the master. Error=non-zero transient-jobs=1, guid=5C267A6C-41F2-416C-9DB7-63D8368597F4, pending-jobs=1.

Search peer splunk-site4-prod-indexer-2 has the following message: Failed to register with cluster master reason: failed method=POST path=/services/cluster/master/peers/?output_mode=json manager=cluster-manager-service:8089 rv=0 gotConnectionError=0 gotUnexpectedStatusCode=1 actual_response_code=500 expected_response_code=2xx status_line="Internal Server Error" socket_error="No error" remote_error=Cannot add peer=ip mgmtport=8089 (reason: non-zero transient-jobs=1, guid=5C267A6C-41F2-416C-9DB7-63D8368597F4, pending-jobs=1). [ event=addPeer status=retrying AddPeerRequest: { active_bundle_id=70328DAA0F7A6437DC9D9A3488C87F5F add_type=ReAdd-As-Is base_generation_id=4780620 batch_serialno=1 batch_size=259 forwarderdata_rcv_port=9997 forwarderdata_use_ssl=0 guid=5C267A6C-41F2-416C-9DB7-63D8368597F4 last_complete_generation_id=5556767 latest_bundle_id=70328DAA0F7A6437DC9D9A3488C87F5F mgmt_port=8089 register_forwarder_address= register_replication_address= register_search_address= replication_port=9887 replication_use_ssl=0 replications= server_name=indexeexer-2 site=site4 splunk_version=9.0.4.1 splunkd_build_number=419ad9369127 status=Up } Batch 1/259 ].
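Purely as an illustrative sketch (not part of the original report; the namespace, the default /opt/splunk install path, and the availability of curl inside the image are assumptions), the same failure can be reproduced or ruled out from inside a peer, and the configured manager URI verified:

```sh
# The curl should print an HTTP status (401 without credentials is fine); a
# "Could not resolve host" error matches the failures in the logs above.
# Assumes curl is present in the Splunk container image.
kubectl -n splunk exec -it splunk-site4-prod-indexer-2 -- \
  curl -sk -o /dev/null -w '%{http_code}\n' \
  https://splunk-prod-cluster-manager-service:8089/services/server/info

# Show which manager URI the peer is actually configured with (default /opt/splunk path assumed)
kubectl -n splunk exec -it splunk-site4-prod-indexer-2 -- \
  /opt/splunk/bin/splunk btool server list clustering --debug | grep -iE 'manager_uri|master_uri'
```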

marcispauls commented 1 year ago

The issue might be that you do not have enough CoreDNS pods for a cluster this size, or the EC2 instance type hosting CoreDNS is hitting its network packets-per-second limit (see https://repost.aws/knowledge-center/ec2-instance-network-pps-limit). Check those limits; you may need to change the instance type and increase the CoreDNS pod count.
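For reference, a rough sketch of the checks that implies (the `coredns` Deployment name and the `k8s-app=kube-dns` label match stock EKS/kubeadm clusters; treat them as assumptions on other distributions):

```sh
# See how many CoreDNS replicas exist and which nodes they are scheduled on
kubectl -n kube-system get deployment coredns
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# Scale CoreDNS up if the replica count is small for the cluster size
kubectl -n kube-system scale deployment coredns --replicas=4

# On the EC2 node(s) hosting CoreDNS, ENA driver statistics show whether the
# instance is exceeding its packets-per-second allowance:
#   ethtool -S eth0 | grep allowance_exceeded
```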

logsecvuln commented 1 year ago

Thanks for the reply. That has resolved the issue.

logsecvuln commented 1 year ago

We still couldn't get the issue fixed, as Splunk is still struggling with the same errors.