splunk / splunk-operator

Splunk Operator for Kubernetes

Splunk Operator: Change the readiness probe for search head clusters to not show instances that are in manual detention as ready #1322

Open gjanders opened 2 months ago

gjanders commented 2 months ago

Please select the type of request

Enhancement

Tell us more

Describe the request Currently the readiness probe used in a Splunk search head cluster only tests whether port 8089 is up: if it is, the instance is "ready"; if not, it is not ready. I'd like this to be customized further so that nodes in manual (or automatic) detention are not reported as ready.

Expected behavior The probe should check the status of the member, for example it could hit the endpoint https://localhost:8089/services/shcluster/member/ready and a response without errors would be considered successful.

A response such as:

<?xml version="1.0" encoding="UTF-8"?>
<!--This is to override browser formatting; see server.conf[httpServer] to disable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .-->
<?xml-stylesheet type="text/xml" href="/static/atom.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:s="http://dev.splunk.com/ns/rest" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <title>shclusterready</title>
  <id>https://localhost:8089/services/shcluster/member/ready</id>
  <updated>2024-04-13T15:50:05+10:00</updated>
  <generator build="d95b3299fa65" version="9.1.3"/>
  <author>
    <name>Splunk</name>
  </author>
  <opensearch:totalResults>0</opensearch:totalResults>
  <opensearch:itemsPerPage>30</opensearch:itemsPerPage>
  <opensearch:startIndex>0</opensearch:startIndex>
  <s:messages/>
</feed>

Would count as successful, meaning the search head is ready for traffic. A response such as:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <messages>
    <msg type="ERROR">Search Head is in detention</msg>
  </messages>
</response>

Would result in that search head not receiving new traffic.
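To make the distinction between the two responses concrete, here is a minimal sketch in Python of how a probe could classify a response body. It is only an illustration, not how the operator works today: the function name `is_member_ready` and the rule "any `msg` element with `type="ERROR"` means not ready" are assumptions based on the two samples above, and the feed sample is trimmed. Fetching the body from the endpoint (authentication, TLS, HTTP status handling) is left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two sample responses shown in this issue.
READY_BODY = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:s="http://dev.splunk.com/ns/rest">
  <title>shclusterready</title>
  <s:messages/>
</feed>"""

DETENTION_BODY = """<?xml version="1.0" encoding="UTF-8"?>
<response>
  <messages>
    <msg type="ERROR">Search Head is in detention</msg>
  </messages>
</response>"""


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}msg' down to 'msg'."""
    return tag.rsplit("}", 1)[-1]


def is_member_ready(body):
    """Hypothetical check: ready unless the body contains an ERROR message.

    A body that cannot be parsed as XML is treated as not ready.
    """
    data = body.encode("utf-8") if isinstance(body, str) else body
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return False
    for elem in root.iter():
        if local_name(elem.tag) == "msg" and (elem.get("type") or "").upper() == "ERROR":
            return False
    return True


if __name__ == "__main__":
    print("feed response ready:", is_member_ready(READY_BODY))
    print("detention response ready:", is_member_ready(DETENTION_BODY))
```

A real probe would more likely be a shell or exec check inside the container; the point here is only that the body, not just the open port, decides readiness.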

Ideally this would be a switch/parameter in case someone wants to send traffic to members in detention.

Splunk setup on K8S Splunk search head clusters will have this feature, and only search head clusters...

Reproduction/Testing steps Any search head cluster has this feature; you can manually put a node in detention as described in the Splunk documentation topic "Put a search head cluster member into detention".

K8s environment N/A

Proposed changes(optional) Provide either a flag or a new default so that, for the SHC CRD, the readiness probe checks the search head status and members in manual detention are considered "not ready".

K8s collector data(optional) N/A

Additional context(optional) I've raised the related issue https://github.com/splunk/splunk-operator/issues/1321

yaroslav-nakonechnikov commented 2 months ago

agree, it adds issues.

we set the startup probe timeout to 5 mins - it postpones the checks until just after the search heads are online, so IPs are assigned and the deployer can work with them. but failureThreshold is set to a really high number (50+, depending on the check period), so it allows the deployer to finish all its tasks.

this has a downside: there will be logs about a non-working deployer, but we ignore them and check only the restart reasons.
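As a rough illustration of the timing described above: in Kubernetes, a probe tolerates roughly `failureThreshold × periodSeconds` of consecutive failures before it gives up. The sketch below just does that arithmetic. The function name and the 10-second period are assumptions for the example; only the "50+" threshold comes from this comment.

```python
def probe_failure_budget_seconds(period_seconds, failure_threshold, initial_delay_seconds=0):
    """Approximate time a probe may keep failing before Kubernetes acts on it.

    Ignores per-attempt timeouts; this is a back-of-the-envelope estimate.
    """
    return initial_delay_seconds + period_seconds * failure_threshold


if __name__ == "__main__":
    # Assumed values: a 10 s period with failureThreshold 50.
    print(probe_failure_budget_seconds(period_seconds=10, failure_threshold=50))
```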

gjanders commented 2 months ago

Now logged as CSPL-2594