splunk / splunk-operator

Splunk Operator for Kubernetes

Splunk Operator: slow mounting of ebs volume hence pod is keeping "container creating" state for too long #1288

Closed yaroslav-nakonechnikov closed 7 months ago

yaroslav-nakonechnikov commented 9 months ago

Please select the type of request

Enhancement

Tell us more

Describe the request: We are using quite large EBS volumes (10 TB+) for indexers, and sometimes a pod has to move to a different node. We found that mounting the EBS volume and starting the pod takes too long: in our case, 70 minutes just to start the pod after it is assigned to a node.

After investigation, we found that Kubernetes by default forces a permission and ownership change on the volume each time it is mounted (ref: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods), and that takes a lot of time.

Expected behavior: The documentation mentions this problem, with some examples of how to solve it, and the CRD has a default value of fsGroupChangePolicy = "OnRootMismatch".
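To illustrate why the two policies differ so much on a large volume, here is a minimal Python sketch. It is a simplified model, not the kubelet's implementation: `paths_to_chown` and `make_tree` are hypothetical helpers, the check compares only the group of the volume root (the real check also looks at permission bits), and nothing is actually chowned. With "Always", every file and directory is visited on each mount; with "OnRootMismatch", the walk is skipped when the root already matches.

```python
import os
import tempfile


def paths_to_chown(root, fs_group, policy):
    """Return the paths a recursive ownership change would visit.

    Simplified model: only the group of the root directory is compared.
    """
    if policy == "OnRootMismatch" and os.stat(root).st_gid == fs_group:
        return []  # root already matches: skip the recursive walk entirely
    visited = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        visited.extend(os.path.join(dirpath, n) for n in dirnames + filenames)
    return visited


def make_tree(n_files):
    """Create a throwaway directory tree with n_files files in 10 subdirectories."""
    root = tempfile.mkdtemp()
    for i in range(n_files):
        sub = os.path.join(root, f"bucket{i % 10}")
        os.makedirs(sub, exist_ok=True)
        open(os.path.join(sub, f"f{i}"), "w").close()
    return root


root = make_tree(100)
gid = os.stat(root).st_gid  # pretend this is the pod's fsGroup
print(len(paths_to_chown(root, gid, "Always")))          # 111 (1 root + 10 dirs + 100 files)
print(len(paths_to_chown(root, gid, "OnRootMismatch")))  # 0
```

On a 10 TB+ indexer volume the number of visited paths is far larger, which is consistent with the long "container creating" time described above.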

vivekr-splunk commented 9 months ago

@yaroslav-nakonechnikov just wanted to check what version of splunk operator you are using

yaroslav-nakonechnikov commented 9 months ago

@vivekr-splunk The CRD hasn't changed a lot since the beginning, but I'd say 2.4 and 2.5 don't have that feature.

logsecvuln commented 9 months ago

@vivekr-splunk @akondur Splunk support ticket has also been raised for that matter. Please refer to the following case number "CASE [3423864]".

akondur commented 9 months ago

Hi @yaroslav-nakonechnikov , is the request here to change the fsGroupChangePolicy to OnRootMismatch?

yaroslav-nakonechnikov commented 9 months ago

The request is to add support for it and to inform users about potential issues with big volumes.

As a result, it could become the default, since from my perspective it doesn't look necessary to change permissions on each mount.

akondur commented 9 months ago

@yaroslav-nakonechnikov , have you tried changing the fsGroupChangePolicy to OnRootMismatch and checking whether that fixes the issue in your environment? This can be done by manually disabling the operator (temporarily) and testing it on one of your Splunk instances. We are currently evaluating the option on our end.

yaroslav-nakonechnikov commented 9 months ago

@akondur How? Any change to the statefulset/pod leads to it being recreated, and the CRD doesn't have that option.

akondur commented 9 months ago

@yaroslav-nakonechnikov You could create a simple Splunk statefulSet that attaches to EBS volumes and try reproducing the issue, after which you can change the policy to see whether the behavior changes. Alternatively, before changing nodes for the pods, you could delete the operator temporarily and edit the statefulSet.
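A rough sketch of what such a statefulSet edit could look like, written in Python. It only builds a merge-patch body and prints a `kubectl patch` command; the statefulSet name `splunk-example-indexer` and the helper `fs_group_policy_patch` are hypothetical. It assumes the operator is not running while the patch is applied (otherwise the change may be reverted) and that the pod's securityContext already sets fsGroup, since the policy has no effect without it.

```python
import json


def fs_group_policy_patch(policy="OnRootMismatch"):
    """Build a merge-patch body that sets the pod-level fsGroupChangePolicy."""
    if policy not in ("OnRootMismatch", "Always"):
        raise ValueError(f"unsupported fsGroupChangePolicy: {policy}")
    return {
        "spec": {
            "template": {
                "spec": {
                    "securityContext": {"fsGroupChangePolicy": policy}
                }
            }
        }
    }


patch = json.dumps(fs_group_policy_patch())
# "splunk-example-indexer" is a placeholder statefulSet name.
print(f"kubectl patch statefulset splunk-example-indexer --type merge -p '{patch}'")
```

Changing the pod template this way triggers a rollout of the statefulSet's pods, so it is best tried on a single test instance first.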

yaroslav-nakonechnikov commented 9 months ago

@akondur In that case, why can't you recheck it yourselves, if you already know what to recheck and how?

I reported the problem as a customer. Now it is your turn to take it from here and reproduce it. Honestly, I don't understand why I have to spin up another cluster with another 11 TB of disks and fill it all with dummy data. Will you pay for it?

vivekr-splunk commented 8 months ago

Hello @yaroslav-nakonechnikov, thank you for investigating this issue and identifying a possible solution. We will replicate the problem on our end and test whether your fix resolves it. We will get back to you on this soon.

akondur commented 8 months ago

Hey @yaroslav-nakonechnikov , we have merged the change to update the fsGroupChangePolicy. Please let us know if the issue still persists and we can re-visit the issue.

yaroslav-nakonechnikov commented 8 months ago

@akondur This is good. So now we need to wait until it is released.

For now I don't know how to check it, given that 2.5.0 and 2.5.1 are also not working as expected.

akondur commented 8 months ago

@yaroslav-nakonechnikov We have reverted the change as we are going to release 2.5.2 this week. Will re-introduce it right after in develop. If this change is needed soon - we will make another minor release. Will update the PR here as soon as it's ready.

akondur commented 8 months ago

Hey @yaroslav-nakonechnikov , please find the merged MR into develop here. Please let me know if you're still facing issues with this change.

akondur commented 7 months ago

Closing this issue per the MR. Please re-open it if the issue still persists.

yaroslav-nakonechnikov commented 7 months ago

How can it be closed if it is not released yet?

yaroslav-nakonechnikov commented 6 days ago

all good, it is there.