splunk / splunk-operator

Splunk Operator for Kubernetes

App Framework: Add option to use path style s3 URLs #1291

Closed paheath closed 3 months ago

paheath commented 9 months ago

Please select the type of request

Enhancement

Tell us more

Describe the request: I am deploying the operator in an on-prem environment with a storage solution that only supports path-style S3 URLs. As far as I can tell, the operator defaults to virtual-hosted-style S3 URLs when downloading apps. I propose keeping the current behavior as the default and adding an option in the AppFramework spec to explicitly switch the S3 URLs to path style. I rebuilt the operator with S3ForcePathStyle: aws.Bool(true) added here, and the app framework worked as expected.

Smartstore offers a similar option to specify the url version, and defaults to path style. See remote.s3.url_version here.

Expected behavior: Force the S3 client to use path-style URLs when downloading apps whenever this is enabled in the AppFramework spec.
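To make the two addressing styles concrete, here is a minimal Go sketch using only the standard library. It is not the operator's code or the AWS SDK's implementation; the function name `objectURL`, the endpoint `https://nas.example.internal`, and the bucket name `apps` are hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// objectURL builds the URL of an object on an S3-compatible endpoint.
// With pathStyle the bucket becomes the first path segment; otherwise the
// bucket is prepended to the endpoint host (virtual-hosted style).
// Hypothetical helper, not part of the operator or the AWS SDK.
func objectURL(endpoint, bucket, key string, pathStyle bool) (string, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", err
	}
	if u.Host == "" {
		return "", fmt.Errorf("endpoint %q has no host (missing scheme?)", endpoint)
	}
	key = strings.TrimPrefix(key, "/")
	if pathStyle {
		u.Path = "/" + bucket + "/" + key
	} else {
		u.Host = bucket + "." + u.Host
		u.Path = "/" + key
	}
	return u.String(), nil
}

func main() {
	// Endpoint, bucket, and key are illustrative values only.
	for _, pathStyle := range []bool{false, true} {
		s, err := objectURL("https://nas.example.internal", "apps", "node/myapp.tgz", pathStyle)
		if err != nil {
			panic(err)
		}
		fmt.Printf("pathStyle=%v -> %s\n", pathStyle, s)
	}
}
```

With these inputs the virtual-hosted form is `https://apps.nas.example.internal/node/myapp.tgz` and the path-style form is `https://nas.example.internal/apps/node/myapp.tgz`. A storage system that does not resolve per-bucket hostnames can only serve the second form, which is the situation described in this request.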

Splunk setup on K8S: SearchHeadCluster, IndexerCluster, ClusterManager, LicenseManager, MonitoringConsole, and Standalone heavy forwarder.

Reproduction/Testing steps: Enable path-style S3 URLs via the AppFramework spec. Verify that apps are correctly downloaded and installed.

K8s environment: On-prem k8s cluster with on-prem S3-compatible NAS.

yaroslav-nakonechnikov commented 9 months ago

I guess this is related: https://github.com/splunk/splunk-operator/issues/1030#issuecomment-1429444280

vivekr-splunk commented 7 months ago

Hello @yaroslav-nakonechnikov @paheath we will work on this change and get back to you

akondur commented 6 months ago

Hello @paheath, we are exploring possible solutions for path-style S3 URLs. Meanwhile, can you please provide an example of the appFramework configuration that works (with the modified Splunk operator image) for path-style URLs?

Also, path style URLs will be discontinued per AWS documentation.

Currently, Amazon S3 supports both virtual-hosted–style and path-style URL access in all AWS Regions. However, path-style URLs will be discontinued in the future. For more information, see the following Important note.

paheath commented 6 months ago

This is an excerpt from my helm chart, and the underlying operator image is modified as indicated in the original bug description. I don't think any of the value substitutions necessarily impact the functionality. I've defined it in the yaml as documented here https://splunk.github.io/splunk-operator/AppFramework.html

appRepo:
  appsRepoPollIntervalSeconds: {{ .Values.configPollInterval }}
  defaults:
    volumeName: {{ .Values.volumeName }}
  appSources:
  - name: node
    location: node/
    scope: local
  volumes:
  - name: {{ .Values.volumeName }}
    storageType: s3
    path: {{ .Values.bucketPath }}/
    provider: aws
    region: {{ .Values.bucketRegion }}
    endpoint: {{ .Values.bucketEndpoint }}
    secretRef: {{ .Values.secretRef }}

akondur commented 6 months ago

Hi @paheath, thanks for the example above. To test our solution further, can you let us know which storage provider you are using to test path-style S3 URLs? By default, AWS S3 buckets currently support both path-style and virtual-hosted-style access. I was able to test path style specifically on AWS S3 buckets.

paheath commented 6 months ago

I'm testing against an on-prem s3-compatible NAS. I think testing against any s3-compatible storage might be sufficient, as long as you can confirm the outbound request is hitting the path-style endpoint when configured to do so. Maybe even locally block outbound traffic to the virtual endpoint. Testing might be similar to how the smartstore path-style config is tested.
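One way to check which addressing style an outbound request uses is to point the client at a local stub server and inspect the Host header and path it receives. The following is a minimal sketch of that idea with Go's standard library; `addressingStyle` and the bucket name `apps` are hypothetical, and the plain `http.Get` call stands in for whatever S3 client is under test.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// addressingStyle classifies a received request: "virtual-hosted" when the
// bucket is the leftmost label of the Host header, "path" when the bucket is
// the first path segment, and "unknown" otherwise.
// Hypothetical helper for illustration.
func addressingStyle(host, path, bucket string) string {
	if strings.HasPrefix(host, bucket+".") {
		return "virtual-hosted"
	}
	if path == "/"+bucket || strings.HasPrefix(path, "/"+bucket+"/") {
		return "path"
	}
	return "unknown"
}

func main() {
	const bucket = "apps" // illustrative bucket name

	// Stub endpoint that records how each request addressed the bucket.
	seen := make(chan string, 1)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen <- addressingStyle(r.Host, r.URL.Path, bucket)
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	// Stand-in for the client under test: a path-style GET.
	resp, err := http.Get(srv.URL + "/" + bucket + "/node/myapp.tgz")
	if err != nil {
		panic(err)
	}
	resp.Body.Close()

	fmt.Println("observed style:", <-seen)
}
```

In a real test the stub's URL would be supplied as the volume endpoint, and the assertion would be that every recorded request is classified as "path" when the option is enabled.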

akondur commented 6 months ago

@paheath Are you able to test the changes in the MR to see if it's working before we merge? If something is missing, please comment on the MR or here and it will be fixed.

akondur commented 5 months ago

@paheath Please let us know if this solution works so we can merge it.

paheath commented 5 months ago

Unfortunately I can't get this change to work. I'm seeing my clustermanager instance reporting Ready, but all the apps in the description report this:

        appDeploymentInfo:
        - appName: myapp.tgz
          appPackageTopFolder: ""
          deployStatus: 1
          isUpdate: false
          objectHash: <hash>
          phaseInfo:
            failCount: 3
            phase: download
            status: 199
          repoState: 1

and the associated indexer cluster never reconciles. I don't see the apps appear in the pod under /opt/splunk/etc/apps or /opt/splunk/etc/manager-apps

akondur commented 5 months ago

Hey @paheath , can you share any Splunk Operator pod logs indicating any errors?

The CR status code 199 indicates that the app package was not downloaded properly.
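As a rough aid for spotting this condition across many apps, the sketch below filters appDeploymentInfo entries whose download phase reports status 199. It assumes the entries have been exported as a JSON array (for example from the CR status); the field names are taken from the status output above, while the function and type names are hypothetical and only the meaning of 199 comes from this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// phaseInfo and appDeployment mirror only the status fields shown in this
// thread; the real CR status contains more fields.
type phaseInfo struct {
	FailCount int    `json:"failCount"`
	Phase     string `json:"phase"`
	Status    int    `json:"status"`
}

type appDeployment struct {
	AppName   string    `json:"appName"`
	PhaseInfo phaseInfo `json:"phaseInfo"`
}

// downloadError is the status value described in this thread as
// "the app package was not downloaded properly".
const downloadError = 199

// failedDownloads returns a description of every app stuck in the download
// phase with status 199. Hypothetical helper for illustration.
func failedDownloads(raw []byte) ([]string, error) {
	var apps []appDeployment
	if err := json.Unmarshal(raw, &apps); err != nil {
		return nil, err
	}
	var failed []string
	for _, a := range apps {
		if a.PhaseInfo.Phase == "download" && a.PhaseInfo.Status == downloadError {
			failed = append(failed, fmt.Sprintf("%s (failCount=%d)", a.AppName, a.PhaseInfo.FailCount))
		}
	}
	return failed, nil
}

func main() {
	// JSON rendering of the single entry quoted earlier in the thread.
	raw := []byte(`[{"appName":"myapp.tgz","phaseInfo":{"failCount":3,"phase":"download","status":199}}]`)
	failed, err := failedDownloads(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(failed)
}
```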

paheath commented 5 months ago

Appears to be running through this periodically for the nodes using app framework:

2024-06-04T00:47:27.481032478Z  INFO    updatePplnWorkerPhaseInfo   changing the status {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "56c9a258-e763-484e-9ffe-6888469133de", "appName": "app.tgz", "old status": "Download In Progress", "new status": "Download Pending"}
2024-06-04T00:47:27.657331829Z  INFO    downloadPhaseManager    Download worker got a run slot  {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "56c9a258-e763-484e-9ffe-6888469133de", "name": "lm", "namespace": "test", "App name": "app.tgz", "digest": "<digest>"} 
2024-06-04T00:47:27.663811632Z  INFO    isAppAlreadyDownloaded  App not present on operator pod {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "56c9a258-e763-484e-9ffe-6888469133de", "app name": "app.tgz"}
2024-06-04T00:47:27.663872366Z  INFO    updatePplnWorkerPhaseInfo   changing the status {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "56c9a258-e763-484e-9ffe-6888469133de", "appName": "app.tgz", "old status": "Download Pending", "new status": "Download In Progress"}
2024-06-04T00:47:27.664103782Z  INFO    GetRemoteStorageClient  Creating the client {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "56c9a258-e763-484e-9ffe-6888469133de", "name": "lm", "namespace": "test", "volume": "config-repo", "bucket": "<bucket>", "bucket path": "lic_manager/"}
2024-06-04T00:47:27.664283386Z  INFO    InitAWSClientSession    AWS Client Session initialization successful.   {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "56c9a258-e763-484e-9ffe-6888469133de", "region": "zone1", "TLS Version": "TLS 1.2"}
2024-06-04T00:47:27.820996027Z  ERROR   DownloadApp Unable to download item {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "56c9a258-e763-484e-9ffe-6888469133de", "remoteFile": "lic_manager/app.tgz", "localFile": "/opt/splunk/appframework/downloadedApps/test/LicenseManager/lm/local/lic_manager/app.tgz_<etag>", "etag": "<etag>", "RemoteFile": "lic_manager/app.tgz", "error": "stream error: stream ID 7; NO_ERROR; received from peer"}
github.com/splunk/splunk-operator/pkg/splunk/client.(*AWSS3Client).DownloadApp
    /workspace/pkg/splunk/client/awss3client.go:277
github.com/splunk/splunk-operator/pkg/splunk/enterprise.(*RemoteDataClientManager).DownloadApp
    /workspace/pkg/splunk/enterprise/util.go:842
github.com/splunk/splunk-operator/pkg/splunk/enterprise.(*PipelineWorker).download
    /workspace/pkg/splunk/enterprise/afwscheduler.go:497
2024-06-04T00:47:27.821131931Z  ERROR   PipelineWorker.Download()   unable to download app  {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "56c9a258-e763-484e-9ffe-6888469133de", "name": "lm", "namespace": "test", "App name": "app.tgz", "objectHash": "<digest>", "appName": "app.tgz", "error": "stream error: stream ID 7; NO_ERROR; received from peer"}
github.com/splunk/splunk-operator/pkg/splunk/enterprise.(*PipelineWorker).download
    /workspace/pkg/splunk/enterprise/afwscheduler.go:499
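Log lines like the ones above end with a JSON object holding the structured fields, so the failing file and the error text can be pulled out mechanically. The sketch below shows one way to do that with Go's standard library. It assumes the first `{` on the line starts the JSON payload, which holds for the lines quoted here but is not a documented guarantee of the operator's log format. `logFields` is a hypothetical name.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// logFields extracts the trailing JSON object from one operator log line.
// Assumption: the first '{' on the line begins the JSON payload.
func logFields(line string) (map[string]any, error) {
	i := strings.Index(line, "{")
	if i < 0 {
		return nil, errors.New("no JSON payload in line")
	}
	var fields map[string]any
	if err := json.Unmarshal([]byte(strings.TrimSpace(line[i:])), &fields); err != nil {
		return nil, err
	}
	return fields, nil
}

func main() {
	// Abbreviated copy of the ERROR line quoted above.
	line := "2024-06-04T00:47:27.820996027Z\tERROR\tDownloadApp\tUnable to download item\t" +
		`{"controller": "licensemanager", "remoteFile": "lic_manager/app.tgz", ` +
		`"error": "stream error: stream ID 7; NO_ERROR; received from peer"}`

	fields, err := logFields(line)
	if err != nil {
		panic(err)
	}
	fmt.Println("file: ", fields["remoteFile"])
	fmt.Println("error:", fields["error"])
}
```

The stack-trace lines that follow an ERROR entry contain no JSON, so `logFields` returns an error for them and a caller can simply skip those lines.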
paheath commented 5 months ago

This is the cluster manager app framework spec I'm using. Same as before with s3PathUrl: true set.

appRepo:
  appsRepoPollIntervalSeconds: {{ .Values.configPollInterval }}
  defaults:
    volumeName: {{ .Values.volumeName }}
  appSources:
  - name: node
    location: node/
    scope: local
  volumes:
  - name: {{ .Values.volumeName }}
    storageType: s3
    path: {{ .Values.bucketPath }}/
    provider: aws
    region: {{ .Values.bucketRegion }}
    s3PathUrl: true
    endpoint: {{ .Values.bucketEndpoint }}
    secretRef: {{ .Values.secretRef }}

akondur commented 5 months ago

Hey @paheath, while we debug further: were you able to successfully install the new CRDs on the new cluster before deploying the clusterManager CR? Please let us know.

paheath commented 5 months ago

Yes, I updated the CRDs beforehand. And the cluster manager accepted the s3PathUrl setting.

paheath commented 5 months ago

Well, maybe it did not take. In the cluster manager spec s3PathUrl is set to true. But when I describe the cluster manager, I see status.Smartstore.Volumes.s3PathUrl is false. Was s3PathUrl added for smartstore also?

paheath commented 5 months ago

Disregard, I see status.AppContext.AppRepo.AppSources.Volumes.s3PathUrl is set to true as expected. I didn't catch that the false setting was in the smartstore status section.

akondur commented 5 months ago

Thank you @paheath. I believe we are setting pathStyleUrl on the AWS S3 client, but as an update to an already-created client before the downloader is created (versus during client creation, as in your successful example here). Some posts online recommend against updating the client once it has been created. I will try to rework the changes so that this option is set during creation.

akondur commented 5 months ago

@paheath Are you able to try it out with the latest changes?

akondur commented 5 months ago

@paheath Please let us know if the latest changes are working.

paheath commented 5 months ago

Forgive me, my bandwidth is limited at the moment. I will do my best to get to this today.

paheath commented 5 months ago

With the latest patch I'm seeing the same "unable to download item" error logs as before. The general behavior is also the same, blocking indexer cluster creation.

akondur commented 5 months ago

Hi @paheath, thank you for testing. Are you able to provide us with Splunk operator pod logs similar to this:

2024-06-06T01:03:17.019639356Z  INFO    InitAWSClientSession    Setting up AWS SDK client       {"controller": "standalone", "controllerGroup": "enterprise.splunk.com", "controllerKind": "Standalone", "Standalone": {"name":"example","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "ido", "reconcileID": "4c684039-fe1b-4bea-b550-ce618f2ef57e", "regionWithEndpoint": "us-west-2|https://s3-us-west-2.amazonaws.com", "pathStyleUrl": true}

The changes in the MR were made with this issue's description in mind; they were made here.

paheath commented 5 months ago

I see similar logs for all nodes using the app framework (standalone, licensemanager, clustermanager)

2024-06-14T21:32:41.612345298Z  INFO    InitAWSClientSession    Setting up AWS SDK client       {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "<id>", "regionWithEndpoint": "zone1|https://<endpoint-fqdn>", "pathStyleUrl": true}
2024-06-14T21:32:41.61252801Z   INFO    InitAWSClientSession    AWS Client Session initialization successful.   {"controller": "licensemanager", "controllerGroup": "enterprise.splunk.com", "controllerKind": "LicenseManager", "LicenseManager": {"name":"lm","namespace":"test"}, "namespace": "test", "name": "lm", "reconcileID": "<id>", "region": "<region>", "TLS Version": "TLS 1.2"}

akondur commented 5 months ago

Hey @paheath , in the MR, the field S3ForcePathStyle of aws.Config is being set here per your original request. Were there any other changes made to make this work? If not, are you able to open a customer JIRA with Splunk Support so we can debug the issue further?

paheath commented 4 months ago

I've been able to test this a little more thoroughly today. I only had to add that one line to make this work previously, but I was testing on top of 2.4.0. I was able to reproduce that success on top of 2.4.0 today, but cherry-picking the one-line change on top of 2.5.2 did not work. Can you think of anything that changed between 2.4.0 and 2.5.2 that would affect the behavior of the AWS S3 client? I compared the two releases, but I couldn't see anything obvious. I assume whatever is breaking this in 2.5.2 is also breaking your PR.

akondur commented 3 months ago

Hi @paheath, after comparing 2.4.0 and 2.5.2, I couldn't see any major differences that would cause the AWS SDK client to behave differently.

We just released 2.6.0. The MR has been rebased. Could you please try with the new version?

akondur commented 3 months ago

Hey @paheath , did you get a chance to try with 2.6.0? If it's not working can you please open a Splunk support case with these details?

akondur commented 3 months ago

Closing the issue for now. Please re-open a Splunk support ticket if the issue persists.