sonatype / nxrm3-helm-repository

Helm3 charts for Nexus Repository Manager

Various concerns with the new `nxrm-aws-resiliency` chart #49

Open lrstanley opened 1 year ago

lrstanley commented 1 year ago

Hello! We recently started looking at shifting from the existing nexus-repository-manager Helm chart to the new nxrm-aws-resiliency chart, and we're noticing a number of problems that I want to raise, as I suspect others will have the same concerns. Some of them have also been reported in other issues.

  1. Helm charts should rarely ever be specific to a single cloud provider. The implication is that Nexus Helm-based deployments are not portable across cloud providers and will require a lot of additional work to port if needed (we're currently on AWS, but this limits our options). The whole point of Kubernetes abstracting everything via pluggable resources is to support portability.
  2. Usage of the AWS secrets-store-csi-driver should be optional, and should honestly not be recommended unless teams are already running the CSI driver. It adds quite a bit of additional complexity that only makes sense if operators are already running such deployments.
    • Related issues: #38.
  3. Usage of the external-dns controller should be optional, even when using Docker functionality. For example, we use wildcard CNAME records through Route 53, so I can't see a need to add additional complexity to the DNS process just for Nexus.
    • Related issues: #18
    • Related PRs: #46
  4. Usage of fluentbit should be optional. I would also say that Prometheus should be supported at some point in the future as another optional extension of the Helm chart. We've actually started working on a Prometheus exporter that exposes various metadata/metrics (written in Go, 50+ distinct metrics at the moment; it could run as a sidecar).
    • Related issues: #18
    • Related PRs: #46
  5. Usage of the AWS load-balancer-controller should be optional, even when using Docker functionality. There are various security risks with dynamic load balancer creation, and it also prevents us from using IaC to manage our load balancers and from applying the many customizations we make to them.
  6. values.yaml doesn't appear to follow common best practices/standards; the old chart aligned with them much more closely.
    • Naming for some keys (e.g. pv and pvc; nexusNs, cloudwatchNs, and externaldnsNs).
    • Environment variables don't appear to be passed through directly; there is no way to add additional environment variables that actually reach the containers/pods.
    • No options to add additional volumes/volume mounts (e.g. if needing to mount license or certs via another method).
    • No options to add additional init containers or regular containers (e.g. sidecars).
    • No options to add annotations to the PVC, so we can't add annotations required by our storage provider or by backup tooling.
  7. The way database credentials are passed in makes them visible in the System Information tab of the Nexus UI, which is a huge security risk. Anything with password, token, etc. in its name should be masked in system-information calls and not returned in the UI.
  8. Misc other chart-level issues:
    • Why is a storageclass resource being created?
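
To illustrate item 6, here is a hypothetical sketch of the extension points we'd expect. None of these keys exist in the current chart; the names are illustrative, following patterns common in community charts:

```yaml
# Hypothetical values.yaml extension points -- illustrative only.
statefulset:
  extraEnv:                  # additional environment variables for the Nexus container
    - name: INSTALL4J_ADD_VM_PARAMS
      value: "-Xms2703m -Xmx2703m"
  extraVolumes:              # e.g. mount a license from an existing Secret
    - name: nexus-license
      secret:
        secretName: nexus-license
  extraVolumeMounts:
    - name: nexus-license
      mountPath: /etc/nexus/license
      readOnly: true
  initContainers: []         # additional init containers
  sidecars: []               # additional containers (e.g. a metrics exporter)
pvc:
  annotations: {}            # e.g. annotations for the storage provider or backup tooling
```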

From just my initial review of the new chart, and after having dealt with 20+ charts from other vendors/platform solutions over the past two years, the new changes seem like a step backwards in almost every way and don't include the functionality necessary for a proper enterprise deployment. Looking to see if any of the above improvements can be made.

Thanks!

vanlongme commented 1 year ago

I totally agree; this chart needs to be more flexible and allow us to customize and reuse it in existing environments. I think the focus should be on developing the chart for the application's own resources; for additional resources like the PVC, StorageClass, Secret, and Ingress, the chart could instead accept the names of objects that already exist on the K8s cluster (helpful for those of us implementing a GitOps strategy). One more to consider:
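
For example, the "reference existing resources" pattern could look like this (all key names are hypothetical, not part of the current chart):

```yaml
# Hypothetical "bring your own resource" keys -- illustrative only.
# Instead of creating these objects, the chart would accept the names
# of objects already managed outside the chart (e.g. via GitOps).
persistence:
  existingClaim: nexus-data           # reuse a PVC created outside the chart
storageClass:
  create: false
  name: gp3                           # reference an existing StorageClass
secret:
  existingSecret: nxrm-db-credentials # reuse an existing Secret
ingress:
  enabled: false                      # bring your own Ingress
```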

muckelba commented 1 year ago

I agree with lrstanley; the current state of the chart is not production-ready, in my opinion. I've noticed some more things to add:

  1. Not every image registry is configurable.
  2. deployment.replicaCount is present in values.yaml but is not used anywhere; the Nexus Deployment has 1 hardcoded.
  3. There is no way to configure multiple Ingresses for Docker registries running on different (internal) ports.
  4. There is no (documented) way to use an existing ingress solution like Traefik; it seems like setting up new ALBs, etc. is required.
vswaminathan777 commented 1 year ago

Thank you @lrstanley for your feedback on the Helm charts! We appreciate you taking the time to compose this list. Please see our response below for each item:

1. Helm charts should rarely ever be specific to a single cloud provider. The implication is that Nexus Helm-based deployments are not portable across cloud providers and will require a lot of additional work to port if needed (we're currently on AWS, but this limits our options). The whole point of Kubernetes abstracting everything via pluggable resources is to support portability. [Sonatype] Yes, we originally created separate Helm charts for convenience reasons, but we are considering consolidating them as we move ahead.

2. Usage of the AWS secrets-store-csi-driver should be optional, and should honestly not be recommended unless teams are already running the CSI driver. It adds quite a bit of additional complexity that only makes sense if operators are already running such deployments. [Sonatype] We'll consider this as well when we create a consolidated Helm chart.

3. Usage of the external-dns controller should be optional, even when using Docker functionality. For example, we use wildcard CNAME records through Route 53, so I can't see a need to add additional complexity to the DNS process just for Nexus. (Related issues: #18. Related PRs: #46.) [Sonatype] Docker configuration is already optional, as indicated in the documentation. There are of course other techniques for setting up a Docker repository.

4. Usage of fluentbit should be optional. I would also say that Prometheus should be supported at some point in the future as another optional extension of the Helm chart. We've actually started working on a Prometheus exporter that exposes various metadata/metrics (written in Go, 50+ distinct metrics at the moment; it could run as a sidecar). (Related issues: #18. Related PRs: #46.) [Sonatype] Fluentbit configuration is an optional step, as indicated in our help documentation and Helm charts.

5. Usage of the AWS load-balancer-controller should be optional, even when using Docker functionality. There are various security risks with dynamic load balancer creation, and it also prevents us from using IaC to manage our load balancers and from applying the many customizations we make to them. [Sonatype] Our documentation is a reference architecture and just one of several ways to deploy in AWS, but yes, this can be made optional for additional flexibility.

6. values.yaml doesn't appear to follow common best practices/standards; the old chart aligned with them much more closely. Naming for some keys (e.g. pv and pvc; nexusNs, cloudwatchNs, and externaldnsNs). Environment variables don't appear to be passed through directly, with no way to add additional environment variables that actually reach the containers/pods. No options to add additional volumes/volume mounts (e.g. to mount a license or certs via another method), no options for additional init containers or sidecars, and no options to add annotations to the PVC for the storage provider or backup tooling. [Sonatype] We are incrementally adding additional parameters and fixing our charts to follow better standards. We will consider the above points as well.

7. The way database credentials are passed in makes them visible in the System Information tab of the Nexus UI, which is a huge security risk. Anything with password, token, etc. in its name should be masked in system-information calls and not returned in the UI. [Sonatype] The database password is currently masked in the System Information UI when passed as an environment variable or system property, but not when passed as a JVM argument. We will fix this. Here is the document with instructions for passing these as environment variables: https://help.sonatype.com/repomanager3/installation-and-upgrades/configuring-nexus-repository-pro-for-h2-or-postgresql
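
As a sketch, and assuming the environment-variable names follow the pattern in the linked document (verify before use), the credentials could be sourced from a Secret so the password never appears as a JVM argument:

```yaml
# Pod spec fragment: database credentials via env vars sourced from a Secret.
# Variable names assumed from the linked Sonatype doc; Secret name is hypothetical.
env:
  - name: NEXUS_DATASTORE_NEXUS_JDBCURL
    value: "jdbc:postgresql://<host>:5432/nexus"
  - name: NEXUS_DATASTORE_NEXUS_USERNAME
    valueFrom:
      secretKeyRef:
        name: nxrm-db-credentials   # hypothetical Secret
        key: username
  - name: NEXUS_DATASTORE_NEXUS_PASSWORD
    valueFrom:
      secretKeyRef:
        name: nxrm-db-credentials
        key: password
```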

Misc other chart-level issues:

  1. Why is a storageclass resource being created? [Sonatype] The Kubernetes documentation (https://kubernetes.io/docs/concepts/storage/storage-classes/#local) recommends this for local volumes so that volume binding can be delayed until pod scheduling, using the volumeBindingMode: WaitForFirstConsumer setting.
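
For reference, the pattern from the linked Kubernetes documentation looks like this:

```yaml
# From the Kubernetes storage-classes docs: a local StorageClass whose volume
# binding is delayed until a consuming pod is scheduled.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
```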
vswaminathan777 commented 1 year ago

Thank you @muckelba for taking the time to provide your feedback on the Helm Charts. Please see our response below against each item:

Not every image registry is configurable [Sonatype] Can you please elaborate to help us understand this better? Are you talking about pulling the images from repositories other than Docker? Or is this for configuring different registries that Nexus Repository should support?

deployment.replicaCount is present in values.yaml but is not used anywhere. The nexus deployment has just a 1 hardcoded [Sonatype] This Helm chart was created for deploying a single-instance Nexus Repository across two AZs; it is not for multi-node active/active deployment, hence the replica count is restricted to 1.
Here is the reference architecture: https://help.sonatype.com/repomanager3/planning-your-implementation/resiliency-and-high-availability/single-node-cloud-resilient-deployment-example-using-aws

There is no way to configure multiple ingresses for docker registries running on different (internal) ports [Sonatype] We have already provided a couple of options as part of our reference architecture to configure multiple Docker Repositories. Here is the link https://help.sonatype.com/repomanager3/planning-your-implementation/resiliency-and-high-availability/single-node-cloud-resilient-deployment-example-using-aws#SingleNodeCloudResilientDeploymentExampleUsingAWS-Requirements

There is no (documented) way to use an existing ingress solution like traefik. It seems like setting new ALBs, etc up is required [Sonatype] Our documentation is a reference architecture and just one of several ways to deploy Nexus Repository in AWS. There can of course be variations on this reference architecture, which should be implemented with careful consideration. However, we can make the ALB configuration optional, which would give users additional flexibility.

muckelba commented 1 year ago

Thank you for the quick response and taking the time to answer.

Not every image registry is configurable [Sonatype] Can you please elaborate to help us understand this better? Are you talking about pulling the images from repositories other than Docker? Or is this for configuring different registries that Nexus Repository should support?

Sorry, I should have clarified that more precisely. I'm talking about four containers (external-dns, fluent-bit, create-nexus-work-dir, directory-creator) using hard-coded images (k8s.gcr.io/external-dns/external-dns, amazon/aws-for-fluent-bit, ubuntu, busybox) with no way to configure their registries. This makes pulling them impossible in environments without direct access to those registries.
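
A hypothetical way the chart could expose this (key names are illustrative, not existing chart values) is the per-image registry/repository override pattern common in community charts, so air-gapped clusters can point at an internal mirror:

```yaml
# Illustrative only -- these keys do not exist in the chart today.
images:
  externalDns:
    registry: k8s.gcr.io                  # default upstream; override for a mirror
    repository: external-dns/external-dns
  fluentBit:
    registry: docker.io
    repository: amazon/aws-for-fluent-bit
  workDirInit:
    registry: docker.io
    repository: ubuntu
  directoryCreator:
    registry: docker.io
    repository: busybox
```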

deployment.replicaCount is present in values.yaml but is not used anywhere. The nexus deployment has just a 1 hardcoded [Sonatype] This Helm chart was created for deploying a single instance Nexus Repository across 2 AZ's and this is not for multi-node active/active deployment. Hence the replica count is restricted to 1. Here is the reference architecture: https://help.sonatype.com/repomanager3/planning-your-implementation/resiliency-and-high-availability/single-node-cloud-resilient-deployment-example-using-aws

That is fine, but I don't understand why there is a variable for that in values.yaml then. Looks like an oversight to me.

There is no way to configure multiple ingresses for docker registries running on different (internal) ports [Sonatype] We have already provided a couple of options as part of our reference architecture to configure multiple Docker Repositories. Here is the link https://help.sonatype.com/repomanager3/planning-your-implementation/resiliency-and-high-availability/single-node-cloud-resilient-deployment-example-using-aws#SingleNodeCloudResilientDeploymentExampleUsingAWS-Requirements

I see, thank you. So I assume that is not something that should be done directly in this chart.

There is no (documented) way to use an existing ingress solution like traefik. It seems like setting new ALBs, etc up is required [Sonatype] Our documentation is a reference architecture and is just one of the several ways for deploying Nexus Repository in AWS. Obviously, there can be variations to this reference architecture which should be implemented with careful consideration. However, we can make the ALB configuration as optional which can provide users with additional flexibility.

Making the ALB optional would be a good improvement.

muckelba commented 9 months ago

For your information, the nexus helm team released a new chart which addresses almost every concern stated in here: https://github.com/sonatype/nxrm3-ha-repository

lrstanley commented 8 months ago

Sorry for the delay. Our team should be looking at the new chart soon and will report back on if it resolves most of the issues we've previously raised.