netbox-community / netbox-chart

A Helm chart for NetBox
https://netbox.readthedocs.io/
Apache License 2.0

GKE storageClass Multi-Attach error #390

Closed D1StrX closed 2 weeks ago

D1StrX commented 1 month ago

The Helm chart version

5.0.0-beta.112

Environment Versions

Kubernetes 1.29
GKE

Custom chart values

persistence:
  enabled: true
  storageClass: "rwo"
  subPath: ""
  accessMode: ReadWriteOnce
  size: 1Gi
  existingClaim: ""
  annotations: {}

Current Behavior & Steps to Reproduce

Regarding the storage behavior mentioned in https://github.com/netbox-community/netbox-chart/issues/357, I took a deeper look into the issue. Even RWO is a problem for us when you run a cluster with multiple K8s worker nodes. The PVC, Netbox and Netbox-worker must reside on the same worker node, otherwise you get: Multi-Attach error for volume <pvc> Volume is already used by pod(s) netbox-worker-xxx. RWX isn't available on GKE, because pd.csi.storage.gke.io doesn't support it.
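As an illustration of the co-location constraint described above, a minimal sketch of a values override that asks the scheduler to place the worker on the same node as the NetBox pod. The `worker.affinity` key and the two pod labels are assumptions and should be checked against the chart version in use; the `podAffinity` structure itself is standard Kubernetes.

```yaml
# Hypothetical values override: keep the worker on the node that runs NetBox,
# so a single ReadWriteOnce volume can be attached by both pods.
worker:
  affinity:                                   # assumed chart key
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: netbox        # assumed pod label
              app.kubernetes.io/component: netbox   # assumed pod label
          # Same hostname means same node, which is what RWO requires.
          topologyKey: kubernetes.io/hostname
```

This only works around the Multi-Attach error; it pins both workloads to one node and does not help once NetBox itself is scaled across nodes.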

And why does Netbox-worker need access to Netbox-media?

Expected Behavior

An alternative or perhaps improved documentation.

NetBox Logs

No response

LeoColomb commented 4 weeks ago

Thanks for filing this issue, @D1StrX.

> And why does Netbox-worker need access to Netbox-media?

That's a good point. Would removing this mount resolve the issue you're facing?

RangerRick commented 3 weeks ago

Wondering this as well. I'd think you'd still have problems if you're running multiple nb replicas and they end up on different nodes.

D1StrX commented 3 weeks ago

As long as we don't use scriptsPersistence and reportsPersistence ... but this wouldn't resolve the main issue. When scaling up the replicas this would indeed create the same issue. A couple of solutions/directions I can think of;

LeoColomb commented 3 weeks ago

Makes sense. Then what's blocking the use of the proper ReadWriteMany access mode? That would be the exact use case for this.

D1StrX commented 3 weeks ago

GKE doesn't support RWX at all. And trying to get this to work is not succeeding: https://github.com/netbox-community/netbox-chart/issues/394.

LeoColomb commented 3 weeks ago

As far as I understand, it does, just not when using Compute Engine disks. I might be wrong, but if so, do you have any reference?

D1StrX commented 3 weeks ago

Several reference points:

  1. Error I am getting: failed to provision volume with StorageClass "<storageclass>": rpc error: code = InvalidArgument desc = VolumeCapabilities is invalid: specified multi writer with mount access type

  2. In Google Cloud Platform, the default storage class uses gce-persistent disk as the provisioner. However gce-persistent disk does not allow RWX mode. By default, gcePersistentDisk volume only permits readonly for multiple consumers.

  3. https://www.googlecloudcommunity.com/gc/Google-Kubernetes-Engine-GKE/pod-failed-to-use-pvc-with-standard-rwx-storageclass/m-p/796156 Solution is going straight for FileStore.
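For reference, a minimal sketch of what "going straight for FileStore" could look like as a StorageClass. It assumes the Filestore CSI driver is enabled on the cluster; the class name `filestore-rwx` is hypothetical, and the `tier` and `network` values are placeholders to adapt.

```yaml
# Sketch of a Filestore-backed StorageClass that can serve ReadWriteMany claims.
# Assumes the GKE Filestore CSI driver is enabled on the cluster.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: filestore-rwx            # hypothetical name
provisioner: filestore.csi.storage.gke.io
volumeBindingMode: Immediate
allowVolumeExpansion: true
parameters:
  tier: standard                 # placeholder tier
  network: default               # placeholder VPC network
```

The chart's `persistence.storageClass` would then point at this class with `accessMode: ReadWriteMany`. Note that Filestore instances have a much larger minimum capacity than the 1Gi requested in the values above, so the `size` would need to change as well.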

LeoColomb commented 3 weeks ago

I honestly don't know what can be done in this repository for this case. The fact that GKE doesn't support ReadWriteMany volumes in some contexts is quite outside our expertise/ability to fix. And NetBox is not even special in that case, as an app with a database backend. All the others I've checked show the exact same behavior.

I'm not even sure I see the use case where ReadWriteOnce, or even ReadOnlyMany, won't cover the needs. The "active" replication should be left to the database only; I'm not sure NetBox would be very suitable for full active replicas across nodes.

D1StrX commented 2 weeks ago

Databases are not relevant in this context. "External" databases use StatefulSets, where each pod has its own PVC/PV. If only the Netbox container (not Worker or Housekeeping) attached to the Media, Scripts and Reports PVCs, the issue would be fixed. When you want to run Netbox HA, use external data sources like Git or S3 for Media, Scripts and Reports.
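A rough sketch of the S3 direction for media, under several assumptions: that the chart exposes `storageBackend` and `storageConfig` values that map to NetBox's storage settings, that the image ships `django-storages` with S3 support, and that the bucket and region shown here (both hypothetical) exist. Credentials are deliberately left out and would need to be supplied separately.

```yaml
# Hypothetical values override: store media in object storage instead of a PVC.
# Key names are assumptions; verify them against the chart's values.yaml.
storageBackend: storages.backends.s3boto3.S3Boto3Storage
storageConfig:
  AWS_STORAGE_BUCKET_NAME: netbox-media   # hypothetical bucket
  AWS_S3_REGION_NAME: europe-west4        # hypothetical region
persistence:
  enabled: false                          # no media PVC, so nothing to multi-attach
```

With no shared volume, NetBox, Worker and Housekeeping pods could be scheduled on any node.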

LeoColomb commented 2 weeks ago

Please give version 5.0.0-beta.137 (or above) a try. A new option has been added to allow read only volume mounts (housekeeping.readOnlyPersistence, worker.readOnlyPersistence). I believe ReadOnlyMany should then be an adequate option.
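A minimal sketch of the two options named above, as they might appear in a values file. The boolean type is an assumption, and the access mode of the underlying claim is left out on purpose because the thread does not settle it.

```yaml
# Sketch for chart 5.0.0-beta.137 or above: mount the shared volumes
# read-only in the worker and housekeeping pods.
worker:
  readOnlyPersistence: true         # assumed to be a boolean
housekeeping:
  readOnlyPersistence: true         # assumed to be a boolean
```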

D1StrX commented 2 weeks ago

Tested, and unfortunately this again requires more work on GKE, since ReadOnlyMany isn't 100% supported: https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/readonlymany-disks#create-rom-pv

I am going to close this issue, since there seems to be no easy/good way to solve this in the cloud (GKE). The only suggestion I can offer is to consider whether Housekeeping and Worker actually need to attach to the three optional PVCs, or whether those mounts can be removed in the chart, leaving only the Netbox container attached. It's better to have the functionality than 2 Netbox replicas IMHO.