k8ssandra / k8ssandra-operator

The Kubernetes operator for K8ssandra
https://k8ssandra.io/
Apache License 2.0

Stargate java heap size is wrongly set #938

Open kertzi opened 1 year ago

kertzi commented 1 year ago

What happened? I'm trying to deploy K8ssandra to EKS. Stargate ends up in a CrashLoopBackOff loop with error 137 (OOM kill).

I think the reason for this is that the container limits are not matched to the Java heap size settings. Currently the container memory is limited to 1Gi while the Java heap is set to 2GB.

Did you expect to see something different? Stargate not crashing

How to reproduce it (as minimally and precisely as possible): Here is my K8ssandraCluster YAML. It is an almost direct copy from the examples.

apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: dev-cassandra
spec:
  reaper:
    heapSize: 256Mi
    autoScheduling:
      enabled: false
    telemetry:
      vector:
        enabled: true
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 100m
            memory: 512Mi
  cassandra:
    serverVersion: "4.1.0"
    datacenters:
      - metadata:
          name: dc1
        size: 3
        storageConfig:
          cassandraDataVolumeClaimSpec:
            storageClassName: ebs-sc
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 5Gi
        config:
          jvmOptions:
            heapSize: 512Mi
        stargate:
          size: 1
          heapSize: 256Mi
    mgmtAPIHeap: 64Mi

Here is part of the Pod description from the cluster after the above manifest was deployed.

Controlled By:  ReplicaSet/integration-cassandra-dc1-default-stargate-deployment-6649d8c784
Containers:
  integration-cassandra-dc1-default-stargate-deployment:
    Container ID:   docker://911574837e3ed93532068d75ff2b30a8632ddd362839777705b59cf4e9b077a2
    Image:          docker.io/stargateio/stargate-4_0:v1.0.67
    Image ID:       docker-pullable://stargateio/stargate-4_0@sha256:eec28880951dcc45e6925e4ff3d6d0538b124f0607cf77d660bea64ce230b1e3
    Ports:          8080/TCP, 8081/TCP, 8082/TCP, 8084/TCP, 8085/TCP, 8090/TCP, 9042/TCP, 8609/TCP, 7000/TCP, 7001/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Tue, 04 Apr 2023 13:22:49 +0300
    Last State:     Terminated
      Reason:       Error
      Exit Code:    143
      Started:      Tue, 04 Apr 2023 13:21:35 +0300
      Finished:     Tue, 04 Apr 2023 13:22:49 +0300
    Ready:          False
    Restart Count:  19
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      200m
      memory:   512Mi
    Liveness:   http-get http://:health/checker/liveness delay=30s timeout=10s period=10s #success=1 #failure=5
    Readiness:  http-get http://:health/checker/readiness delay=30s timeout=10s period=10s #success=1 #failure=5
    Environment:
      LISTEN:                  (v1:status.podIP)
      JAVA_OPTS:              -XX:+CrashOnOutOfMemoryError -Xms268435456 -Xmx268435456 -Dstargate.unsafe.cassandra_config_path=/config/cassandra.yaml -Dstargate.cql.config_path=/config/stargate-cql.yaml
      CLUSTER_NAME:           integration-cassandra
      CLUSTER_VERSION:        4.0
      SEED:                   integration-cassandra-seed-service.default.svc
      DATACENTER_NAME:        dc1
      RACK_NAME:              default
      DISABLE_BUNDLES_WATCH:  true
      ENABLE_AUTH:            true
    Mounts:
      /config from cassandra-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7lqpq (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  cassandra-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      integration-cassandra-dc1-cassandra-config
    Optional:  false
  kube-api-access-7lqpq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true

**Environment**: AWS EKS with k8s version 1.23

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: k8ssandra
  namespace: default
spec:
  chart:
    spec:
      chart: k8ssandra-operator
      sourceRef:
        kind: HelmRepository
        name: k8ssandra
      version: 1.6.1
  interval: 30m0s
  values:
    global:
      clusterScoped: true
* K8ssandra Operator version: `Insert image tag or Git SHA here`

* Kubernetes version information (`kubectl version`):

Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.16-eks-48e63af", GitCommit:"e6332a8a3feb9e0fe3db851878f88cb73d49dd7a", GitTreeState:"clean", BuildDate:"2023-01-24T19:18:15Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}



* Kubernetes cluster kind:

AWS EKS

* Manifests:

Given above

* K8ssandra Operator Logs:

not relevant

**Anything else we need to know?**:

adejanovski commented 1 year ago

Stargate is crash looping most probably because it doesn't support Cassandra 4.1 yet: https://github.com/stargate/stargate/issues/2311
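For reference, a minimal sketch of that workaround against the manifest above, pinning Cassandra to a 4.0.x release (4.0.8 is just an example version):

```yaml
spec:
  cassandra:
    # Stargate v1 does not support Cassandra 4.1 yet, so stay on a 4.0.x release
    serverVersion: "4.0.8"
```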

FTR, resource requirements for the Stargate pods can (and should) be set explicitly under .spec.stargate.resources : https://docs.k8ssandra.io/reference/crd/k8ssandra-operator-crds-latest/#stargatespecresources
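For illustration, a minimal sketch of how that could look in the per-datacenter stargate block used in the manifest above (the request/limit values are placeholders, not recommendations; per the linked CRD docs the same resources block is also available at .spec.stargate):

```yaml
spec:
  cassandra:
    datacenters:
      - metadata:
          name: dc1
        stargate:
          size: 1
          heapSize: 256Mi
          resources:            # explicit container resources for the Stargate pods
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
```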

kertzi commented 1 year ago

Thanks for the fast response! I see, I will test with Cassandra 4.0.

kertzi commented 1 year ago

Yes, the crash loop stops when I downgrade the Cassandra version to 4.0.8. It's still weird that the Java heap sizes are set wrong compared to the definitions in the manifest:

Containers:
  integration-cassandra-dc1-default-stargate-deployment:
    Container ID:   docker://22fa87a8f60b18b775a7c0367dfcaafa230ec1b79eaa8ccf71d12c5cd855eec6
    Image:          docker.io/stargateio/stargate-4_0:v1.0.67
    Image ID:       docker-pullable://stargateio/stargate-4_0@sha256:eec28880951dcc45e6925e4ff3d6d0538b124f0607cf77d660bea64ce230b1e3
    Ports:          8080/TCP, 8081/TCP, 8082/TCP, 8084/TCP, 8085/TCP, 8090/TCP, 9042/TCP, 8609/TCP, 7000/TCP, 7001/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Tue, 04 Apr 2023 16:24:05 +0300
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 04 Apr 2023 16:18:26 +0300
      Finished:     Tue, 04 Apr 2023 16:18:54 +0300
    Ready:          True
    Restart Count:  72
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      200m
      memory:   512Mi
    Liveness:   http-get http://:health/checker/liveness delay=30s timeout=10s period=10s #success=1 #failure=5
    Readiness:  http-get http://:health/checker/readiness delay=30s timeout=10s period=10s #success=1 #failure=5
    Environment:
      LISTEN:                  (v1:status.podIP)
      JAVA_OPTS:              -XX:+CrashOnOutOfMemoryError -Xms268435456 -Xmx268435456 -Dstargate.unsafe.cassandra_config_path=/config/cassandra.yaml -Dstargate.cql.config_path=/config/stargate-cql.yaml
      CLUSTER_NAME:           integration-cassandra
      CLUSTER_VERSION:        4.0
      SEED:                   integration-cassandra-seed-service.default.svc
      DATACENTER_NAME:        dc1
      RACK_NAME:              default
      DISABLE_BUNDLES_WATCH:  true
      ENABLE_AUTH:            true

Or do I miss something?

adejanovski commented 1 year ago

We don't change the memory requirements for the container based on the heap size. I guess that would be a nice improvement, as we can predict the maximum memory the Cassandra container should be allowed to use based on the heap size. But just to be clear: the heap size setting is used to configure the Cassandra process, while resource requirements are set at the container definition level. You can override each of these settings to have them match each other (plan for a bit more than twice the heap size in the memory limit).
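As a hedged illustration of that rule of thumb, using the 256Mi heap from the manifest above (the values are examples, not recommendations):

```yaml
stargate:
  size: 1
  heapSize: 256Mi        # rendered as -Xms268435456 -Xmx268435456 (256 MiB), as seen in JAVA_OPTS
  resources:
    requests:
      memory: 512Mi      # heap plus some headroom
    limits:
      memory: 768Mi      # a bit more than twice the heap size
```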