Can't start scylla with default helm chart because very small volume size

gecube commented 2 years ago

Hello!

I followed the instructions at https://operator.docs.scylladb.com/stable/helm.html but could not get a running Scylla cluster. It looks like the default PV size is 10Gi:

apiVersion: scylla.scylladb.com/v1
kind: ScyllaCluster
metadata:
  annotations:
    meta.helm.sh/release-name: scylla-scylla
    meta.helm.sh/release-namespace: scylla
  labels:
    app.kubernetes.io/managed-by: Helm
    helm.toolkit.fluxcd.io/name: scylla
    helm.toolkit.fluxcd.io/namespace: flux-system
  name: scylla-scylla
  namespace: scylla
spec:
  agentRepository: scylladb/scylla-manager-agent
  agentVersion: 2.5.2
  datacenter:
    name: us-east-1
    racks:
    - agentResources:
        requests:
          cpu: 50m
          memory: 10M
      members: 3
      name: us-east-1a
      resources:
        limits:
          cpu: 1
          memory: 4Gi
        requests:
          cpu: 1
          memory: 4Gi
      scyllaAgentConfig: scylla-agent-config
      scyllaConfig: scylla-config
      storage:
        capacity: 10Gi
  repository: scylladb/scylla
  version: 4.5.1

With this configuration, the pod fails with the following error:

I1230 14:38:26.581342       1 operator/sidecar.go:158] sidecar version "v1.6.0-7-gac9d88f"
I1230 14:38:26.581437       1 flag/flags.go:59] FLAG: --burst="5"
I1230 14:38:26.581445       1 flag/flags.go:59] FLAG: --cpu-count="1"
I1230 14:38:26.581448       1 flag/flags.go:59] FLAG: --help="false"
I1230 14:38:26.581452       1 flag/flags.go:59] FLAG: --kubeconfig=""
I1230 14:38:26.581456       1 flag/flags.go:59] FLAG: --loglevel="2"
I1230 14:38:26.581461       1 flag/flags.go:59] FLAG: --namespace="scylla"
I1230 14:38:26.581464       1 flag/flags.go:59] FLAG: --qps="2"
I1230 14:38:26.581469       1 flag/flags.go:59] FLAG: --secret-name="scylla-scylla-auth-token"
I1230 14:38:26.581472       1 flag/flags.go:59] FLAG: --service-name="scylla-scylla-us-east-1-us-east-1a-0"
I1230 14:38:26.581475       1 flag/flags.go:59] FLAG: --v="2"
I1230 14:38:26.581847       1 operator/sidecar.go:218] "Waiting for single service informer caches to sync"
I1230 14:38:26.682470       1 operator/sidecar.go:235] "Waiting for Service" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
I1230 14:38:26.686835       1 operator/sidecar.go:269] "Waiting for Pod To have scylla ContainerID set" Pod="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:38:26.691850       1 cache/reflector.go:138] k8s.io/client-go@v0.22.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: unknown (get pods)
E1230 14:38:28.203022       1 cache/reflector.go:138] k8s.io/client-go@v0.22.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: unknown (get pods)
I1230 14:38:28.203142       1 operator/sidecar.go:323] "Waiting for NodeConfig's data ConfigMap " Selector="scylla-operator.scylladb.com/config-map-type=NodeConfigData,scylla-operator.scylladb.com/owner-uid=5488e48b-c678-4766-ad3b-37e2126c22a2"
I1230 14:38:28.208418       1 operator/sidecar.go:385] "Starting scylla"
I1230 14:38:28.208433       1 config/config.go:64] Setting up scylla.yaml
I1230 14:38:28.208578       1 config/config.go:96] "no scylla.yaml config map available"
I1230 14:38:28.211683       1 config/config.go:68] Setting up cassandra-rackdc.properties
I1230 14:38:28.211727       1 config/config.go:157] "unable to read properties" file="/mnt/scylla-config/cassandra-rackdc.properties"
I1230 14:38:28.211845       1 config/config.go:72] Setting up entrypoint script
I1230 14:38:28.227197       1 config/config.go:253] "Scylla version detected" version={version:{Major:4 Minor:5 Patch:1 Pre:[] Build:[]} unknown:false}
I1230 14:38:28.227270       1 config/config.go:282] "Scylla entrypoint" Command="/docker-entrypoint.py --developer-mode=0 --overprovisioned=1 --smp=1 --prometheus-address=0.0.0.0 --listen-address=0.0.0.0 --broadcast-address=10.245.89.175 --broadcast-rpc-address=10.245.89.175 --seeds=10.245.89.175"
I1230 14:38:28.227340       1 cache/shared_informer.go:240] Waiting for caches to sync for Prober
I1230 14:38:28.227358       1 cache/shared_informer.go:247] Caches are synced for Prober 
I1230 14:38:28.227367       1 operator/sidecar.go:414] "Starting Prober server"
I1230 14:38:28.227599       1 sidecar/controller.go:170] "Starting controller" Controller="SidecarController"
I1230 14:38:28.227611       1 cache/shared_informer.go:240] Waiting for caches to sync for SidecarController
I1230 14:38:28.227619       1 cache/shared_informer.go:247] Caches are synced for SidecarController 
running: (['/opt/scylladb/scripts/scylla_dev_mode_setup', '--developer-mode', '0'],)
running: (['/opt/scylladb/scripts/scylla_io_setup'],)
ERROR:root:Filesystem at /var/lib/scylla/data has only 9910345728 bytes available; that is less than the recommended 10 GB. Please free up space and run scylla_io_setup again.

failed!
Traceback (most recent call last):
  File "/docker-entrypoint.py", line 27, in <module>
    setup.io()
  File "/scyllasetup.py", line 67, in io
    self._run(['/opt/scylladb/scripts/scylla_io_setup'])
  File "/scyllasetup.py", line 37, in _run
    subprocess.check_call(*args, **kwargs)
  File "/opt/scylladb/python3/lib64/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/scylladb/scripts/scylla_io_setup']' returned non-zero exit status 1.
E1230 14:38:31.835289       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:38:41.835672       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:38:51.836392       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:39:01.835292       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:39:11.835123       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:39:21.835211       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:39:31.835776       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:39:41.834980       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:39:51.835903       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:40:01.835599       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:40:11.834676       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:40:21.835544       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:40:31.834940       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:40:41.835909       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:40:51.835099       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:41:01.835945       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:41:11.835740       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:41:21.836038       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:41:31.835636       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:41:41.834754       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"

I think we should make the defaults more reasonable and raise the default capacity to at least 15GiB: https://github.com/scylladb/scylla-operator/blob/6e9424fa2c4206c1e3e6fd74b9398e5a36d91f26/helm/scylla/values.yaml#L58
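For context on the numbers: a 10Gi request is 10,737,418,240 bytes, while the log above reports 9,910,345,728 bytes available, so roughly 789 MiB (about 7.7%) of the volume is not available to Scylla. Until the default changes, a values override along these lines might work around it. This is only a sketch: the file name is hypothetical, and the key layout is an assumption that the chart's values mirror the rack entry of the rendered ScyllaCluster above, so check it against the chart's values.yaml.

```yaml
# values.override.yaml (hypothetical file name)
# Assumption: the chart's values mirror the rack layout of the rendered
# ScyllaCluster shown earlier in this issue.
# Helm replaces list values wholesale instead of merging them, so the
# whole rack entry is repeated here, not just the storage key.
datacenter: us-east-1
racks:
  - name: us-east-1a
    members: 3
    scyllaConfig: scylla-config
    scyllaAgentConfig: scylla-agent-config
    storage:
      capacity: 15Gi  # raised from 10Gi to leave headroom for filesystem overhead
    resources:
      limits:
        cpu: 1
        memory: 4Gi
      requests:
        cpu: 1
        memory: 4Gi
```

With plain Helm the file would be passed with -f on install or upgrade; with a Flux HelmRelease (as the labels above suggest) the same keys would go under spec.values.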

tnozicka commented 2 years ago

yeah, I guess there is some filesystem overhead, and we should raise the default

Anik-saha commented 2 years ago

The issue is still very much live

violinorg commented 1 year ago

The issue is still very much live

mykaul commented 1 year ago

> The issue is still very much live

The patch has not been merged yet, but you can still provide feedback: does it solve the issue for you?

scylla-operator-bot[bot] commented 4 months ago

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 30d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out

/lifecycle stale

gecube commented 4 months ago

/remove-lifecycle stale

scylla-operator-bot[bot] commented 3 months ago

/lifecycle stale

Ret2Me commented 2 months ago

Hi, any update? How can I solve this issue?

rzetelskik commented 2 months ago

@Ret2Me if you want to take this, first try to reproduce the issue. Then try fixing it by raising the storage capacity in the Helm charts' default values (and setting a correspondingly high fs.aio-max-nr in sysctls; see e.g. https://github.com/scylladb/scylla-operator/pull/1013), and update the generated files. Feel free to ask if you run into any obstacles.
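As a rough illustration of where those two settings end up, here is a fragment of a rendered ScyllaCluster spec. It is a sketch, not the actual patch: the fs.aio-max-nr number is a placeholder (the right value is not stated in this thread), and 15Gi is simply the capacity proposed earlier in this issue.

```yaml
# Fragment of a ScyllaCluster spec, not a complete manifest.
spec:
  sysctls:
    # Placeholder value for illustration only; this thread does not say
    # which number to use.
    - "fs.aio-max-nr=2097152"
  datacenter:
    name: us-east-1
    racks:
      - name: us-east-1a
        members: 3
        storage:
          capacity: 15Gi  # value proposed earlier in this issue
```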

scylla-operator-bot[bot] commented 1 month ago

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 30d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out

/lifecycle rotten

gecube commented 1 month ago

/remove-lifecycle rotten

scylla-operator-bot[bot] commented 1 week ago

/lifecycle stale

gecube commented 1 week ago

/remove-lifecycle stale

scylladb / scylla-operator

Can't start scylla with default helm chart because very small volume size #906