oracle / mysql-operator

Create, operate and scale self-healing MySQL clusters in Kubernetes
870 stars 235 forks

mysql container in server pod crashes intermittently after deployment. #259

Open d0x2f opened 5 years ago

d0x2f commented 5 years ago

Is this a BUG REPORT or FEATURE REQUEST?

Choose one: BUG REPORT

Versions

MySQL Operator Version: helm chart master (c98210b2c7b176befa00aa0751db184088adfc39) Values.image.tag 0.3.0

Environment:

Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-01T20:08:12Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.5-gke.5", GitCommit:"2c44750044d8aeeb6b51386ddb9c274ff0beb50b", GitTreeState:"clean", BuildDate:"2019-02-01T23:53:25Z", GoVersion:"go1.10.8b4", Compiler:"gc", Platform:"linux/amd64"}

What happened?

When the operator creates the cluster's StatefulSet, one of the pods often enters a crash loop with the following logs:

error-log.txt

What you expected to happen?

All pods to start successfully.

How to reproduce it (as minimally and precisely as possible)?

Here's my cluster.yaml:

---
kind: ConfigMap
apiVersion: v1
metadata:
  name: mysql-config
data:
  my.cnf: |-
    [mysqld]
    default_authentication_plugin=mysql_native_password
---
apiVersion: mysql.oracle.com/v1alpha1
kind: Cluster
metadata:
  name: alchemy-database
spec:
  members: 3
  version: 8.0.12
  config:
    name: mysql-config
  volumeClaimTemplate:
    metadata:
      name: data
    spec:
      storageClassName: standard
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: alchemy-database-router
  labels:
    app: alchemy-database-router
spec:
  ports:
    - name: read-write
      port: 6446
      targetPort: 6446
      protocol: TCP
    - name: read-only
      port: 6447
      targetPort: 6447
      protocol: TCP
  selector:
    app: alchemy-database-router
  type: ClusterIP
  clusterIP: None
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: alchemy-database-router
  labels:
    app: alchemy-database-router
spec:
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: alchemy-database-router
    spec:
      containers:
      - name: mysqlrouter
        image: mysql/mysql-router:8.0.12
        env:
        - name: MYSQL_PASSWORD
          valueFrom:
            secretKeyRef:
              name: alchemy-database-root-password
              key: password
        - name: MYSQL_USER
          value: root
        - name: MYSQL_PORT
          value: "3306"
        - name: MYSQL_HOST
          value: alchemy-database
        - name: MYSQL_INNODB_NUM_MEMBERS
          value: "3"
        command:
        - "/bin/bash"
        - "-cx"
        - "exec /run.sh mysqlrouter"
        ports:
          - containerPort: 6446
          - containerPort: 6447

Anything else we need to know?

mysql-operator is installed into the same namespace as the above yaml, "alchemy".

This YAML is based on some of the examples provided in this repo; however, I've changed the access mode on the volume claims to ReadWriteOnce, because ReadWriteMany isn't supported on GKE out of the box. Perhaps ReadWriteMany is required by mysql-operator?

By following the link at the end of the crash log I found the line:

The preceding means that normally you should not get corrupted tables unless one of the following happens:

  • Some external program is manipulating data files or index files at the same time as mysqld without locking the table properly.

Also, once a pod crashes it continues to crash on every restart, while the others keep running with no issue.

All of this makes me think it might be something to do with the access mode, but I was under the impression that each pod mounts its own PV, and so ReadWriteOnce should be sufficient.
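For reference, a StatefulSet's volumeClaimTemplate does create one independent PVC per replica, so ReadWriteOnce is normally sufficient. A sketch of the claim the template above should produce for the first member, assuming the standard StatefulSet naming scheme `<template-name>-<pod-name>` (the name is an assumption, not taken from the operator's actual output):

```yaml
# Hypothetical PVC generated for replica 0 of the StatefulSet,
# assuming the standard <template-name>-<pod-name> naming scheme.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-alchemy-database-0
  namespace: alchemy
spec:
  storageClassName: standard
  accessModes:
    - ReadWriteOnce  # each claim is mounted by exactly one pod
  resources:
    requests:
      storage: 10Gi
```

`kubectl get pvc -n alchemy` should list one such claim per member, each bound to its own PV.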

d0x2f commented 5 years ago

I've just run a test with a ReadWriteMany PVC using an NFS provisioner and there have been no crashes so far. I'd guess ReadWriteMany is indeed required then.

d0x2f commented 5 years ago

Scratch my ReadWriteMany theory: I'm still getting crashes. Log attached.

crash-log.txt

bweston92 commented 5 years ago

Can you have a look at the events on the deployment/pod? It seems like the pod is receiving a SIGABRT.

d0x2f commented 5 years ago

The crashes may be coinciding with node scale-up events; here are the events of a pod that crashed:

7m41s       Warning   FailedScheduling         Pod    pod has unbound immediate PersistentVolumeClaims (repeated 4 times)
7m41s       Normal    Scheduled                Pod    Successfully assigned alchemy/alchemy-database-1 to gke-cluster-node-pool-e9917db4-c8r6
7m34s       Normal    TriggeredScaleUp         Pod    pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/development-5893cdfe/zones/australia-southeast1-a/instanceGroups/gke-cluster-node-pool-e9917db4-grp 2->3 (max: 3)}]
7m33s       Normal    SuccessfulAttachVolume   Pod    AttachVolume.Attach succeeded for volume "pvc-21f9162d-43f1-11e9-8756-42010a00000a"
7m22s       Normal    Pulling                  Pod    pulling image "mysql/mysql-server:8.0.12"
7m9s        Normal    Pulled                   Pod    Successfully pulled image "mysql/mysql-server:8.0.12"
5m45s       Normal    Created                  Pod    Created container
5m45s       Normal    Started                  Pod    Started container
7m5s        Normal    Pulling                  Pod    pulling image "iad.ocir.io/oracle/mysql-agent:0.3.0"
6m38s       Normal    Pulled                   Pod    Successfully pulled image "iad.ocir.io/oracle/mysql-agent:0.3.0"
6m34s       Normal    Created                  Pod    Created container
6m34s       Normal    Started                  Pod    Started container
5m45s       Normal    Pulled                   Pod    Container image "mysql/mysql-server:8.0.12" already present on machine
2m18s       Warning   Unhealthy                Pod    Readiness probe failed: HTTP probe failed with statuscode: 503
5m58s       Warning   BackOff                  Pod    Back-off restarting failed container

I'm using the smallest instance type (n1-standard-1), so scaling happens often.
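The `pod has unbound immediate PersistentVolumeClaims` warning above means the scheduler placed the pod before its volume was bound, which interacts badly with autoscaler-driven node placement. One possible mitigation (an untested suggestion, assuming Kubernetes 1.12+, where `volumeBindingMode` is available) is a StorageClass that delays binding until the pod is scheduled:

```yaml
# Hypothetical StorageClass that defers PV binding until a pod using
# the claim is scheduled, avoiding the "unbound immediate
# PersistentVolumeClaims" race during node scale-up on GKE.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-delayed
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
volumeBindingMode: WaitForFirstConsumer
```

The cluster's volumeClaimTemplate would then reference `standard-delayed` instead of `standard`. Whether this removes the crashes is unverified; the SIGABRT itself may be a separate issue.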

bweston92 commented 5 years ago

More than likely that's the case.

bweston92 commented 5 years ago

@d0x2f have you been able to resolve this?

d0x2f commented 5 years ago

Unfortunately not. I've switched to larger instance types and a larger minimum node pool, but I'm still getting crashes even without a node scaling event.

I'd be curious whether anyone else can reproduce this. I don't believe there's anything special about my GKE setup.

cotocn commented 5 years ago

I had a similar problem when I increased the number of members in the YAML below.

apiVersion: mysql.oracle.com/v1alpha1
kind: Cluster
metadata:
  name: mysql
spec:
  members: 5  # increased from 3 to 5
  config:
    name: mycnf
  rootPasswordSecret:
    name: mysql-root-user-secret
  volumeClaimTemplate:
    metadata:
      name: data
    spec:
      storageClassName: oci
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
Events:
  Type     Reason                  Age                    From                     Message
  ----     ------                  ----                   ----                     -------
  Normal   Scheduled               41m                    default-scheduler        Successfully assigned mysql-operator/mysql-3 to 10.0.3.4
  Normal   SuccessfulAttachVolume  40m                    attachdetach-controller  AttachVolume.Attach succeeded for volume "ocid1.volume.oc1.phx.abyhqljtirxeg574lohvtxzqxerupv7zc725huszogaie6kghydeiz5mqz4a"
  Normal   Pulled                  40m                    kubelet, 10.0.3.4        Container image "iad.ocir.io/oracle/mysql-agent:0.3.0" already present on machine
  Normal   Created                 40m                    kubelet, 10.0.3.4        Created container
  Normal   Started                 40m                    kubelet, 10.0.3.4        Started container
  Normal   Pulled                  39m (x4 over 40m)      kubelet, 10.0.3.4        Container image "mysql/mysql-server:8.0.12" already present on machine
  Normal   Created                 39m (x4 over 40m)      kubelet, 10.0.3.4        Created container
  Normal   Started                 39m (x4 over 40m)      kubelet, 10.0.3.4        Started container
  Warning  Unhealthy               5m27s (x211 over 40m)  kubelet, 10.0.3.4        Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  BackOff                 28s (x182 over 40m)    kubelet, 10.0.3.4        Back-off restarting failed container

engmsaleh commented 4 years ago

I have the same issue. Does anyone have a solution for this?