microsoft / mssql-docker

Official Microsoft repository for SQL Server in Docker resources
MIT License

Kubernetes Persistence Not Working with Linux MSSQL Docker Image #642

Open · MSC02 opened this issue 4 years ago

MSC02 commented 4 years ago

Hello,

This may be the same problem as described in #136 and #441, but I could not find any solution that works for us.

Setup:

Kubernetes v1.16.2
Node OS Image:              Container Linux by CoreOS 2512.3.0 (Oklo)
Node Operating System:      linux
Node Architecture:          amd64
Container Runtime Version:  docker://18.6.3

I tried this with the following images:

2019-latest
2019-CU6-ubuntu-16.04
2017-CU21-ubuntu-16.04

I also tested this with the official Helm chart (https://github.com/helm/charts/tree/master/stable/mssql-linux), with the same results.

In addition, I used https://docs.microsoft.com/en-us/sql/linux/tutorial-sql-server-containers-kubernetes?view=sql-server-ver15 as a reference.

Everything works fine as long as I don't use persistence with Kubernetes persistent volumes.

Cluster storage:

IsDefaultClass:        Yes
Annotations:           storageclass.kubernetes.io/is-default-class=true
Provisioner:           cluster.local/nfs-client-provisioner
Parameters:            archiveOnDelete=true
AllowVolumeExpansion:  True
MountOptions:          <none>
ReclaimPolicy:         Retain
VolumeBindingMode:     Immediate

Storage-Hardware: Dell EMC Unity XT
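
One detail that may matter here: MountOptions is <none>, so the NFS protocol version is whatever the client negotiates by default. Microsoft's SQL Server on Linux storage guidance calls for NFS 4.2 or later, because older NFS versions lack features the engine relies on (such as fallocate and sparse files), which would fit data files being created at size 0 and then failing to expand. A sketch of pinning the version on the StorageClass (illustrative, not our actual manifest; assumes the provisioner propagates mountOptions to the PVs it creates):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: unity-allflash
provisioner: cluster.local/nfs-client-provisioner
parameters:
  archiveOnDelete: "true"
mountOptions:
  - nfsvers=4.2   # SQL Server on Linux needs NFS 4.2+ features such as fallocate
reclaimPolicy: Retain
volumeBindingMode: Immediate
allowVolumeExpansion: true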

Volume Claim:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: mssql-data
spec:
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce    
  storageClassName: unity-allflash
  resources:
    requests:
      storage: 10Gi

Container configuration:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: {{ include "database.fullname" . }}
  labels:
{{ include "database.labels" . | indent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app.kubernetes.io/name: {{ include "database.name" . }}
      app.kubernetes.io/instance: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app.kubernetes.io/name: {{ include "database.name" . }}
        app.kubernetes.io/instance: {{ .Release.Name }}
    spec:
    {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
    {{- end }}     
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
        - name: {{ .Chart.Name }}
          securityContext:
            {{- toYaml .Values.securityContext | nindent 12 }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          env:
            - name: ACCEPT_EULA
              value: {{ quote .Values.mssql.accepteula }}
            - name: MSSQL_PID
              value: {{ .Values.mssql.edition }}
            - name: SA_PASSWORD
              value: {{ .Values.mssql.sapassword }}
            - name: MSSQL_DATA_DIR
              value: /mssql-data/data    
            #- name: MSSQL_LOG_DIR
            #  value: /mssql-translog/translog  
            #- name: MSSQL_BACKUP_DIR
            #  value: "/mssql-backup/backup"
            #- name: MSSQL_MASTER_DATA_FILE
            #  value: /mssql-data/master/master.mdf
            #- name: MSSQL_MASTER_LOG_FILE
            #  value: /mssql-data/master/mastlog.ldf
          ports:
            - name: http
              containerPort: 1433
              protocol: TCP 
          volumeMounts:
            - name: data
              mountPath: /mssql-data/data    
            #- name: transactionlog
            #  mountPath: /mssql-translog/translog
            #- name: backup
            #  mountPath: /mssql-backup/backup
            #- name: master
            #  mountPath: /mssql-data/master
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: mssql-data  

This configuration maps only the data directory. If I map the root directory (/var/opt/mssql) or the master, log, or temp folders, the pod crashes on startup.
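
For comparison, the deployment in the Microsoft tutorial linked above mounts the whole /var/opt/mssql directory and sets fsGroup so the volume is writable by the non-root mssql user (UID 10001 in the 2019 images). A minimal sketch of that pattern, with illustrative names rather than our chart values:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mssql-tutorial              # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mssql
  template:
    metadata:
      labels:
        app: mssql
    spec:
      securityContext:
        fsGroup: 10001              # lets the non-root mssql user (UID 10001) write to the volume
      containers:
      - name: mssql
        image: mcr.microsoft.com/mssql/server:2019-latest
        ports:
        - containerPort: 1433
        env:
        - name: ACCEPT_EULA
          value: "Y"
        - name: SA_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mssql           # hypothetical Secret holding the SA password
              key: SA_PASSWORD
        volumeMounts:
        - name: mssqldb
          mountPath: /var/opt/mssql # the whole data root, not just the data directory
      volumes:
      - name: mssqldb
        persistentVolumeClaim:
          claimName: mssql-data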

With the data-directory-only configuration above, the instance starts without any errors, and I can reproduce the problem by creating a new database:

Create failed for Database 'test'.  (Microsoft.SqlServer.Smo)

An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)

MODIFY FILE encountered operating system error 31(A device attached to the system is not functioning.) while attempting to expand the physical file '/mssql-data/data/test.mdf'.
CREATE DATABASE failed. Some file names listed could not be created. Check related errors. (Microsoft SQL Server, Error: 5149)

SQL Server logs:

Date,Source,Severity,Message
09/18/2020 09:43:56,spid52,Unknown,/mssql-data/data/test.mdf: Operating system error 31(A device attached to the system is not functioning.) encountered.
09/18/2020 09:43:56,spid52,Unknown,Error: 17053, Severity: 16, State: 1.

When I check the pod/container, I can see that the files are created, but the file size is 0:

root@mssql-beta-database-0:/# ls
bin  boot  dev  etc  home  install.sh  lib  lib64  media  mnt  mssql-data  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
root@mssql-beta-database-0:/# cd mssql-data/data/
root@mssql-beta-database-0:/mssql-data/data# ls -hl
total 0
-rw-r-----. 1 root root 0 Sep 18 09:43 test.mdf
root@mssql-beta-database-0:/mssql-data/data# df -h
Filesystem                                                                                           Size  Used Avail Use% Mounted on
overlay                                                                                              114G   30G   80G  28% /
tmpfs                                                                                                 64M     0   64M   0% /dev
tmpfs                                                                                                7.9G     0  7.9G   0% /sys/fs/cgroup
<IP>:/unity2-nfs-k8s2/staging-mssql-data-pvc-1567f3e3-3010-4bad-b81c-c29de136c727  1.0T  6.5G 1018G   1% /mssql-data/data
/dev/sda9                                                                                            114G   30G   80G  28% /etc/hosts
shm                                                                                                   64M     0   64M   0% /dev/shm
tmpfs                                                                                                7.9G   12K  7.9G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                                                                                                7.9G     0  7.9G   0% /proc/acpi
tmpfs                                                                                                7.9G     0  7.9G   0% /proc/scsi
tmpfs                                                                                                7.9G     0  7.9G   0% /sys/firmware
root@mssql-beta-database-0:/mssql-data/data#

I know the error could be related to our storage or persistent volume configuration, but because the same storage works fine with other deployments, including MySQL, RabbitMQ, and Redis, I am running out of ideas.

As mentioned before, the only hint on this problem I could find was in #136, and like petermicuch mentioned in that thread, we also suspect a permission problem related to the container configuration.

TIA for any hints or ideas on this problem.

839928622 commented 4 years ago

Same problem on minikube. System info:

Name:               izj6c2nzmpsiu04m4rh4gyz
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=izj6c2nzmpsiu04m4rh4gyz
                    kubernetes.io/os=linux
                    minikube.k8s.io/commit=0c5e9de4ca6f9c55147ae7f90af97eff5befef5f-dirty
                    minikube.k8s.io/name=minikube
                    minikube.k8s.io/updated_at=2020_09_21T14_00_49_0700
                    minikube.k8s.io/version=v1.13.0
                    node-role.kubernetes.io/master=
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 21 Sep 2020 14:00:46 +0800
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  izj6c2nzmpsiu04m4rh4gyz
  AcquireTime:     <unset>
  RenewTime:       Wed, 23 Sep 2020 20:27:27 +0800
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 23 Sep 2020 20:23:38 +0800   Mon, 21 Sep 2020 14:00:42 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 23 Sep 2020 20:23:38 +0800   Mon, 21 Sep 2020 14:00:42 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 23 Sep 2020 20:23:38 +0800   Mon, 21 Sep 2020 14:00:42 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 23 Sep 2020 20:23:38 +0800   Mon, 21 Sep 2020 14:01:05 +0800   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  192.168.0.46
  Hostname:    izj6c2nzmpsiu04m4rh4gyz
Capacity:
  cpu:                2
  ephemeral-storage:  41152812Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             8006608Ki
  pods:               110
Allocatable:
  cpu:                2
  ephemeral-storage:  41152812Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             8006608Ki
  pods:               110
System Info:
  Machine ID:                 20200522112903902132804186481140
  System UUID:                b66dbe24-fac9-45b2-84c9-6cb90c206398
  Boot ID:                    f48add6f-46ea-4250-be71-3d14501f7055
  Kernel Version:             5.4.0-31-generic
  OS Image:                   Ubuntu 20.04 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.13
  Kubelet Version:            v1.19.0
  Kube-Proxy Version:         v1.19.0
Non-terminated Pods:          (12 in total)
  Namespace                   Name                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                               ------------  ----------  ---------------  -------------  ---
  default                     api-gateway-6d5bc994c9-6gn7s                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         18h
  default                     api-gateway-f8d8cf796-db4ld                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         18h
  default                     jenkins-5f98bbffcf-9vgnp                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d6h
  default                     mssqldb-7f6d89d556-v77cm                           500m (25%)    0 (0%)      4G (48%)         0 (0%)         3h14m
  default                     webapp-86c58fbfbf-dzs6p                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d5h
  kube-system                 coredns-f9fd979d6-7jjxd                            100m (5%)     0 (0%)      70Mi (0%)        170Mi (2%)     2d6h
  kube-system                 etcd-izj6c2nzmpsiu04m4rh4gyz                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d6h
  kube-system                 kube-apiserver-izj6c2nzmpsiu04m4rh4gyz             250m (12%)    0 (0%)      0 (0%)           0 (0%)         2d6h
  kube-system                 kube-controller-manager-izj6c2nzmpsiu04m4rh4gyz    200m (10%)    0 (0%)      0 (0%)           0 (0%)         2d6h
  kube-system                 kube-proxy-q9wtn                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d6h
  kube-system                 kube-scheduler-izj6c2nzmpsiu04m4rh4gyz             100m (5%)     0 (0%)      0 (0%)           0 (0%)         2d6h
  kube-system                 storage-provisioner                                0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d6h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests          Limits
  --------           --------          ------
  cpu                1150m (57%)       0 (0%)
  memory             4073400320 (49%)  170Mi (2%)
  ephemeral-storage  0 (0%)            0 (0%)
  hugepages-1Gi      0 (0%)            0 (0%)
  hugepages-2Mi      0 (0%)            0 (0%)
Events:              <none>

My deploy.yaml:

# PVC and PV
# What do we want?
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mssql-pvc
spec:
  storageClassName: mylocalstorage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
---
# How do we want it implemented
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-storage
spec:
  storageClassName: mylocalstorage
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/mssqldb/"
    type: DirectoryOrCreate
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mssqldb
spec:
  selector:
    matchLabels:
      app: mssqldb
  replicas: 1
  template: # template for the pods
    metadata:
      labels:
        app: mssqldb
    spec:
      containers:
      - name: mssqldb
        image: mcr.microsoft.com/mssql/server:2019-latest
        resources:
          requests:
            cpu: 500m
            memory: 4000M

        env:
          - name: MSSQL_PID
            value: "Standard"
          - name: ACCEPT_EULA
            value: "Y"
          - name: MSSQL_SA_PASSWORD
            value: "yourStrong(!)Password"
        volumeMounts:
          - name: db-persistent-storage
            mountPath: /var/opt/mssql/data
      volumes:
        - name: db-persistent-storage
          # pointer to the configuration of HOW we want the mount to be implemented
          persistentVolumeClaim:
            claimName: mssql-pvc

---
kind: Service
apiVersion: v1
metadata:
  name: mssqldb
spec:
  selector:
    app: mssqldb
  ports:
    - name: mssql-port
      port: 1433
      nodePort: 30070
  type: NodePort

Container logs:

SQL Server 2019 will run as non-root by default.
This container is running as user mssql.
To learn more visit https://go.microsoft.com/fwlink/?linkid=2099216.
2020-09-23 11:49:12.67 Server      The licensing PID was successfully processed. The new edition is [Standard Edition].
2020-09-23 11:49:13.15 Server      Setup step is copying system data file 'C:\templatedata\master.mdf' to '/var/opt/mssql/data/master.mdf'.
2020-09-23 11:49:13.16 Server      ERROR: Setup FAILED copying system data file 'C:\templatedata\master.mdf' to '/var/opt/mssql/data/master.mdf':  2(The system cannot find the file specified.)
ERROR: BootstrapSystemDataDirectories() failure (HRESULT 0x80070002)

Any idea for a workaround? I have already run out of ideas.
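
One workaround that may be worth trying: the 2019 images run as the non-root mssql user (UID 10001, per the startup log above), while a hostPath directory created by DirectoryOrCreate is owned by root. If that is the cause, an initContainer that fixes ownership of the mount before SQL Server starts could help. An untested sketch against the Deployment above:

      initContainers:                 # runs to completion before the mssqldb container starts
      - name: fix-volume-permissions
        image: busybox:1.32           # any small image providing chown
        command: ["sh", "-c", "chown -R 10001:0 /var/opt/mssql/data"]
        volumeMounts:
        - name: db-persistent-storage
          mountPath: /var/opt/mssql/data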

mehrdadpfg commented 4 years ago

Same problem here, using Kubespray to set up a k8s cluster (v1.19.2) on ESXi, with Cilium as the CNI and NFS for persistent volumes.

I tried both the Helm chart available on the Helm GitHub and the Deployment YAML file provided on the Microsoft website.

I'm also not using the PVSCSI controller in ESXi mentioned in this bug: https://support.microsoft.com/en-us/help/2519834/sql-server-reports-operating-system-error-1117-i-o-device-error-on-vmw

I tried these image tags:

2019-CU8-ubuntu-16.04
2019-CU8-ubuntu-18.04
2019-latest
2019-CU6-ubuntu-16.04
2019-CU6-rhel-7.6
2019-GA-ubuntu-16.04

After writing to the DB, mssql restarts and sqlcmd shows the following:

TCP Provider: Error code 0x2746
Communication link failure

This is the container log:

2020-11-02 07:39:58.78 spid9s      Error: 17053, Severity: 16, State: 1.
2020-11-02 07:39:58.78 spid9s      SQLServerLogMgr::LogWriter: Operating system error 1117(The request could not be performed because of an I/O device error.) encountered.
2020-11-02 07:39:58.79 spid9s      Write error during log flush.
2020-11-02 07:39:58.80 spid51      Error: 9001, Severity: 21, State: 4.
2020-11-02 07:39:58.80 spid51      The log for database 'master' is not available. Check the operating system error log for related error messages. Resolve any errors and restart the database.
2020-11-02 07:39:58.82 spid51      Error: 9001, Severity: 21, State: 5.
2020-11-02 07:39:58.82 spid51      The log for database 'master' is not available. Check the operating system error log for related error messages. Resolve any errors and restart the database.
2020-11-02 07:39:58.84 spid51      Error: 3314, Severity: 21, State: 3.
2020-11-02 07:39:58.84 spid51      During undoing of a logged operation in database 'master' (page (0:0) if any), an error occurred at log record ID (215:176:9). Typically, the specific failure is logged previously as an error in the operating system error log. Restore the database or file from a backup, or repair the database.
2020-11-02 07:39:58.86 spid51      Error: 3449, Severity: 21, State: 1.
2020-11-02 07:39:58.86 spid51      SQL Server must shut down in order to recover a database (database ID 1). The database is either a user database that could not be shut down or a system database. Restart SQL Server. If the database fails to recover after another startup, repair or restore the database.
2020-11-02 07:39:58.89 spid51      SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.

This is the PVC file:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: mssql-data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: "managed-nfs-storage"
  resources:
    requests:
      storage: 8Gi

And this is my deployment file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mssql-deployment
spec:
  replicas: 1
  selector:
     matchLabels:
       app: mssql
  template:
    metadata:
      labels:
        app: mssql
    spec:
      terminationGracePeriodSeconds: 30
      hostname: mssqlinst
      securityContext:
        fsGroup: 10001
      containers:
      - name: mssql
        image: mcr.microsoft.com/mssql/server:2019-latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 1433
        env:
        - name: MSSQL_PID
          value: "Developer"
        - name: ACCEPT_EULA
          value: "Y"
        - name: SA_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mssql
              key: SA_PASSWORD 
        volumeMounts:
        - name: mssqldb
          mountPath: /var/opt/mssql
      volumes:
      - name: mssqldb
        persistentVolumeClaim:
          claimName: mssql-data
---
apiVersion: v1
kind: Service
metadata:
  name: mssql-deployment
spec:
  selector:
    app: mssql
  ports:
    - protocol: TCP
      port: 1433
      targetPort: 1433
  type: LoadBalancer

I tried to persist other volumes (backup, master, and translog) as well; same issue. Everything works fine as long as I don't use persistence with Kubernetes persistent volumes.

Update 1: Mounting a hostPath persistent volume works flawlessly for me.
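
For reference, the rough shape of that hostPath setup (names and path are illustrative, not the exact manifests used); the PVC's storageClassName has to match, and fsGroup: 10001 stays in the pod securityContext as above:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: mssql-hostpath-pv         # illustrative name
spec:
  storageClassName: manual
  capacity:
    storage: 8Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /mnt/mssql              # node directory; must be writable by UID 10001
    type: DirectoryOrCreate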

mehrdadpfg commented 4 years ago

@twright-msft @LuisBosquez any help?

soroshsabz commented 3 years ago

Do you have any explanation for this issue?

MSC02 commented 3 years ago

I have only a small update on this issue: we currently have three Kubernetes clusters running in our infrastructure. On two of them we can reproduce the issue as described. On the third the issue is gone, but we currently cannot explain why. The working cluster runs on different hardware and has a different storage provider (different storage class). Unfortunately, I can't tell you what is causing this problem on the other two clusters; it could be the hardware, the storage provider, the Linux distro the nodes run on, or any other part of the infrastructure. We will investigate further, and I hope to find the root cause, but at the moment I still have no clue.

soroshsabz commented 3 years ago

@amitmsft

mehrdadpfg commented 3 years ago

@MSC02 what is the storage provider on the cluster that works?

khanhvu161188 commented 3 years ago

We have the same problem here. The nodes run Ubuntu 20 and use NFS as the persistent volume, but no luck at all.

soroshsabz commented 3 years ago

@amitmsft any update?

banerjeeamit commented 3 years ago

I would suggest opening a support case with Microsoft to check what issue you are hitting. SQL Server Linux containers do work with persistent volumes on Kubernetes, so some troubleshooting is required to identify your specific root cause.
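
A quick way to start that troubleshooting is to take SQL Server out of the picture and exercise the PVC directly with a throwaway pod, as in the sketch below (illustrative; note that a plain dd does not exercise O_DIRECT or fallocate, which SQL Server does use, so a pass here does not fully clear the storage):

apiVersion: v1
kind: Pod
metadata:
  name: pvc-io-test               # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: test
    image: busybox:1.32
    # Write 100 MB to the volume and list the result to verify basic I/O.
    command: ["sh", "-c", "dd if=/dev/zero of=/data/testfile bs=1M count=100 && ls -lh /data"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: mssql-data       # the PVC under test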

f-camacho commented 2 years ago

Same issue here, but this time when launching a database clone. I'm using a persistent volume mounted on /var/opt/mssql, with the same permissions and owners as when not using persistent volumes. The underlying filesystem is XFS and the host OS is Ubuntu 20.04. I'm using Kubernetes, and the volume is mounted via Heketi/GlusterFS. I tested with different MSSQL versions (2019, 2017) and different Docker base images (Ubuntu 20.04, 18.04, and 16.04):

dbcc clonedatabase (testdb, testdbcopy)

Started executing query at Line 1
Database cloning for 'testdb' has started with target as 'testdbcopy'.
Msg 5123, Level 16, State 1, Line 1
CREATE FILE encountered operating system error 31(A device attached to the system is not functioning.) while attempting to open or create the physical file '/var/opt/mssql/testdb.mdf_MSSQL_DBCC6'.
Msg 1823, Level 16, State 8, Line 1
A database snapshot cannot be created because it failed to start.
Msg 1823, Level 16, State 2, Line 1
A database snapshot cannot be created because it failed to start.
Total execution time: 00:00:00.231

It works as expected without persistent volumes. This issue appeared in the past during initialization of the engine, and Microsoft fixed it, so I suspect this one could be fixed in the same way, but it has to be handled by Microsoft directly in the SQL Server internal code.

waqasdev54 commented 6 months ago

We are encountering data persistence issues with our Microsoft SQL Server image tagged 2019. Despite utilizing Persistent Volume Claims (PVCs) for data storage, we have experienced multiple instances of data loss. Can anyone confirm encountering similar behavior or provide insights into potential causes?