MSC02 opened this issue 4 years ago
Same problem on minikube. System info:
Name: izj6c2nzmpsiu04m4rh4gyz
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=izj6c2nzmpsiu04m4rh4gyz
kubernetes.io/os=linux
minikube.k8s.io/commit=0c5e9de4ca6f9c55147ae7f90af97eff5befef5f-dirty
minikube.k8s.io/name=minikube
minikube.k8s.io/updated_at=2020_09_21T14_00_49_0700
minikube.k8s.io/version=v1.13.0
node-role.kubernetes.io/master=
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 21 Sep 2020 14:00:46 +0800
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: izj6c2nzmpsiu04m4rh4gyz
AcquireTime: <unset>
RenewTime: Wed, 23 Sep 2020 20:27:27 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 23 Sep 2020 20:23:38 +0800 Mon, 21 Sep 2020 14:00:42 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 23 Sep 2020 20:23:38 +0800 Mon, 21 Sep 2020 14:00:42 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 23 Sep 2020 20:23:38 +0800 Mon, 21 Sep 2020 14:00:42 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 23 Sep 2020 20:23:38 +0800 Mon, 21 Sep 2020 14:01:05 +0800 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 192.168.0.46
Hostname: izj6c2nzmpsiu04m4rh4gyz
Capacity:
cpu: 2
ephemeral-storage: 41152812Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8006608Ki
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 41152812Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8006608Ki
pods: 110
System Info:
Machine ID: 20200522112903902132804186481140
System UUID: b66dbe24-fac9-45b2-84c9-6cb90c206398
Boot ID: f48add6f-46ea-4250-be71-3d14501f7055
Kernel Version: 5.4.0-31-generic
OS Image: Ubuntu 20.04 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.13
Kubelet Version: v1.19.0
Kube-Proxy Version: v1.19.0
Non-terminated Pods: (12 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default api-gateway-6d5bc994c9-6gn7s 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18h
default api-gateway-f8d8cf796-db4ld 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18h
default jenkins-5f98bbffcf-9vgnp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d6h
default mssqldb-7f6d89d556-v77cm 500m (25%) 0 (0%) 4G (48%) 0 (0%) 3h14m
default webapp-86c58fbfbf-dzs6p 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d5h
kube-system coredns-f9fd979d6-7jjxd 100m (5%) 0 (0%) 70Mi (0%) 170Mi (2%) 2d6h
kube-system etcd-izj6c2nzmpsiu04m4rh4gyz 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d6h
kube-system kube-apiserver-izj6c2nzmpsiu04m4rh4gyz 250m (12%) 0 (0%) 0 (0%) 0 (0%) 2d6h
kube-system kube-controller-manager-izj6c2nzmpsiu04m4rh4gyz 200m (10%) 0 (0%) 0 (0%) 0 (0%) 2d6h
kube-system kube-proxy-q9wtn 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d6h
kube-system kube-scheduler-izj6c2nzmpsiu04m4rh4gyz 100m (5%) 0 (0%) 0 (0%) 0 (0%) 2d6h
kube-system storage-provisioner 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d6h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1150m (57%) 0 (0%)
memory 4073400320 (49%) 170Mi (2%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
My deploy.yaml:
# PVC and PV
# What do we want?
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mssql-pvc
spec:
  storageClassName: mylocalstorage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
---
# How do we want it implemented?
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-storage
spec:
  storageClassName: mylocalstorage
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/mssqldb/"
    type: DirectoryOrCreate
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mssqldb
spec:
  selector:
    matchLabels:
      app: mssqldb
  replicas: 1
  template: # template for the pods
    metadata:
      labels:
        app: mssqldb
    spec:
      containers:
        - name: mssqldb
          image: mcr.microsoft.com/mssql/server:2019-latest
          resources:
            requests:
              cpu: 500m
              memory: 4000M
          env:
            - name: MSSQL_PID
              value: "Standard"
            - name: ACCEPT_EULA
              value: "Y"
            - name: MSSQL_SA_PASSWORD
              value: "yourStrong(!)Password"
          volumeMounts:
            - name: db-persistent-storage
              mountPath: /var/opt/mssql/data
      volumes:
        - name: db-persistent-storage
          # pointer to the configuration of HOW we want the mount to be implemented
          persistentVolumeClaim:
            claimName: mssql-pvc
---
kind: Service
apiVersion: v1
metadata:
  name: mssqldb
spec:
  selector:
    app: mssqldb
  ports:
    - name: mssql-port
      port: 1433
      nodePort: 30070
  type: NodePort
Container logs:
SQL Server 2019 will run as non-root by default.
This container is running as user mssql.
To learn more visit https://go.microsoft.com/fwlink/?linkid=2099216.
2020-09-23 11:49:12.67 Server The licensing PID was successfully processed. The new edition is [Standard Edition].
2020-09-23 11:49:13.15 Server Setup step is copying system data file 'C:\templatedata\master.mdf' to '/var/opt/mssql/data/master.mdf'.
2020-09-23 11:49:13.16 Server ERROR: Setup FAILED copying system data file 'C:\templatedata\master.mdf' to '/var/opt/mssql/data/master.mdf': 2(The system cannot find the file specified.)
ERROR: BootstrapSystemDataDirectories() failure (HRESULT 0x80070002)
Any idea for a workaround? I've already run out of ideas.
Same problem here using kubespray to bootstrap a k8s cluster v1.19.2 on ESXi, with Cilium as the CNI and NFS for the persistent volume.
I tried both the Helm chart available on the Helm GitHub and the deployment YAML file provided on the Microsoft website.
I'm also not using PVSCSI in ESXi, as mentioned in this bug: https://support.microsoft.com/en-us/help/2519834/sql-server-reports-operating-system-error-1117-i-o-device-error-on-vmw
I tried these image tags:
2019-CU8-ubuntu-16.04
2019-CU8-ubuntu-18.04
2019-latest
2019-CU6-ubuntu-16.04
2019-CU6-rhel-7.6
2019-GA-ubuntu-16.04
After writing into the DB, mssql restarts and sqlcmd shows the following:
TCP Provider: Error code 0x2746
Communication link failure
This is the container log:
2020-11-02 07:39:58.78 spid9s Error: 17053, Severity: 16, State: 1.
2020-11-02 07:39:58.78 spid9s SQLServerLogMgr::LogWriter: Operating system error 1117(The request could not be performed because of an I/O device error.) encountered.
2020-11-02 07:39:58.79 spid9s Write error during log flush.
2020-11-02 07:39:58.80 spid51 Error: 9001, Severity: 21, State: 4.
2020-11-02 07:39:58.80 spid51 The log for database 'master' is not available. Check the operating system error log for related error messages. Resolve any errors and restart the database.
2020-11-02 07:39:58.82 spid51 Error: 9001, Severity: 21, State: 5.
2020-11-02 07:39:58.82 spid51 The log for database 'master' is not available. Check the operating system error log for related error messages. Resolve any errors and restart the database.
2020-11-02 07:39:58.84 spid51 Error: 3314, Severity: 21, State: 3.
2020-11-02 07:39:58.84 spid51 During undoing of a logged operation in database 'master' (page (0:0) if any), an error occurred at log record ID (215:176:9). Typically, the specific failure is logged previously as an error in the operating system error log. Restore the database or file from a backup, or repair the database.
2020-11-02 07:39:58.86 spid51 Error: 3449, Severity: 21, State: 1.
2020-11-02 07:39:58.86 spid51 SQL Server must shut down in order to recover a database (database ID 1). The database is either a user database that could not be shut down or a system database. Restart SQL Server. If the database fails to recover after another startup, repair or restore the database.
2020-11-02 07:39:58.89 spid51 SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.
This is the PVC file
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: mssql-data
  annotations:
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: "managed-nfs-storage"
  resources:
    requests:
      storage: 8Gi
And this is my deployment file
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mssql-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mssql
  template:
    metadata:
      labels:
        app: mssql
    spec:
      terminationGracePeriodSeconds: 30
      hostname: mssqlinst
      securityContext:
        fsGroup: 10001
      containers:
        - name: mssql
          image: mcr.microsoft.com/mssql/server:2019-latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 1433
          env:
            - name: MSSQL_PID
              value: "Developer"
            - name: ACCEPT_EULA
              value: "Y"
            - name: SA_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mssql
                  key: SA_PASSWORD
          volumeMounts:
            - name: mssqldb
              mountPath: /var/opt/mssql
      volumes:
        - name: mssqldb
          persistentVolumeClaim:
            claimName: mssql-data
---
apiVersion: v1
kind: Service
metadata:
  name: mssql-deployment
spec:
  selector:
    app: mssql
  ports:
    - protocol: TCP
      port: 1433
      targetPort: 1433
  type: LoadBalancer
I tried to persist other volumes like backup, master, and translog; same issue. Everything works fine as long as I don't use persistence with Kubernetes persistent volumes.
Update 1: Mounting a hostPath persistent volume works flawlessly for me.
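For anyone who wants to compare, this is roughly what I mean by a hostPath-backed volume. It is only a sketch: the PV/PVC names, node path, and size are placeholders rather than my exact manifests, and storageClassName is left empty so the claim binds to this static PV instead of being dynamically provisioned by the NFS storage class.
# Sketch only: a statically provisioned hostPath PV plus a matching PVC.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mssql-hostpath-pv        # placeholder name
spec:
  storageClassName: ""
  capacity:
    storage: 8Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /mnt/mssql             # placeholder path on the node
    type: DirectoryOrCreate
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mssql-data-hostpath      # placeholder name
spec:
  storageClassName: ""
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
The deployment then references this claim exactly like the mssql-data claim above.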
@twright-msft @LuisBosquez any help?
Do you have any explanation for this issue?
I have only a small update on this issue: we currently have three Kubernetes clusters running in our infrastructure. On two of them we can reproduce the issue as described. On the third the issue is gone, but we currently cannot explain why. The working cluster runs on different hardware and has a different storage provider (different storage class). Unfortunately I can't tell you what is causing this problem on the other two clusters. It could be the hardware, the storage provider, the Linux distro the nodes are running on, or any other part of the infrastructure. We will investigate this further and I hope to find the root cause, but at the moment I still have no clue.
@amitmsft
@MSC02 what is the storage provider on the cluster that works?
We have the same problem here. The nodes run Ubuntu 20 and use NFS as the persistent volume, but no luck at all.
@amitmsft any update?
I would suggest opening a support case with Microsoft to check what issue you are hitting. SQL Server Linux containers do work with persistent volumes on K8s, so some troubleshooting is required to identify your specific root cause.
Same issue here, but this time when launching a database clone. I'm using a persistent volume mounted on /var/opt/mssql, with the same permissions and owners as when not using persistent volumes. The underlying filesystem is XFS and the host OS is Ubuntu 20.04. I'm using Kubernetes and the volume is mounted using Heketi/GlusterFS. I tested it with different MSSQL versions (2019, 2017) and different Docker base images (Ubuntu 20.04, 18.04 and 16.04):
dbcc clonedatabase (testdb, testdbcopy)
Started executing query at Line 1
Database cloning for 'testdb' has started with target as 'testdbcopy'.
Msg 5123, Level 16, State 1, Line 1
CREATE FILE encountered operating system error 31(A device attached to the system is not functioning.) while attempting to open or create the physical file '/var/opt/mssql/testdb.mdf_MSSQL_DBCC6'.
Msg 1823, Level 16, State 8, Line 1
A database snapshot cannot be created because it failed to start.
Msg 1823, Level 16, State 2, Line 1
A database snapshot cannot be created because it failed to start.
Total execution time: 00:00:00.231
It works as expected without persistent volumes. This issue appeared in the past during initialization of the engine, and Microsoft fixed it then, so I suspect this issue could be fixed in the same way, but it has to be handled by Microsoft directly in the SQL Server internal code.
We are encountering data persistence issues with our Microsoft SQL Server image tagged 2019. Despite utilizing Persistent Volume Claims (PVCs) for data storage, we have experienced multiple instances of data loss. Can anyone confirm encountering similar behavior or provide insights into potential causes?
Hello,
This may be the same problem as described in #136 and #441, but I could not find any solution that works for us.
Setup:
I tried this with the following images:
2019-latest
2019-CU6-ubuntu-16.04
2017-CU21-ubuntu-16.04
I also tested this with the official Helm chart (https://github.com/helm/charts/tree/master/stable/mssql-linux), with the same results.
I also used https://docs.microsoft.com/en-us/sql/linux/tutorial-sql-server-containers-kubernetes?view=sql-server-ver15 as an example.
Everything is working fine as long as I don't use persistence with kubernetes persistent volumes.
Cluster storage:
Storage-Hardware: Dell EMC Unity XT
Volume Claim:
Container configuration:
This configuration maps only the data directory (see the sketch below). If I map the root directory (/var/opt/mssql) or the master, log, or temp folders, the pod crashes on startup.
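To make clear what "mapping only the data directory" means, here is a minimal sketch with placeholder names; it is not my actual manifest (which I have not pasted here), just an illustration of mounting the claim at /var/opt/mssql/data instead of /var/opt/mssql.
# Sketch only: pod that mounts the PVC at the data directory, not the root directory.
apiVersion: v1
kind: Pod
metadata:
  name: mssql-data-dir-only              # placeholder name
spec:
  containers:
    - name: mssql
      image: mcr.microsoft.com/mssql/server:2019-latest
      env:
        - name: ACCEPT_EULA
          value: "Y"
        - name: MSSQL_SA_PASSWORD
          value: "yourStrong(!)Password"  # placeholder, use a secret in practice
      volumeMounts:
        - name: mssql-data
          mountPath: /var/opt/mssql/data  # data directory only
  volumes:
    - name: mssql-data
      persistentVolumeClaim:
        claimName: mssql-data             # placeholder claim name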
With this setup the instance starts without any errors and I can reproduce the problem when creating a new database:
SQL Server Logs:
When I check the pod/container I can see that the files are created, but the file size is 0.
I know the error could be related to our storage or persistent volume configuration, but because it works fine with other deployments including MySQL, RabbitMQ, and Redis, I am running out of ideas.
As mentioned before, the only hint on this problem I could find was related to #136, and like petermicuch mentioned in this thread, we also suspect a permission problem related to the container configuration; a sketch of what we want to try next is below.
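In case it really is a permission problem, we plan to try running the container explicitly as the mssql user and group (UID/GID 10001, which as far as I know is what the 2019 images use). This mirrors the fsGroup: 10001 from the deployment posted earlier in this thread and additionally sets runAsUser. It is only a sketch with placeholder names, and we have not verified that it resolves the issue.
# Sketch only: set ownership of mounted volumes to the mssql group and run as the mssql user.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mssql-permission-test          # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mssql-permission-test
  template:
    metadata:
      labels:
        app: mssql-permission-test
    spec:
      securityContext:
        fsGroup: 10001                 # group ownership of mounted volumes
        runAsUser: 10001               # run as the mssql user instead of root
      containers:
        - name: mssql
          image: mcr.microsoft.com/mssql/server:2019-latest
          env:
            - name: ACCEPT_EULA
              value: "Y"
            - name: MSSQL_SA_PASSWORD
              value: "yourStrong(!)Password"   # placeholder, use a secret in practice
          volumeMounts:
            - name: mssql-data
              mountPath: /var/opt/mssql/data
      volumes:
        - name: mssql-data
          persistentVolumeClaim:
            claimName: mssql-pvc               # placeholder claim name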
TIA for every hint or idea on this problem.