Closed: ksyblast closed this issue 7 months ago
vmcore-dmesg.txt.tar.gz (full log attached). Any tips or ideas are highly appreciated.
Looks like this is similar to https://github.com/LINBIT/drbd/issues/86
Hello! Thanks for the report. I guess it would be a good idea to add that information to the DRBD issue, as that seems to be the root cause.
We have seen it internally, but we have never been able to reproduce it reliably. Adding more context seems like a good idea.
Thanks for the answer. Should I add more details about how I reproduced it?
Also, does it make sense to try an older piraeus version? The issue also reproduces with DRBD 9.2.6 and piraeus v2.3.0.
You could try DRBD 9.1.18.
That does mean you have to use host networking, but you already do use that.
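For reference, a minimal sketch of forcing host networking for the satellites with piraeus-operator v2, assuming the podTemplate override in LinstorSatelliteConfiguration works as in a typical v2 install (check the operator docs for your version):

  apiVersion: piraeus.io/v1
  kind: LinstorSatelliteConfiguration
  metadata:
    name: host-network
  spec:
    podTemplate:
      spec:
        # Run the satellite pods (and thus DRBD replication traffic) on the
        # host network, which DRBD 9.1.x needs.
        hostNetwork: true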
@WanzenBug hello. Here are our reproduction steps:
We have a 5-node k8s cluster with SSD storage pools of 100 GB each (thin LVM).
All CSI queues are processed with a single worker thread each:
  csiAttacherWorkerThreads: 1
  csiProvisionerWorkerThreads: 1
  csiSnapshotterWorkerThreads: 1
  csiResizerWorkerThreads: 1
When this setup is run in a continuous cycle, we almost invariably see several node reboots per day. The operating system does not seem to matter; we have hit a similar problem with various 5.x and 6.x kernels from different distributions. In any case, the issue is definitely reproducible on the current Ubuntu 22.04 LTS.
STS spec:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: flog-generator-0
  namespace: test1
spec:
  podManagementPolicy: OrderedReady
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: flog-generator-0
  serviceName: ""
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: flog-generator-0
    spec:
      containers:
      - args:
        - -c
        - /srv/flog/run.sh 2>&1 | tee -a /var/log/flog/fake.log
        command:
        - /bin/sh
        env:
        - name: FLOG_BATCH_SIZE
          value: "1024000"
        - name: FLOG_TIME_INTERVAL
          value: "1"
        image: ex42zav/flog:0.4.3
        imagePullPolicy: IfNotPresent
        name: flog-generator
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/log/flog
          name: flog-pv
      - env:
        - name: LOGS_DIRECTORIES
          value: /var/log/flog
        - name: LOGROTATE_INTERVAL
          value: hourly
        - name: LOGROTATE_COPIES
          value: "2"
        - name: LOGROTATE_SIZE
          value: 500M
        - name: LOGROTATE_CRONSCHEDULE
          value: 0 2 * * * *
        image: blacklabelops/logrotate
        imagePullPolicy: Always
        name: logrotate
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/log/flog
          name: flog-pv
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 30
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: flog-pv
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: linstor-r2
      volumeMode: Filesystem
Thanks! You could also try switching to DRBD 9.1.18. We suspect there is a race condition introduced in the 9.2 branch.
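A minimal sketch of pinning the satellites to DRBD 9.1.18 with piraeus-operator v2; the init container name (drbd-module-loader) and the v9.1.18 tag of the drbd9-jammy loader image are assumptions, so verify them against your deployment before applying:

  apiVersion: piraeus.io/v1
  kind: LinstorSatelliteConfiguration
  metadata:
    name: drbd-9-1-18
  spec:
    podTemplate:
      spec:
        initContainers:
        # Assumed container name and image tag; adjust to match your satellite
        # pods and the tags actually published for drbd9-jammy.
        - name: drbd-module-loader
          image: quay.io/piraeusdatastore/drbd9-jammy:v9.1.18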
Another idea on what might be causing the issue, with a workaround in the CSI driver: https://github.com/piraeusdatastore/linstor-csi/pull/256
You might try that by using the v1.5.0-2-g16c206a tag for the CSI image. You can edit the piraeus-operator-image-config to change the image.
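A hedged sketch of what that edit might look like; the namespace, data key, and component layout shown here are assumptions based on a typical piraeus-operator v2 install, so compare with the ConfigMap actually deployed in your cluster:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: piraeus-operator-image-config
    namespace: piraeus-datastore   # assumed operator namespace
  data:
    # Assumed key name and layout; only the CSI entry is shown here.
    0_piraeus_datastore_images.yaml: |
      base: quay.io/piraeusdatastore
      components:
        linstor-csi:
          image: piraeus-csi
          tag: v1.5.0-2-g16c206a

If the layout in your cluster differs, kubectl get configmap piraeus-operator-image-config -o yaml shows the keys actually in use.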
We have tested with DRBD 9.1.18. It looks like the issue does not reproduce with this version.
I'm also testing 9.1.18 now. Can you tell me, please, is it safe to move an existing installation from 9.2.5 to 9.1.18?
Yes, it is safe.
@WanzenBug it looks like v1.5.0-2-g16c206a solves the node restart problem. Could you please create a tagged release with it (maybe 1.5.1)? Also, it looks like there is still a problem inside DRBD that causes the crash under some conditions. Will you fix it? If you can't reproduce the situation, I can give you SSH access to a cluster where I can reproduce it for you.
Thank you for testing! So just to confirm, you tested with DRBD 9.2.8 and the above CSI version and did not observe the crash?
Then it must have something to do with removing a volume from a resource, as I expected. I will use that to try to reproduce the behaviour.
We tested this with 9.2.5 and 9.2.8 and the above CSI version. Yes, there were no more crashes.
Thank you, I'll wait for your solution.
Can you tell me, will the fix from v1.5.0-2-g16c206a be included in 1.5.1?
Yes, there will be a 1.5.1 with that. We still intend to fix the issue in DRBD, too.
We will also test with 1.5.1 and DRBD 9.2.8 when 1.5.1 is released.
Just wanted to let you know that we think we have tracked down the issue. No fix yet, but we should have something ready for the next DRBD release.
Environment: Kubernetes v1.27.5, bare metal nodes, LVM thin pool, piraeus-operator v2.4.1, Oracle Linux 8, kernel 5.15.0-204.147.6.2.el8uek.x86_64 with the default drbd9-jammy DRBD image. Also reproduced with kernel 4.18 and the drbd9-almalinux8 DRBD image.
How to reproduce: create and subsequently delete a number of volumes and attach them. I tested with about 8 PVCs and pods and ran around 20 rounds of creating and then deleting them. The server randomly reboots because of a crash. Most often it happened during volume deletion, but it was also reproduced during creation of a new PVC.
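A minimal sketch of the kind of manifest that can be applied and deleted in a loop to drive this create/attach/detach/delete cycle; the names and the busybox writer are illustrative, and the linstor-r2 storage class is borrowed from the StatefulSet earlier in the thread:

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: repro-pvc-0
  spec:
    accessModes:
    - ReadWriteOnce
    storageClassName: linstor-r2
    resources:
      requests:
        storage: 5Gi
  ---
  apiVersion: v1
  kind: Pod
  metadata:
    name: repro-pod-0
  spec:
    containers:
    - name: writer
      image: busybox
      # Keep the volume busy so attach/detach actually happens on delete.
      command: ["sh", "-c", "while true; do date >> /data/ts.log; sleep 1; done"]
      volumeMounts:
      - name: data
        mountPath: /data
    volumes:
    - name: data
      persistentVolumeClaim:
        claimName: repro-pvc-0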
The UEK kernel Makefile (/usr/src/kernels/5.15.0-204.147.6.2.el8uek.x86_64/Makefile) was patched to be able to build DRBD: