Open dmrub opened 7 months ago
Please try to update to the latest version.
It also looks like this was not a fresh install? Otherwise, why would there be any resources?
This:
Resource 'pvc-80745669-9bf4-4776-9865-f6f419c57863' on node 'k8s-m2' is still in use.
Looks like the resource (which already existed) is still in use somewhere, so someone still has it mounted or similar. Clean that up first: check the resource state with linstor r l to find where it is "InUse" and unmount it there.
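As a rough sketch of that check, the table printed by linstor r l can be filtered for rows whose Usage column reads InUse. The function name in_use_nodes is made up for illustration, and the parsing assumes the columns are separated by ┊ in the order ResourceName, Node, Port, Usage; if the client prints a different layout, the field numbers need adjusting.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the nodes on which a resource is reported InUse.
# Reads `linstor r l` table output on stdin. Assumption: columns are separated
# by the '┊' character in the order ResourceName, Node, Port, Usage, ...
in_use_nodes() {
  local resource="$1"
  awk -F'┊' -v rsc="$resource" '
    {
      # Field 1 is the empty text before the first separator.
      name = $2; node = $3; usage = $5
      # Strip the padding around each cell.
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      gsub(/^[ \t]+|[ \t]+$/, "", node)
      gsub(/^[ \t]+|[ \t]+$/, "", usage)
      if (name == rsc && usage == "InUse") print node
    }'
}

# Usage (namespace and deployment name as they appear in this thread):
# kubectl exec -n piraeus-datastore deploy/linstor-controller -- linstor r l \
#   | in_use_nodes pvc-80745669-9bf4-4776-9865-f6f419c57863
```

The node names it prints are the places to look for a leftover mount or a DRBD device that is still Primary.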
I will try to upgrade to the latest version, but this is a fresh install. We plan to use Linstor in production, but before that we are doing automated testing by installing a fresh Kubernetes cluster on three VMs and then deploying the piraeus operator via Flux CD. This installation was started on Friday evening; this morning I checked the installation status and found the errors I describe in this issue.
The output of linstor r l:
$ kubectl exec -ti -n piraeus-datastore deploy/linstor-controller -- linstor r l
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-80745669-9bf4-4776-9865-f6f419c57863 ┊ k8s-m0 ┊ 7002 ┊ ┊ ┊ Unknown ┊ ┊
┊ pvc-80745669-9bf4-4776-9865-f6f419c57863 ┊ k8s-m2 ┊ 7002 ┊ InUse ┊ ┊ Unknown ┊ 2024-04-05 15:15:27 ┊
┊ pvc-a6a8ed01-2406-4614-8432-fdef2b2c7abe ┊ k8s-m2 ┊ 7000 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2024-04-05 15:15:24 ┊
┊ pvc-b1d25fdb-8729-474b-ab0e-c031cf159d60 ┊ k8s-m0 ┊ 7001 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2024-04-05 15:16:03 ┊
┊ pvc-b1d25fdb-8729-474b-ab0e-c031cf159d60 ┊ k8s-m1 ┊ 7001 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2024-04-05 15:16:04 ┊
┊ pvc-b1d25fdb-8729-474b-ab0e-c031cf159d60 ┊ k8s-m2 ┊ 7001 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-04-05 15:16:02 ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
The PVC pvc-80745669-9bf4-4776-9865-f6f419c57863 is used by the monitoring, which cannot start:
$ kubectl get pvc -A | grep pvc-80745669-9bf4-4776-9865-f6f419c57863
monitoring kube-prometheus-stack-grafana Bound pvc-80745669-9bf4-4776-9865-f6f419c57863 10Gi RWO linstor-fast 2d17h
$ kubectl get pods -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-kube-prometheus-stack-alertmanager-0 2/2 Running 0 35h
kube-prometheus-stack-grafana-9b8785fdd-m9nkm 0/3 Init:0/1 0 2d17h
kube-prometheus-stack-kube-state-metrics-776c898f6-qbjj9 1/1 Running 0 47h
kube-prometheus-stack-operator-696cbbfbfb-sql6s 1/1 Running 0 35h
kube-prometheus-stack-prometheus-node-exporter-d96g9 1/1 Running 0 2d17h
kube-prometheus-stack-prometheus-node-exporter-dcdh7 1/1 Running 0 2d17h
kube-prometheus-stack-prometheus-node-exporter-gfblh 1/1 Running 0 2d17h
prometheus-kube-prometheus-stack-prometheus-0 2/2 Running 0 35h
So it looks like 6610156F-8EC88-000000 indicates that mkfs failed because DRBD was not set up correctly. But in 66101520-00000-000000 we can see that the resource is apparently in use. This does not make much sense: it would indicate that something is keeping the resource in primary without any actual disk.
Could you please try to run:
kubectl exec -n piraeus-datastore k8s-m2 -- drbdsetup status pvc-80745669-9bf4-4776-9865-f6f419c57863
kubectl exec -n piraeus-datastore k8s-m2 -- drbdsetup show pvc-80745669-9bf4-4776-9865-f6f419c57863
It looks like the CSI driver later tried to create the volume again and somehow determined that the volume already exists, which led to it being bound. I would recommend deleting the PVC and PV and letting them be recreated.
Here is the output of the commands:
$ kubectl exec -n piraeus-datastore k8s-m2 -- drbdsetup status pvc-80745669-9bf4-4776-9865-f6f419c57863
pvc-80745669-9bf4-4776-9865-f6f419c57863 role:Primary
$ kubectl exec -n piraeus-datastore k8s-m2 -- drbdsetup show pvc-80745669-9bf4-4776-9865-f6f419c57863
resource "pvc-80745669-9bf4-4776-9865-f6f419c57863" {
options {
on-no-data-accessible suspend-io;
on-suspended-primary-outdated force-secondary;
}
_this_host {
node-id 0;
}
}
Ok, this looks like a bug in LINSTOR: it does not properly restore the resource to secondary after the mkfs call fails. That still leaves the question of how /dev/drbd1002 can be missing at this point. I have no idea how that can happen.
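The condition seen in the drbdsetup status output above (local role Primary, no disk listed) can be checked with a small filter. This is only a sketch: primary_without_disk is a hypothetical name, and it assumes that a local backing device shows up in the status output as a "disk:" entry, while peer devices show up as "peer-disk:".

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) when `drbdsetup status <resource>`
# output on stdin shows the local role as Primary but lists no local disk.
primary_without_disk() {
  local status
  status="$(cat)"
  # The first line carries the local role, e.g. "<resource> role:Primary".
  printf '%s\n' "$status" | head -n 1 | grep -q 'role:Primary' || return 1
  # Assumption: a local backing device appears as "disk:<state>";
  # "peer-disk:" entries belong to peers and are deliberately not matched.
  if printf '%s\n' "$status" | grep -Eq '(^|[[:space:]])disk:'; then
    return 1
  fi
  return 0
}

# Usage:
# kubectl exec -n piraeus-datastore k8s-m2 -- drbdsetup status <resource> \
#   | primary_without_disk && echo "stuck Primary without a disk"
```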
To fully clean up the volume:
kubectl exec -n piraeus-datastore k8s-m2 -- drbdsetup secondary pvc-80745669-9bf4-4776-9865-f6f419c57863
Then run linstor rd d pvc-80745669-9bf4-4776-9865-f6f419c57863 and delete the PVC and PV.
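The cleanup order above could be wrapped in a small script along these lines. It is a sketch under assumptions, not a tested procedure: the helper names run and cleanup_volume are invented, the PV is assumed to carry the same name as the LINSTOR resource (as it does in this thread), and linstor rd d is assumed to be run through the controller deployment. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of the cleanup order described above. With DRY_RUN=1 (the default
# here) the commands are only printed, so nothing is changed by accident.
# The namespace and controller deployment name are taken from this thread.
NAMESPACE="${NAMESPACE:-piraeus-datastore}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# cleanup_volume <satellite-pod> <resource> <pvc-namespace> <pvc-name>
cleanup_volume() {
  local pod="$1" resource="$2" pvc_ns="$3" pvc_name="$4"
  # 1. Demote the DRBD resource on the node that still holds it Primary.
  run kubectl exec -n "$NAMESPACE" "$pod" -- drbdsetup secondary "$resource"
  # 2. Delete the resource definition in LINSTOR.
  run kubectl exec -n "$NAMESPACE" deploy/linstor-controller -- linstor rd d "$resource"
  # 3. Delete the PVC, then the PV (assumed to share the resource name).
  run kubectl delete pvc -n "$pvc_ns" "$pvc_name"
  run kubectl delete pv "$resource" --ignore-not-found
}

# Usage (prints the four commands; set DRY_RUN=0 to execute them):
# cleanup_volume k8s-m2 pvc-80745669-9bf4-4776-9865-f6f419c57863 \
#   monitoring kube-prometheus-stack-grafana
```

Printing first makes it easy to review the exact resource and PVC names before anything is deleted.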
Your last suggestion worked, I was able to reinstall the monitoring. What would you recommend now? Update to the latest version of piraeus Operator and create a new issue when I get a new error? What steps would help you to analyze this error?
Yes, please upgrade and see if it happens again. In case you encounter an issue, run
kubectl exec -it deploy/linstor-controller -- linstor sos-report create
Then copy the created file from the pod to your host and attach it to the issue.
@WanzenBug, I am currently testing the latest version of Piraeus Operator, v2.5.0, and so far the problem described in this issue has not reoccurred. However, I have just reproduced a problem that I described in another issue: https://github.com/LINBIT/linstor-server/issues/396 . Since I never got a response in the linstor-server project, should I recreate the issue in this (piraeus-operator) project?
Yes, this is an issue more appropriate for the piraeus project.
After installing piraeus-operator I get the error message StorageException: Failed to mkfs /dev/drbd1002.
Kubernetes version: v1.28.8
Piraeus operator: v2.3.0
Piraeus server: v1.25.1
Linstor is installed with the following satellite configuration:
After installation I get a number of errors:
Here are the error reports:
StorageException: Failed to mkfs /dev/drbd1002
============================================================
Application: LINBIT® LINSTOR
Module: Satellite
Version: 1.25.1
Build ID: 918d21837aefab23c28a52e8fcb0af14033d9bcb
Build time: 2023-11-20T10:09:08+00:00
Error time: 2024-04-05 15:15:30
Node: k8s-m2
============================================================
Reported error:
Description: Failed to mkfs /dev/drbd1002
Additional information: Command 'mkfs.ext4 -q -E nodiscard /dev/drbd1002' returned with exitcode 1.
Category: LinStorException
Class name: StorageException
Class canonical name: com.linbit.linstor.storage.StorageException
Generated at: Method 'checkExitCode', Source file 'ExtCmdUtils.java', Line #69
Error message: Failed to mkfs /dev/drbd1002
Error context: An error occurred while processing resource 'Node: 'k8s-m2', Rsc: 'pvc-80745669-9bf4-4776-9865-f6f419c57863''
ErrorContext:
Details: Command 'mkfs.ext4 -q -E nodiscard /dev/drbd1002' returned with exitcode 1.
Standard out:
Error message: The file /dev/drbd1002 does not exist and no size was specified.
Call backtrace:
END OF ERROR REPORT.
$ kubectl exec -ti -n piraeus-datastore deploy/linstor-controller -- linstor error-reports show 66101520-00000-000000
ERROR REPORT 66101520-00000-000000
============================================================
Application: LINBIT® LINSTOR
Module: Controller
Version: 1.25.1
Build ID: 918d21837aefab23c28a52e8fcb0af14033d9bcb
Build time: 2023-11-20T10:09:08+00:00
Error time: 2024-04-05 15:15:32
Node: linstor-controller-5f594b5b45-9lr8z
Peer: RestClient(10.244.42.135; 'linstor-csi/v1.3.0-4077ebefbe439ee2894b782aa7914b590891d2ff')
============================================================
Reported error:
Category: RuntimeException
Class name: ApiRcException
Class canonical name: com.linbit.linstor.core.apicallhandler.response.ApiRcException
Generated at: Method 'deleteVolumeDefinitionInTransaction', Source file 'CtrlVlmDfnDeleteApiCallHandler.java', Line #179
Error message: Resource 'pvc-80745669-9bf4-4776-9865-f6f419c57863' on node 'k8s-m2' is still in use.
Error context: Resource 'pvc-80745669-9bf4-4776-9865-f6f419c57863' on node 'k8s-m2' is still in use.
Asynchronous stage backtrace:
Call backtrace:
Suppressed exception 1 of 1:
Category: RuntimeException
Class name: OnAssemblyException
Class canonical name: reactor.core.publisher.FluxOnAssembly.OnAssemblyException
Generated at: Method 'deleteVolumeDefinitionInTransaction', Source file 'CtrlVlmDfnDeleteApiCallHandler.java', Line #179
Error message:
Error has been observed at the following site(s):
*__checkpoint ⇢ Delete volume definition
Original Stack Trace:
Error context: Resource 'pvc-80745669-9bf4-4776-9865-f6f419c57863' on node 'k8s-m2' is still in use.
Call backtrace:
END OF ERROR REPORT.
$ kubectl exec -ti -n piraeus-datastore deploy/linstor-controller -- linstor error-reports show 66101589-E5863-000000
ERROR REPORT 66101589-E5863-000000
============================================================
Application: LINBIT® LINSTOR
Module: Satellite
Version: 1.25.1
Build ID: 918d21837aefab23c28a52e8fcb0af14033d9bcb
Build time: 2023-11-20T10:09:08+00:00
Error time: 2024-04-05 15:15:52
Node: k8s-m0
============================================================
Reported error:
Description: Operations on resource 'pvc-80745669-9bf4-4776-9865-f6f419c57863' were aborted
Cause: Verification of resource file failed
Additional information: The error reported by the runtime environment or operating system is: The external command 'drbdadm' exited with error code 10
Category: LinStorException
Class name: StorageException
Class canonical name: com.linbit.linstor.storage.StorageException
Generated at: Method 'regenerateResFile', Source file 'DrbdLayer.java', Line #1624
Error message: Generated resource file for resource 'pvc-80745669-9bf4-4776-9865-f6f419c57863' is invalid.
Error context: An error occurred while processing resource 'Node: 'k8s-m0', Rsc: 'pvc-80745669-9bf4-4776-9865-f6f419c57863''
ErrorContext:
Description: Operations on resource 'pvc-80745669-9bf4-4776-9865-f6f419c57863' were aborted
Cause: Verification of resource file failed
Details: The error reported by the runtime environment or operating system is: The external command 'drbdadm' exited with error code 10
Call backtrace:
Caused by:
Description: Execution of the external command 'drbdadm' failed.
Cause: The external command exited with error code 10.
Correction: Check whether the command line is correct. Contact a system administrator or a developer if the command line is no longer valid for the installed version of the external program.
Additional information: The full command line executed was: drbdadm --config-to-test /var/lib/linstor.d/pvc-80745669-9bf4-4776-9865-f6f419c57863.res_tmp --config-to-exclude /var/lib/linstor.d/pvc-80745669-9bf4-4776-9865-f6f419c57863.res sh-nop
The external command sent the following output data:
The external command sent the following error information:
/etc/drbd.conf:54: in resource pvc-80745669-9bf4-4776-9865-f6f419c57863, on k8s-m0 { ... }: volume 0 not defined on k8s-m2
command sh-nop exited with code 10
Category: LinStorException
Class name: ExtCmdFailedException
Class canonical name: com.linbit.extproc.ExtCmdFailedException
Generated at: Method 'execute', Source file 'DrbdAdm.java', Line #642
Error message: The external command 'drbdadm' exited with error code 10
ErrorContext:
Description: Execution of the external command 'drbdadm' failed.
Cause: The external command exited with error code 10.
Correction: - Check whether the external program is operating properly.
The external command sent the following output data:
The external command sent the following error information:
/etc/drbd.conf:54: in resource pvc-80745669-9bf4-4776-9865-f6f419c57863, on k8s-m0 { ... }: volume 0 not defined on k8s-m2
command sh-nop exited with code 10
Call backtrace:
END OF ERROR REPORT.
ERROR REPORT 66101520-00000-000004
============================================================
Application: LINBIT® LINSTOR
Module: Controller
Version: 1.25.1
Build ID: 918d21837aefab23c28a52e8fcb0af14033d9bcb
Build time: 2023-11-20T10:09:08+00:00
Error time: 2024-04-05 15:16:09
Node: linstor-controller-5f594b5b45-9lr8z
Peer: RestClient(10.244.42.135; 'linstor-csi/v1.3.0-4077ebefbe439ee2894b782aa7914b590891d2ff')
============================================================
Reported error:
Category: RuntimeException
Class name: ApiRcException
Class canonical name: com.linbit.linstor.core.apicallhandler.response.ApiRcException
Generated at: Method 'handleAnswer', Source file 'CommonMessageProcessor.java', Line #346
Error message: (Node: 'k8s-m2') Generated resource file for resource 'pvc-80745669-9bf4-4776-9865-f6f419c57863' is invalid.
Error context: (Node: 'k8s-m2') Generated resource file for resource 'pvc-80745669-9bf4-4776-9865-f6f419c57863' is invalid.
Asynchronous stage backtrace:
Call backtrace:
Suppressed exception 1 of 1:
Category: RuntimeException
Class name: OnAssemblyException
Class canonical name: reactor.core.publisher.FluxOnAssembly.OnAssemblyException
Generated at: Method 'handleAnswer', Source file 'CommonMessageProcessor.java', Line #346
Error message:
Error has been observed at the following site(s):
*__checkpoint ⇢ Modify resource-definition
Original Stack Trace:
Error context: (Node: 'k8s-m2') Generated resource file for resource 'pvc-80745669-9bf4-4776-9865-f6f419c57863' is invalid.
Call backtrace:
END OF ERROR REPORT.
ERROR REPORT 6610156F-8EC88-000004
============================================================
Application: LINBIT® LINSTOR
Module: Satellite
Version: 1.25.1
Build ID: 918d21837aefab23c28a52e8fcb0af14033d9bcb
Build time: 2023-11-20T10:09:08+00:00
Error time: 2024-04-05 15:16:12
Node: k8s-m2
============================================================
Reported error:
Description: Operations on resource 'pvc-80745669-9bf4-4776-9865-f6f419c57863' were aborted
Cause: Verification of resource file failed
Additional information: The error reported by the runtime environment or operating system is: The external command 'drbdadm' exited with error code 10
Category: LinStorException
Class name: StorageException
Class canonical name: com.linbit.linstor.storage.StorageException
Generated at: Method 'regenerateResFile', Source file 'DrbdLayer.java', Line #1624
Error message: Generated resource file for resource 'pvc-80745669-9bf4-4776-9865-f6f419c57863' is invalid.
Error context: An error occurred while processing resource 'Node: 'k8s-m2', Rsc: 'pvc-80745669-9bf4-4776-9865-f6f419c57863''
ErrorContext:
Description: Operations on resource 'pvc-80745669-9bf4-4776-9865-f6f419c57863' were aborted
Cause: Verification of resource file failed
Details: The error reported by the runtime environment or operating system is: The external command 'drbdadm' exited with error code 10
Call backtrace:
Caused by:
Description: Execution of the external command 'drbdadm' failed.
Cause: The external command exited with error code 10.
Correction: Check whether the command line is correct. Contact a system administrator or a developer if the command line is no longer valid for the installed version of the external program.
Additional information: The full command line executed was: drbdadm --config-to-test /var/lib/linstor.d/pvc-80745669-9bf4-4776-9865-f6f419c57863.res_tmp --config-to-exclude /var/lib/linstor.d/pvc-80745669-9bf4-4776-9865-f6f419c57863.res sh-nop
The external command sent the following output data:
The external command sent the following error information:
/etc/drbd.conf:54: in resource pvc-80745669-9bf4-4776-9865-f6f419c57863, on k8s-m2 { ... }: volume 0 missing (present on k8s-m0)
command sh-nop exited with code 10
Category: LinStorException
Class name: ExtCmdFailedException
Class canonical name: com.linbit.extproc.ExtCmdFailedException
Generated at: Method 'execute', Source file 'DrbdAdm.java', Line #642
Error message: The external command 'drbdadm' exited with error code 10
ErrorContext:
Description: Execution of the external command 'drbdadm' failed.
Cause: The external command exited with error code 10.
Correction: - Check whether the external program is operating properly.
The external command sent the following output data:
The external command sent the following error information:
/etc/drbd.conf:54: in resource pvc-80745669-9bf4-4776-9865-f6f419c57863, on k8s-m2 { ... }: volume 0 missing (present on k8s-m0)
command sh-nop exited with code 10
Call backtrace:
END OF ERROR REPORT.
k8s-m0:
PV        VG   Fmt  Attr PSize   PFree
/dev/sda2 vg00 lvm2 a--  <99,50g <49,50g
/dev/sdb  vg01 lvm2 a--  <50,00g 516,00m
VG   #PV #LV #SN Attr   VSize   VFree
vg00   1   1   0 wz--n- <99,50g <49,50g
vg01   1   2   0 wz--n- <50,00g 516,00m
LV                                             VG   Attr       LSize  Pool    Origin Data% Meta% Move Log Cpy%Sync Convert
root                                           vg00 -wi-ao---- 50,00g
linstor                                        vg01 twi-aotz-- 49,39g                0,01  10,44
pvc-80745669-9bf4-4776-9865-f6f419c57863_00000 vg01 Vwi-a-tz-- 10,00g linstor        0,01
k8s-m1:
PV        VG   Fmt  Attr PSize   PFree
/dev/sda2 vg00 lvm2 a--  <99,50g <49,50g
/dev/sdb  vg01 lvm2 a--  <50,00g 516,00m
VG   #PV #LV #SN Attr   VSize   VFree
vg00   1   1   0 wz--n- <99,50g <49,50g
vg01   1   2   0 wz--n- <50,00g 516,00m
LV                                             VG   Attr       LSize  Pool    Origin Data% Meta% Move Log Cpy%Sync Convert
root                                           vg00 -wi-ao---- 50,00g
linstor                                        vg01 twi-aotz-- 49,39g                0,43  10,58
pvc-b1d25fdb-8729-474b-ab0e-c031cf159d60_00000 vg01 Vwi-aotz-- 8,00g  linstor        2,68
k8s-m2:
PV        VG   Fmt  Attr PSize   PFree
/dev/sda2 vg00 lvm2 a--  <99,50g <49,50g
/dev/sdb  vg01 lvm2 a--  <50,00g 516,00m
VG   #PV #LV #SN Attr   VSize   VFree
vg00   1   1   0 wz--n- <99,50g <49,50g
vg01   1   3   0 wz--n- <50,00g 516,00m
LV                                             VG   Attr       LSize  Pool    Origin Data% Meta% Move Log Cpy%Sync Convert
root                                           vg00 -wi-ao---- 50,00g
linstor                                        vg01 twi-aotz-- 49,39g                0,83  10,70
pvc-a6a8ed01-2406-4614-8432-fdef2b2c7abe_00000 vg01 Vwi-aotz-- 5,00g  linstor        2,91
pvc-b1d25fdb-8729-474b-ab0e-c031cf159d60_00000 vg01 Vwi-aotz-- 8,00g  linstor        3,28