hpe-storage / truenas-csp

TrueNAS Container Storage Provider for HPE CSI Driver for Kubernetes
https://scod.hpedev.io
MIT License

Duplicate initiator exceptions on fresh systems and single NIC exceptions #41

Closed: msilcher closed this issue 1 year ago

msilcher commented 1 year ago

Hi there,

I gave truenas-csp a try but can't get a pod to mount a volume. The pod description at startup shows the following:

AttachVolume.Attach failed for volume "pvc-8a464e26-5d41-499a-a151-4d99f895a25c" : rpc error: code = Internal desc = Failed to add ACL to volume Data_K8s_pvc-8a464e26-5d41-499a-a151-4d99f895a25c for node &{ debian-k8s cc2cb9e8-24c5-16f2-766e-3c0059f3be1c [0xc0005a9e00] [0xc000803110 0xc000803120 0xc000803130 0xc000803140 0xc000803150 0xc000803160 0xc000803170 0xc000803180 0xc000803190 0xc0008031a0 0xc0008031b0 0xc0008031c0 0xc0008031d0] [] } via CSP, err: Request failed with status code 500 and errors Error code (Exception) and message (Traceback (most recent call last):
  File "/app/truenascsp.py", line 151, in on_put
    'initiator': initiator.get('id')
AttributeError: 'list' object has no attribute 'get'

The same message is seen on the truenas-csp provisioner pod:

Sat, 06 May 2023 17:27:05 +0000 backend INFO Volume found: Data_K8s_pvc-8a464e26-5d41-499a-a151-4d99f895a25c
Sat, 06 May 2023 17:27:06 +0000 backend INFO Volume found: Data_K8s_pvc-ad63e81c-52d0-48e7-ad82-54893b606451
Sat, 06 May 2023 17:27:06 +0000 backend INFO Host updated: cc2cb9e8-24c5-16f2-766e-3c0059f3be1c
Sat, 06 May 2023 17:27:06 +0000 backend INFO Host updated: cc2cb9e8-24c5-16f2-766e-3c0059f3be1c
Sat, 06 May 2023 17:27:06 +0000 backend ERROR Exception: Traceback (most recent call last):
  File "/app/truenascsp.py", line 151, in on_put
    'initiator': initiator.get('id')
AttributeError: 'list' object has no attribute 'get'

I'm using Kubernetes 1.27.1 on a Debian 11.7 VM with the latest versions of the hpe-storage and truenas-csp manifests. This happens on both TrueNAS CORE and TrueNAS SCALE (always latest versions) via iSCSI (as the guide shows).

By the way: I see the provisioner is using API v1. Is there a way to force/set API v2? Would it make any difference?

Thank you!
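
For reference, the v2 API in question is TrueNAS's REST API under /api/v2.0. Below is a minimal sketch of querying it directly for the iSCSI initiator groups discussed later in this thread, assuming a reachable TrueNAS host and an API key; the host, key, and verify=False setting are homelab placeholders and not taken from the CSP, which selects the API version on its own:

```python
# Minimal sketch (not from the CSP code) of a direct TrueNAS v2.0 REST API call.
import requests

TRUENAS_HOST = "truenas.local"   # placeholder
API_KEY = "1-xxxxxxxx"           # placeholder: an API key created in the TrueNAS UI

resp = requests.get(
    f"https://{TRUENAS_HOST}/api/v2.0/iscsi/initiator",
    headers={"Authorization": f"Bearer {API_KEY}"},
    verify=False,                # homelab with a self-signed certificate
    timeout=10,
)
resp.raise_for_status()
for initiator in resp.json():    # v2.0 query endpoints return a JSON list
    print(initiator["id"], initiator.get("initiators"))
```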

datamattsson commented 1 year ago

Thanks for reporting this. It appears this call returns more than one initiator. I found one occurrence where this could happen and fixed it, but this is yet another corner case.

If you don't mind, would you want to test this image: quay.io/datamattsson/truenas-csp:v2.3.0-initfix?
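
The traceback above points at initiator.get('id') being called on a list rather than a single dict, which matches a query that returned multiple entries. A minimal sketch of that failure mode and a defensive unpacking step follows; pick_initiator is a hypothetical helper for illustration, not the actual code in truenascsp.py:

```python
# Hypothetical illustration of the AttributeError in the report above: a query
# result that may be a list of matches has to be unpacked before .get() is used.
def pick_initiator(api_result):
    """Return the id of the first matching initiator group, or None."""
    if isinstance(api_result, list):              # query endpoints return a list of matches
        api_result = api_result[0] if api_result else None
    if api_result is None:
        return None
    return api_result.get('id')                   # safe: api_result is now a single dict

print(pick_initiator({'id': 7}))                  # -> 7
print(pick_initiator([{'id': 7}, {'id': 8}]))     # -> 7 (first of the duplicates)
print(pick_initiator([]))                         # -> None
# The failing pattern was effectively: [{'id': 7}, {'id': 8}].get('id') -> AttributeError
```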

datamattsson commented 1 year ago

Looking at the logs there I'm assuming that you're provisioning a workload with two PVCs attached on a completely fresh system?

Another workaround is to delete one of the duplicate initiators that got created in the publishing process.

msilcher commented 1 year ago

> Looking at the logs there I'm assuming that you're provisioning a workload with two PVCs attached on a completely fresh system?
>
> Another workaround is to delete one of the duplicate initiators that got created in the publishing process.

That's correct! I just tested the storage provisioner with a pihole instance that requests 2 PVCs for the same pod.

msilcher commented 1 year ago

> Thanks for reporting this. It appears this call returns more than one initiator. I found one occurrence where this could happen and fixed it, but this is yet another corner case.
>
> If you don't mind, would you want to test this image: quay.io/datamattsson/truenas-csp:v2.3.0-initfix?

Sure, I'll test it and give you feedback!

datamattsson commented 1 year ago

> That's correct! I just tested the storage provisioner with a pihole instance that requests 2 PVCs for the same pod.

Makes sense. If the initiator doesn't exist on the backend (TrueNAS) and two or more requests come in at the same time, the same initiator gets created twice, which causes problems in the staging phase.
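
A toy reproduction of that race, to make the sequence concrete: two publish requests check for an existing initiator at the same time, both see none, and both create one. Everything here is hypothetical and in-memory; it mirrors the pattern, not the actual truenas-csp code:

```python
# Toy check-then-create race: both workers pass the "does it exist?" check
# before either of them has created anything, so two identical entries appear.
import threading

initiators = []                  # stands in for the TrueNAS initiator list
barrier = threading.Barrier(2)   # force both workers past the check together

def publish(node_id):
    exists = any(i["comment"] == node_id for i in initiators)   # check...
    barrier.wait()
    if not exists:
        initiators.append({"comment": node_id})                 # ...then create

workers = [threading.Thread(target=publish, args=("debian-k8s",)) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(initiators)   # two entries for the same node: the duplicate seen in the TrueNAS UI
# A tolerant lookup (take the first match instead of assuming exactly one)
# lets later requests succeed despite the duplicate.
```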

msilcher commented 1 year ago

Tested with the new image you mentioned, but it still fails. I again see 2 identical initiators created:

[screenshot: duplicate initiators in the TrueNAS UI]

This time the provisioner complains about something else:

Sat, 06 May 2023 20:59:18 +0000 backend INFO Volume found: Data_K8s_pvc-ea371f8c-1175-408c-a581-ee9ca8a27215
Sat, 06 May 2023 20:59:18 +0000 backend INFO Volume found: Data_K8s_pvc-97f910f6-454d-43d4-9b47-1cdaae1233ad
Sat, 06 May 2023 20:59:18 +0000 backend INFO Host updated: cc2cb9e8-24c5-16f2-766e-3c0059f3be1c
Sat, 06 May 2023 20:59:18 +0000 backend INFO Host updated: cc2cb9e8-24c5-16f2-766e-3c0059f3be1c
Sat, 06 May 2023 20:59:18 +0000 backend ERROR Exception: Traceback (most recent call last):
  File "/app/truenascsp.py", line 167, in on_put
    req_backend['auth_networks'] = api.ipaddrs_to_networks(discovery_ips)
  File "/app/backend.py", line 120, in ipaddrs_to_networks
    for alias in interface['aliases']:
TypeError: string indices must be integers

datamattsson commented 1 year ago

> Tested with the new image you mentioned, but it still fails. I again see 2 identical initiators created:

The duplicate initiator being created is a race I don't think I can mitigate; the patched image is about living with it instead.

> This time the provisioner complains about something else:

Oh, I think I know what this is. Either you have the hpe-csi portal misconfigured or just one IP address assigned to it?

msilcher commented 1 year ago

> Tested with the new image you mentioned, but it still fails. I again see 2 identical initiators created:
>
> The duplicate initiator being created is a race I don't think I can mitigate; the patched image is about living with it instead.
>
> This time the provisioner complains about something else:
>
> Oh, I think I know what this is. Either you have the hpe-csi portal misconfigured or just one IP address assigned to it?

It is a homelab; only 1 IP address is assigned to the TrueNAS iSCSI portal:

[screenshot: iSCSI portal configuration with a single IP address]

I was not aware that more than 1 IP address must be available. It would make sense in a production environment, though. Is there a workaround for this?

P.S.: I could add a second IP to TrueNAS, I guess.

msilcher commented 1 year ago

> P.S.: I could add a second IP to TrueNAS, I guess.

Added a second IP to the portal, but the issue persists:

[screenshot: iSCSI portal with the second IP address added]

Is there something else I need to do on the provisioner side?

datamattsson commented 1 year ago

> Added a second IP to the portal, but the issue persists:

I'm trying to reproduce this on my end. Do you only have one NIC on this system? (configured or not)

datamattsson commented 1 year ago

> I'm trying to reproduce this on my end.

I got it to break now. It's the single NIC that is causing the issue. I'll have a new image shortly.
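
For context on the TypeError above ('string indices must be integers' in ipaddrs_to_networks): that is what iterating interface data produces when an element turns out to be a plain string instead of a dict, which apparently can happen on a single-NIC box. Below is a hedged sketch of what such a helper needs to do, mapping discovery IPs to the networks of the reported NIC aliases; the payload shape and the guard are assumptions for illustration, and the real backend.py may differ:

```python
# Hedged reconstruction of the idea behind ipaddrs_to_networks: turn the
# portal/discovery IPs into the networks they live on, using the NIC aliases
# reported by TrueNAS, while skipping entries that are not dicts.
import ipaddress

def ipaddrs_to_networks(discovery_ips, interfaces):
    networks = set()
    for interface in interfaces:
        if not isinstance(interface, dict):       # guard against the bare-string case
            continue
        for alias in interface.get("aliases", []):
            if not isinstance(alias, dict) or "address" not in alias:
                continue
            net = ipaddress.ip_network(
                f'{alias["address"]}/{alias.get("netmask", 32)}', strict=False
            )
            if any(ipaddress.ip_address(ip) in net for ip in discovery_ips):
                networks.add(str(net))
    return sorted(networks)

# Example with a made-up single-NIC payload:
print(ipaddrs_to_networks(
    ["192.168.1.50"],
    [{"name": "em0", "aliases": [{"address": "192.168.1.10", "netmask": 24}]}],
))   # -> ['192.168.1.0/24']
```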

msilcher commented 1 year ago

> Added a second IP to the portal, but the issue persists:
>
> I'm trying to reproduce this on my end. Do you only have one NIC on this system? (configured or not)

Yes, only one NIC at the moment. I added a second IP to the same NIC, but it's still not working. Everything is running in a virtual environment, so I could add a second NIC and split the IPs (one per NIC) for testing purposes, but to be honest that exceeds the idea of having a simple homelab setup :)

datamattsson commented 1 year ago

OK, this image should work: quay.io/datamattsson/truenas-csp:v2.3.0-initfix-sif. You can remove your additional IP address; it's not needed.

msilcher commented 1 year ago

> quay.io/datamattsson/truenas-csp:v2.3.0-initfix-sif

Yes sir!!! Pod mounted both volumes without issues. Provisioner logs are clean:

Sat, 06 May 2023 22:25:52 +0000 backend ERROR Not found: Volume with name pvc-444aa725-5c30-4984-a051-79249897721f not found.
Sat, 06 May 2023 22:25:52 +0000 backend ERROR Not found: Volume with name pvc-367e95d4-3dc0-4276-bee7-0c9bb1048ce3 not found.
Sat, 06 May 2023 22:25:53 +0000 backend INFO Volume created: pvc-367e95d4-3dc0-4276-bee7-0c9bb1048ce3
Sat, 06 May 2023 22:25:53 +0000 backend INFO Volume created: pvc-444aa725-5c30-4984-a051-79249897721f
Sat, 06 May 2023 22:26:33 +0000 backend INFO Volume found: Data_K8s_pvc-444aa725-5c30-4984-a051-79249897721f
Sat, 06 May 2023 22:26:33 +0000 backend INFO Volume found: Data_K8s_pvc-367e95d4-3dc0-4276-bee7-0c9bb1048ce3
Sat, 06 May 2023 22:26:33 +0000 backend INFO Host updated: cc2cb9e8-24c5-16f2-766e-3c0059f3be1c
Sat, 06 May 2023 22:26:33 +0000 backend INFO Host updated: cc2cb9e8-24c5-16f2-766e-3c0059f3be1c
Sat, 06 May 2023 22:26:34 +0000 backend INFO Volume published: Data_K8s_pvc-444aa725-5c30-4984-a051-79249897721f
Sat, 06 May 2023 22:26:34 +0000 backend INFO Volume published: Data_K8s_pvc-367e95d4-3dc0-4276-bee7-0c9bb1048ce3
Sat, 06 May 2023 22:26:42 +0000 backend INFO Token created (not logged)
Sat, 06 May 2023 22:26:42 +0000 backend INFO Volume found: Data_K8s_pvc-444aa725-5c30-4984-a051-79249897721f
Sat, 06 May 2023 22:26:44 +0000 backend INFO Volume found: Data_K8s_pvc-367e95d4-3dc0-4276-bee7-0c9bb1048ce3

Thanks a lot for your quick support!!!

datamattsson commented 1 year ago

> Thanks a lot for your quick support!!!

You're most welcome and thank you for working with me on this! These fixes will be part of the next release.

datamattsson commented 1 year ago

Fixed in #42