hpe-storage / truenas-csp

TrueNAS Container Storage Provider for HPE CSI Driver for Kubernetes
https://scod.hpedev.io
MIT License

issues reusing pv/pvc after pod reboot #12

Closed: travnewmatic closed this issue 3 years ago

travnewmatic commented 3 years ago
Events:
  Type     Reason              Age                From                     Message
  ----     ------              ----               ----                     -------
  Normal   Scheduled           40s                default-scheduler        Successfully assigned postgresql/postgresql-postgresql-0 to k3s0
  Warning  FailedAttachVolume  37s (x3 over 39s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-826a81fa-6678-4ad0-a158-eb7e37db9657" : rpc error: code = Internal desc = Failed to add ACL to volume nvme0_csi-volumes_pvc-826a81fa-6678-4ad0-a158-eb7e37db9657 for node &{ k3s0 5a59a5cd-4f1c-6b33-7330-000000000000 [0xc0007a8a00] [0xc000976e40 0xc000976e50 0xc000976e60] []  } via CSP, err: Request failed with status code 500 and errors Error code (Exception) and message (Traceback (most recent call last):
  File "/app/truenascsp.py", line 148, in on_put
    'serial_number': extent.get('naa').lstrip('0x'),
AttributeError: 'NoneType' object has no attribute 'lstrip'
)
  Warning  FailedAttachVolume  35s  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-826a81fa-6678-4ad0-a158-eb7e37db9657" : rpc error: code = Internal desc = Failed to add ACL to volume nvme0_csi-volumes_pvc-826a81fa-6678-4ad0-a158-eb7e37db9657 for node &{ k3s0 5a59a5cd-4f1c-6b33-7330-000000000000 [0xc000470aa0] [0xc000682500 0xc000682510 0xc000682520] []  } via CSP, err: Request failed with status code 500 and errors Error code (Exception) and message (Traceback (most recent call last):
  File "/app/truenascsp.py", line 148, in on_put
    'serial_number': extent.get('naa').lstrip('0x'),
AttributeError: 'NoneType' object has no attribute 'lstrip'
)
  Warning  FailedAttachVolume  23s (x2 over 31s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-826a81fa-6678-4ad0-a158-eb7e37db9657" : rpc error: code = Internal desc = Failed to add ACL to volume nvme0_csi-volumes_pvc-826a81fa-6678-4ad0-a158-eb7e37db9657 for node &{ k3s0 5a59a5cd-4f1c-6b33-7330-000000000000 [0xc00068d4a0] [0xc0008d0500 0xc0008d0510 0xc0008d0520] []  } via CSP, err: Request failed with status code 500 and errors Error code (Exception) and message (Traceback (most recent call last):
  File "/app/truenascsp.py", line 148, in on_put
    'serial_number': extent.get('naa').lstrip('0x'),
AttributeError: 'NoneType' object has no attribute 'lstrip'
)
  Warning  FailedAttachVolume  7s  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-826a81fa-6678-4ad0-a158-eb7e37db9657" : rpc error: code = Internal desc = Failed to add ACL to volume nvme0_csi-volumes_pvc-826a81fa-6678-4ad0-a158-eb7e37db9657 for node &{ k3s0 5a59a5cd-4f1c-6b33-7330-000000000000 [0xc000922930] [0xc000683000 0xc000683010 0xc000683020] []  } via CSP, err: Request failed with status code 500 and errors Error code (Exception) and message (Traceback (most recent call last):
  File "/app/truenascsp.py", line 148, in on_put
    'serial_number': extent.get('naa').lstrip('0x'),
AttributeError: 'NoneType' object has no attribute 'lstrip'
)

any tips on this?
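
Looking at the traceback, line 148 of truenascsp.py is reading the NAA serial off the iSCSI extent, so the extent record the CSP gets back apparently has no naa field at all. A rough sketch of the pattern that seems to be failing, with illustrative names rather than the actual CSP source:

def serial_number_from_extent(extent):
    # extent stands in for the dict the CSP gets back from /api/v2.0/iscsi/extent
    naa = extent.get('naa')
    if naa is None:
        # this is the case in the traceback: extent.get('naa').lstrip('0x')
        # raises AttributeError because the record has no NAA serial
        raise ValueError("extent %r has no NAA serial" % extent.get('name'))
    return naa.lstrip('0x')

print(serial_number_from_extent({'name': 'ok-extent', 'naa': '0x6589cfc000000'}))  # -> 6589cfc000000
try:
    serial_number_from_extent({'name': 'broken-extent'})
except ValueError as exc:
    print('broken extent:', exc)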

travnewmatic commented 3 years ago

I hate to complain, but this is so unbelievably painful. First it happened with my MinIO, so I ferried everything out of that PV into the MinIO server in TrueNAS.

Now Postgres, Redis, and MySQL are down, so none of my services work.

I know this is a free product and it doesn't come with a warranty. When this thing works, it works very well!

But when it doesn't... life is not so great :\

Not sure if others are experiencing the same problem, but if there's any other way I can help, please let me know!

travnewmatic commented 3 years ago

To troubleshoot, I've restarted both my TrueNAS appliance and my Kubernetes node. This appears to have no effect on any of the misbehaving volumes.

travnewmatic commented 3 years ago

Tried deleting and recreating the PV and PVC for the volumes; it didn't work.

I've also tried disabling/re-enabling the extent in TrueNAS, to no avail.

travnewmatic commented 3 years ago

Here are some logs from k logs -f hpe-csi-controller-56bc8cdd4-k9v5t -c csi-attacher | grep pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb:

I0308 10:41:59.970916       1 csi_handler.go:598] CSIHandler: processing PV "pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb": no deletion timestamp, ignoring
I0308 10:41:59.971750       1 connection.go:183] GRPC request: {"node_id":"5a59a5cd-4f1c-6b33-7330-000000000000","secrets":"***stripped***","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"allowOverrides":"sparse,compression,deduplication,volblocksize,sync,description","compression":"LZ4","deduplication":"OFF","fsType":"ext4","root":"pool0/csi-volumes","storage.kubernetes.io/csiProvisionerIdentity":"1612770601328-8081-csi.hpe.com","sync":"STANDARD","targetScope":"volume","volblocksize":"8K","volumeAccessMode":"mount"},"volume_id":"pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb"}
I0308 10:42:03.484858       1 connection.go:186] GRPC error: rpc error: code = Internal desc = Failed to add ACL to volume pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb for node &{ k3s0 5a59a5cd-4f1c-6b33-7330-000000000000 [0xc0006bd320] [0xc000709140 0xc000709150 0xc000709160] []  } via CSP, err: Request failed with status code 500 and errors Error code (Exception) and message (Traceback (most recent call last):
I0308 10:42:03.488421       1 csi_handler.go:222] Error processing "csi-a00a62e75930374aa58e9f0befc11792ff595c0dcd0ba7f5d1889940e2114bd9": failed to attach: rpc error: code = Internal desc = Failed to add ACL to volume pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb for node &{ k3s0 5a59a5cd-4f1c-6b33-7330-000000000000 [0xc0006bd320] [0xc000709140 0xc000709150 0xc000709160] []  } via CSP, err: Request failed with status code 500 and errors Error code (Exception) and message (Traceback (most recent call last):
I0308 10:44:00.145298       1 csi_handler.go:332] PV finalizer is already set on "pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb"
I0308 10:44:00.148045       1 connection.go:183] GRPC request: {"node_id":"5a59a5cd-4f1c-6b33-7330-000000000000","secrets":"***stripped***","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"allowOverrides":"sparse,compression,deduplication,volblocksize,sync,description","compression":"LZ4","deduplication":"OFF","fsType":"ext4","root":"pool0/csi-volumes","storage.kubernetes.io/csiProvisionerIdentity":"1612770601328-8081-csi.hpe.com","sync":"STANDARD","targetScope":"volume","volblocksize":"8K","volumeAccessMode":"mount"},"volume_id":"pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb"}
I0308 10:44:01.487843       1 connection.go:186] GRPC error: rpc error: code = Internal desc = Failed to add ACL to volume pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb for node &{ k3s0 5a59a5cd-4f1c-6b33-7330-000000000000 [0xc0006bde40] [0xc0008944c0 0xc0008944d0 0xc0008944e0] []  } via CSP, err: Request failed with status code 500 and errors Error code (Exception) and message (Traceback (most recent call last):
I0308 10:44:01.491140       1 csi_handler.go:222] Error processing "csi-a00a62e75930374aa58e9f0befc11792ff595c0dcd0ba7f5d1889940e2114bd9": failed to attach: rpc error: code = Internal desc = Failed to add ACL to volume pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb for node &{ k3s0 5a59a5cd-4f1c-6b33-7330-000000000000 [0xc0006bde40] [0xc0008944c0 0xc0008944d0 0xc0008944e0] []  } via CSP, err: Request failed with status code 500 and errors Error code (Exception) and message (Traceback (most recent call last):
I0308 10:49:01.491425       1 csi_handler.go:332] PV finalizer is already set on "pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb"
I0308 10:49:01.494429       1 connection.go:183] GRPC request: {"node_id":"5a59a5cd-4f1c-6b33-7330-000000000000","secrets":"***stripped***","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"allowOverrides":"sparse,compression,deduplication,volblocksize,sync,description","compression":"LZ4","deduplication":"OFF","fsType":"ext4","root":"pool0/csi-volumes","storage.kubernetes.io/csiProvisionerIdentity":"1612770601328-8081-csi.hpe.com","sync":"STANDARD","targetScope":"volume","volblocksize":"8K","volumeAccessMode":"mount"},"volume_id":"pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb"}
I0308 10:49:02.824599       1 connection.go:186] GRPC error: rpc error: code = Internal desc = Failed to add ACL to volume pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb for node &{ k3s0 5a59a5cd-4f1c-6b33-7330-000000000000 [0xc000662af0] [0xc0007081c0 0xc0007081d0 0xc0007081e0] []  } via CSP, err: Request failed with status code 500 and errors Error code (Exception) and message (Traceback (most recent call last):
I0308 10:49:02.827512       1 csi_handler.go:222] Error processing "csi-a00a62e75930374aa58e9f0befc11792ff595c0dcd0ba7f5d1889940e2114bd9": failed to attach: rpc error: code = Internal desc = Failed to add ACL to volume pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb for node &{ k3s0 5a59a5cd-4f1c-6b33-7330-000000000000 [0xc000662af0] [0xc0007081c0 0xc0007081d0 0xc0007081e0] []  } via CSP, err: Request failed with status code 500 and errors Error code (Exception) and message (Traceback (most recent call last):
I0308 10:51:59.970103       1 csi_handler.go:332] PV finalizer is already set on "pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb"
I0308 10:51:59.971032       1 controller.go:235] Started PV processing "pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb"
I0308 10:51:59.971037       1 csi_handler.go:594] CSIHandler: processing PV "pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb"
I0308 10:51:59.971043       1 csi_handler.go:598] CSIHandler: processing PV "pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb": no deletion timestamp, ignoring
I0308 10:51:59.972549       1 connection.go:183] GRPC request: {"node_id":"5a59a5cd-4f1c-6b33-7330-000000000000","secrets":"***stripped***","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"allowOverrides":"sparse,compression,deduplication,volblocksize,sync,description","compression":"LZ4","deduplication":"OFF","fsType":"ext4","root":"pool0/csi-volumes","storage.kubernetes.io/csiProvisionerIdentity":"1612770601328-8081-csi.hpe.com","sync":"STANDARD","targetScope":"volume","volblocksize":"8K","volumeAccessMode":"mount"},"volume_id":"pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb"}
I0308 10:52:03.712566       1 connection.go:186] GRPC error: rpc error: code = Internal desc = Failed to add ACL to volume pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb for node &{ k3s0 5a59a5cd-4f1c-6b33-7330-000000000000 [0xc0006fa0c0] [0xc000047140 0xc000047150 0xc000047160] []  } via CSP, err: Request failed with status code 500 and errors Error code (Exception) and message (Traceback (most recent call last):
I0308 10:52:03.716687       1 csi_handler.go:222] Error processing "csi-a00a62e75930374aa58e9f0befc11792ff595c0dcd0ba7f5d1889940e2114bd9": failed to attach: rpc error: code = Internal desc = Failed to add ACL to volume pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb for node &{ k3s0 5a59a5cd-4f1c-6b33-7330-000000000000 [0xc0006fa0c0] [0xc000047140 0xc000047150 0xc000047160] []  } via CSP, err: Request failed with status code 500 and errors Error code (Exception) and message (Traceback (most recent call last):
I0308 10:54:02.827899       1 csi_handler.go:332] PV finalizer is already set on "pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb"
I0308 10:54:02.830888       1 connection.go:183] GRPC request: {"node_id":"5a59a5cd-4f1c-6b33-7330-000000000000","secrets":"***stripped***","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"allowOverrides":"sparse,compression,deduplication,volblocksize,sync,description","compression":"LZ4","deduplication":"OFF","fsType":"ext4","root":"pool0/csi-volumes","storage.kubernetes.io/csiProvisionerIdentity":"1612770601328-8081-csi.hpe.com","sync":"STANDARD","targetScope":"volume","volblocksize":"8K","volumeAccessMode":"mount"},"volume_id":"pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb"}
I0308 10:54:04.421045       1 connection.go:186] GRPC error: rpc error: code = Internal desc = Failed to add ACL to volume pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb for node &{ k3s0 5a59a5cd-4f1c-6b33-7330-000000000000 [0xc0006fb8c0] [0xc00082f100 0xc00082f110 0xc00082f120] []  } via CSP, err: Request failed with status code 500 and errors Error code (Exception) and message (Traceback (most recent call last):
I0308 10:54:04.424571       1 csi_handler.go:222] Error processing "csi-a00a62e75930374aa58e9f0befc11792ff595c0dcd0ba7f5d1889940e2114bd9": failed to attach: rpc error: code = Internal desc = Failed to add ACL to volume pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb for node &{ k3s0 5a59a5cd-4f1c-6b33-7330-000000000000 [0xc0006fb8c0] [0xc00082f100 0xc00082f110 0xc00082f120] []  } via CSP, err: Request failed with status code 500 and errors Error code (Exception) and message (Traceback (most recent call last):
I0308 10:59:04.424859       1 csi_handler.go:332] PV finalizer is already set on "pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb"
I0308 10:59:04.428183       1 connection.go:183] GRPC request: {"node_id":"5a59a5cd-4f1c-6b33-7330-000000000000","secrets":"***stripped***","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"allowOverrides":"sparse,compression,deduplication,volblocksize,sync,description","compression":"LZ4","deduplication":"OFF","fsType":"ext4","root":"pool0/csi-volumes","storage.kubernetes.io/csiProvisionerIdentity":"1612770601328-8081-csi.hpe.com","sync":"STANDARD","targetScope":"volume","volblocksize":"8K","volumeAccessMode":"mount"},"volume_id":"pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb"}
I0308 10:59:05.769993       1 connection.go:186] GRPC error: rpc error: code = Internal desc = Failed to add ACL to volume pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb for node &{ k3s0 5a59a5cd-4f1c-6b33-7330-000000000000 [0xc000832f00] [0xc000047f80 0xc000047f90 0xc000047fa0] []  } via CSP, err: Request failed with status code 500 and errors Error code (Exception) and message (Traceback (most recent call last):
I0308 10:59:05.773306       1 csi_handler.go:222] Error processing "csi-a00a62e75930374aa58e9f0befc11792ff595c0dcd0ba7f5d1889940e2114bd9": failed to attach: rpc error: code = Internal desc = Failed to add ACL to volume pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb for node &{ k3s0 5a59a5cd-4f1c-6b33-7330-000000000000 [0xc000832f00] [0xc000047f80 0xc000047f90 0xc000047fa0] []  } via CSP, err: Request failed with status code 500 and errors Error code (Exception) and message (Traceback (most recent call last):
travnewmatic commented 3 years ago

Here are some logs from k logs -f truenas-csp-7c7d98c5d-grq7q | grep pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb:

Mon, 08 Mar 2021 10:12:00 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:12:03 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:15:58 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:15:59 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:18:15 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:18:15 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:18:54 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:18:55 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:18:57 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:19:01 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:19:10 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:19:26 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:20:00 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:20:01 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:20:04 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:20:08 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:20:16 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:20:32 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:21:24 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:21:25 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:21:26 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:21:27 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:21:27 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:21:28 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:21:29 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:21:30 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:21:39 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:21:39 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:21:56 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:21:57 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:22:00 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:22:03 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:22:29 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:22:30 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:24:39 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:24:39 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:28:56 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:28:57 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:32:00 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:32:03 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:33:57 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:33:58 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:38:59 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:38:59 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:42:00 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:42:03 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:44:00 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:44:01 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:49:01 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:49:02 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:52:00 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:52:03 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:54:03 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:54:04 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:59:04 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 10:59:05 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 11:02:00 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 11:02:03 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
travnewmatic commented 3 years ago

Here are some recent logs from k logs -f truenas-csp-7c7d98c5d-grq7q:

Mon, 08 Mar 2021 11:02:05 +0000 backend ERROR Exception: Traceback (most recent call last):
  File "/app/truenascsp.py", line 148, in on_put
    'serial_number': extent.get('naa').lstrip('0x'),
AttributeError: 'NoneType' object has no attribute 'lstrip'

Mon, 08 Mar 2021 11:02:31 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-056bc194-047d-4d66-8b91-55f922da3f0b
Mon, 08 Mar 2021 11:02:33 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-e0176a83-a87e-462e-8457-830998630492
Mon, 08 Mar 2021 11:02:35 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-a488756b-72cb-42b1-b74b-15e9a4727ce8
Mon, 08 Mar 2021 11:02:50 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-7c561bf2-8af1-4248-b72a-db7f7e384af0
Mon, 08 Mar 2021 11:02:55 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-e3d6abd6-e86b-4dc9-bf27-5c637f84cca0
Mon, 08 Mar 2021 11:03:04 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-dbcf41a5-ca4b-4937-89b5-4b3994ff36ab
Mon, 08 Mar 2021 11:03:08 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-e8c5f751-6ea3-41ea-b028-4d54a9cae47e
Mon, 08 Mar 2021 11:03:09 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-e525f74e-8565-4ac1-b9d3-ee5c0bcd0ea2
Mon, 08 Mar 2021 11:03:36 +0000 backend INFO Volume found: nvme0_csi-volumes_pvc-072bafae-04d4-48c4-a127-0f55a65da967
Mon, 08 Mar 2021 11:03:50 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-056bc194-047d-4d66-8b91-55f922da3f0b
Mon, 08 Mar 2021 11:04:06 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 11:04:06 +0000 backend INFO Host updated: 5a59a5cd-4f1c-6b33-7330-000000000000
Mon, 08 Mar 2021 11:04:06 +0000 backend ERROR Backend Request (DELETE) Exception: Traceback (most recent call last):
  File "/app/backend.py", line 281, in delete
    self.req_backend.raise_for_status()
  File "/usr/lib/python3.6/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 422 Client Error: Unprocessable Entity for url: https://192.168.1.202/api/v2.0/iscsi/target/id/34

Mon, 08 Mar 2021 11:04:06 +0000 backend ERROR Backend Request (DELETE) Exception: Traceback (most recent call last):
  File "/app/backend.py", line 281, in delete
    self.req_backend.raise_for_status()
  File "/usr/lib/python3.6/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 422 Client Error: Unprocessable Entity for url: https://192.168.1.202/api/v2.0/iscsi/extent/id/34

Mon, 08 Mar 2021 11:04:06 +0000 backend INFO Volume unpublished: pool0_csi-volumes_pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb
Mon, 08 Mar 2021 11:04:06 +0000 backend ERROR Backend Request (POST) Exception: Traceback (most recent call last):
  File "/app/backend.py", line 242, in post
    self.req_backend.raise_for_status()
  File "/usr/lib/python3.6/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 422 Client Error: Unprocessable Entity for url: https://192.168.1.202/api/v2.0/iscsi/target

Mon, 08 Mar 2021 11:04:07 +0000 backend ERROR Backend Request (POST) Exception: Traceback (most recent call last):
  File "/app/backend.py", line 242, in post
    self.req_backend.raise_for_status()
  File "/usr/lib/python3.6/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 422 Client Error: Unprocessable Entity for url: https://192.168.1.202/api/v2.0/iscsi/extent

Mon, 08 Mar 2021 11:04:07 +0000 backend ERROR Backend Request (POST) Exception: Traceback (most recent call last):
  File "/app/backend.py", line 242, in post
    self.req_backend.raise_for_status()
  File "/usr/lib/python3.6/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 422 Client Error: Unprocessable Entity for url: https://192.168.1.202/api/v2.0/iscsi/targetextent

Mon, 08 Mar 2021 11:04:07 +0000 backend ERROR Exception: Traceback (most recent call last):
  File "/app/truenascsp.py", line 148, in on_put
    'serial_number': extent.get('naa').lstrip('0x'),
AttributeError: 'NoneType' object has no attribute 'lstrip'

Mon, 08 Mar 2021 11:04:18 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-e0176a83-a87e-462e-8457-830998630492
Mon, 08 Mar 2021 11:04:20 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-e8c5f751-6ea3-41ea-b028-4d54a9cae47e
Mon, 08 Mar 2021 11:04:29 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-a488756b-72cb-42b1-b74b-15e9a4727ce8
Mon, 08 Mar 2021 11:04:37 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-e3d6abd6-e86b-4dc9-bf27-5c637f84cca0
Mon, 08 Mar 2021 11:04:38 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-7c561bf2-8af1-4248-b72a-db7f7e384af0
Mon, 08 Mar 2021 11:04:50 +0000 backend INFO Volume found: pool0_csi-volumes_pvc-dbcf41a5-ca4b-4937-89b5-4b3994ff36ab

The Kubernetes node and TrueNAS appliance have both been running for some time now.

datamattsson commented 3 years ago

This is really odd. What version of TrueNAS are you running? Did you upgrade TrueNAS and it went south? Somehow the extent API endpoints seem broken: https://192.168.1.202/api/v2.0/iscsi/extent/id/34 and https://192.168.1.202/api/v2.0/iscsi/extent

Turning on debug logging might reveal more but be careful pasting it here as it contains your API key.

If you can give me your exact TrueNAS version and how many nodes and PVs you have, I'll try to reproduce it tonight (this is not my day job).

Edit: I just wonder if it's as easy as the iSCSI service simply not being started on your TrueNAS?

travnewmatic commented 3 years ago

Hey, thanks so much for your response!

What version of TrueNAS are you running?

Version: TrueNAS-12.0-U2.1

Did you upgrade TrueNAS and it went south?

I did upgrade my TrueNAS recently, and it went smoothly. From what I recall, I stopped my Kubernetes node when I did the TrueNAS upgrade, rebooted the TrueNAS node, and then started the Kubernetes node again. At that point, all my volumes came back properly. Things didn't start going south until I ran a helm upgrade on my databases and the containers got recreated.

Turning on debug logging might reveal more but be careful pasting it here as it contains your API key.

I think I see how to do that.

how many nodes

It was one node, then two nodes, then back to one node.

and PVs you have

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                       STORAGECLASS   REASON   AGE
pvc-056bc194-047d-4d66-8b91-55f922da3f0b   8Gi        RWO            Delete           Bound    monica/monica                                               truenas-ssd             28d
pvc-072bafae-04d4-48c4-a127-0f55a65da967   100Gi      RWO            Delete           Bound    elasticsearch/elasticsearch-master-elasticsearch-master-0   truenas-nvme            27d
pvc-293084b3-1c85-43f5-af07-408093ebf78b   1Gi        RWO            Delete           Bound    funkwhale/funkwhale-frontend                                truenas-ssd             28d
pvc-3ae39163-1026-448c-861f-b07f7a1f531b   1Gi        RWO            Delete           Bound    prosody/prosody                                             truenas-ssd             28d
pvc-4f147b42-4d1a-4b8b-90f5-b5a502bc21f5   1Gi        RWO            Delete           Bound    bitwardenrs/bitwardenrs                                     truenas-ssd             28d
pvc-633c5390-02b7-49fb-9c4d-42f6ed67a12f   1Gi        RWO            Delete           Bound    relay/relay                                                 truenas-ssd             28d
pvc-70cc2c2f-8c25-45bc-ab2b-c1c65cb97ceb   1Gi        RWO            Delete           Bound    mattermost/mattermost-mattermost-team-edition-plugins       truenas-ssd             15h
pvc-7c561bf2-8af1-4248-b72a-db7f7e384af0   200Gi      RWO            Delete           Bound    ipfs/ipfs-ipfs-0                                            truenas-ssd             28d
pvc-826a81fa-6678-4ad0-a158-eb7e37db9657   200Gi      RWO            Delete           Bound    postgresql/data-postgresql-postgresql-0                     truenas-nvme            29d
pvc-928e5471-292a-46cf-8b49-5b5245c37f90   8Gi        RWO            Delete           Bound    mysql/data-mysql-0                                          truenas-nvme            27d
pvc-a2a6b059-eb0e-4cf3-8f06-205deb8da0ec   500Gi      RWX            Delete           Bound    media/media                                                 truenas-nfs             17d
pvc-a3a4f51a-2b36-4c54-9b0d-9dfc604eceec   20Gi       RWO            Delete           Bound    lemmy/pictrs                                                truenas-ssd             28d
pvc-a488756b-72cb-42b1-b74b-15e9a4727ce8   10Gi       RWO            Delete           Bound    peertube/peertube                                           truenas-ssd             28d
pvc-dbcf41a5-ca4b-4937-89b5-4b3994ff36ab   1Gi        RWO            Delete           Bound    freshrss/freshrss                                           truenas-ssd             18d
pvc-e0176a83-a87e-462e-8457-830998630492   1Gi        RWO            Delete           Bound    noip2/noip2                                                 truenas-ssd             28d
pvc-e3d6abd6-e86b-4dc9-bf27-5c637f84cca0   5Gi        RWO            Delete           Bound    media/jellyfin                                              truenas-ssd             17d
pvc-e525f74e-8565-4ac1-b9d3-ee5c0bcd0ea2   8Gi        RWO            Delete           Bound    wordpress/wordpress                                         truenas-ssd             29d
pvc-e8c5f751-6ea3-41ea-b028-4d54a9cae47e   8Gi        RWO            Delete           Bound    thelounge/thelounge                                         truenas-ssd             28d
pvc-f11f572c-84f8-44f3-bbde-570ccc6d3d6b   100Gi      RWO            Delete           Bound    funkwhale/funkwhale-data                                    truenas-ssd             28d
pvc-f255d8e4-fb90-47cf-a56b-2cfb59bf5988   16Gi       RWO            Delete           Bound    redis/redis-data-redis-master-0                             truenas-nvme            27d
  • redis
  • postgresql
  • mysql
  • mattermost

are having issues (though there's nothing in mattermost).

Edit: I just wonder if it's as easy as the iSCSI service simply not being started on your TrueNAS?

I just checked; iSCSI is running. I've tried rebooting the TrueNAS server since the problem started, and the same volumes have the same problems. All the other volumes get processed fine and their pods come up. It's just my poor databases that are having issues right now (and Mattermost, which doesn't matter... ironically).

I'll try to reproduce it tonight (this is not my day job).

Yes, thank you so much for investigating. I appreciate that this isn't your day job and that you have other responsibilities. This really is a fine product! Other than this occasional volume-rot problem, it works really, really well! truenas-csp has been a game changer for me!

I'll investigate how to enable debugging and see if I can get more logs out of TrueNAS.

datamattsson commented 3 years ago

Apologies about the debug logging: I thought that was parameterized in the YAML, but you can set LOG_DEBUG as an environment variable to "true" in the CSP YAML manifest.
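(If editing the manifest is awkward, something like kubectl set env deployment/truenas-csp LOG_DEBUG=true -n <csp-namespace> should work too, assuming the deployment is named truenas-csp as the pod name above suggests and substituting whatever namespace the CSP actually runs in; it updates the pod template and rolls the pod.)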

I upgraded my TrueNAS box to 12.0-U2.1, deployed Redis on Kubernetes, and rebooted, evicted, and cordoned nodes to ensure the workload could move around. This is working as expected.

I'm just curious what the actual NAA is on the extent of those PVs that are having problems? You can find this under Sharing > Block (iSCSI) > Extents. You can also delete the "Targets" and "Extents" of the troublesome PVs and see if that helps; they will be re-created on the next node publish request.

travnewmatic commented 3 years ago

Aw man, you beat me to it! :D That's exactly what I ended up doing:

in Kubernetes

Within the past hour I got my databases running again.

Apologies about the debug logging: I thought that was parameterized in the YAML, but you can set LOG_DEBUG as an environment variable to "true" in the CSP YAML manifest.

I'll play with that right now

I upgraded my TrueNAS box to 12.0-U2.1, deployed Redis on Kubernetes, and rebooted, evicted, and cordoned nodes to ensure the workload could move around. This is working as expected.

It's a very intermittent problem. Perhaps scale your workloads to zero and back up again repeatedly? That's all I can think to do to recreate the problem.

I'm just curious what the actual NAA is on the extent of those PVs that are having problems?

As I've deleted them, I don't have them anymore :( but when it happens again, I'll look for that. I'm sorry I don't have that to report now.

You can also delete the "Targets" and "Extents" of the troublesome PVs and see if that helps; they will be re-created on the next node publish request.

Is deletion of the PV and PVC also necessary? I think I tried just deleting the target and the extent, but the volume still wouldn't mount.

datamattsson commented 3 years ago

Is deletion of the PV and PVC also necessary? I think I tried just deleting the target and the extent, but the volume still wouldn't mount.

Deleting the PVC/PV should not be necessary. However, the back-off could be minutes before you see the workload recover if it's been retrying for a while.

travnewmatic commented 3 years ago

Gotcha.

I apologize that I don't have much more to give you now to help troubleshoot this further :(

Thank you for being so responsive in helping me fix this.

If this happens again, I'll hold off on deleting things so that I/we can investigate further.

Again, I really appreciate your work on this.

Aside from flapping my workloads, is there anything else I can check that might help you diagnose this?

datamattsson commented 3 years ago

No worries, we'll try to catch it live next time. I'm leaving my workload scaling up and down overnight and we'll see what it yields. If it hits again we need to inspect the extents and targets, basically to find out why the TrueNAS API is throwing an HTTP 422 and whether there's any other content besides the HTTP code that might reveal a clue.
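
Something like this against the TrueNAS API should show both, assuming an API key with Bearer authentication and the same host and endpoints that appear in your CSP logs (adjust the host, key, and certificate handling to your setup):

import requests

TRUENAS = 'https://192.168.1.202'   # host from the logs above
API_KEY = '<your TrueNAS API key>'  # placeholder, never paste a real key into the issue

session = requests.Session()
session.headers.update({'Authorization': 'Bearer ' + API_KEY})
session.verify = False  # assumption: self-signed certificate on the appliance

for resource in ('iscsi/extent', 'iscsi/target', 'iscsi/targetextent'):
    resp = session.get(TRUENAS + '/api/v2.0/' + resource)
    print(resource, resp.status_code)
    if not resp.ok:
        # the body is what we want to see next to the 422s in the CSP log
        print(resp.text)
        continue
    for item in resp.json():
        # only extents carry an NAA serial; the other resources print None for fields they lack
        print(item.get('id'), item.get('name'), item.get('naa'))

That would also answer the NAA question above for the extents that misbehave.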

travnewmatic commented 3 years ago

gotcha, that sounds good.

thanks again for the assistance and have a great day!