piraeusdatastore / piraeus-ha-controller

High Availability Controller for stateful workloads using storage provisioned by Piraeus
Apache License 2.0

CrashLoopBackOff: failed to parse drbdsetup json: json: cannot unmarshal number #21

Closed: blampe closed this issue 2 years ago

blampe commented 2 years ago
    I0919 15:12:47.660781       1 merged_client_builder.go:121] Using in-cluster configuration
    I0919 15:12:47.719448       1 agent.go:92] setting up PersistentVolume informer
    I0919 15:12:47.725963       1 agent.go:121] setting up Pod informer
    I0919 15:12:47.726131       1 agent.go:140] setting up VolumeAttachment informer
    I0919 15:12:47.752953       1 agent.go:179] version: v1.1.0
    I0919 15:12:47.753039       1 agent.go:180] node: 4c
    I0919 15:12:47.753064       1 agent.go:182] setting up event broadcaster
    I0919 15:12:47.755290       1 agent.go:189] setting up periodic reconciliation ticker
    I0919 15:12:47.758167       1 drbd.go:39] updating drbd state
    I0919 15:12:47.764229       1 agent.go:224] starting reconciliation
    I0919 15:12:47.764360       1 drbd.go:60] Checking if DRBD is loaded
    I0919 15:12:47.764624       1 drbd.go:70] Command: drbdsetup status --json
    I0919 15:12:47.764744       1 agent.go:247] managing node taints failed: own node does not exist
    I0919 15:12:47.764783       1 agent.go:250] Own node taints synced
    I0919 15:12:47.769233       1 reflector.go:219] Starting reflector *v1.PersistentVolume (15m0s) from pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167
    I0919 15:12:47.769394       1 reflector.go:255] Listing and watching *v1.PersistentVolume from pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167
    I0919 15:12:47.769413       1 reflector.go:219] Starting reflector *v1.Pod (15m0s) from pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167
    I0919 15:12:47.769499       1 reflector.go:255] Listing and watching *v1.Pod from pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167
    I0919 15:12:47.776946       1 reflector.go:219] Starting reflector *v1.VolumeAttachment (15m0s) from pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167
    I0919 15:12:47.777261       1 reflector.go:255] Listing and watching *v1.VolumeAttachment from pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167
    I0919 15:12:47.787303       1 reflector.go:219] Starting reflector *v1.Node (15m0s) from pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167
    I0919 15:12:47.787502       1 reflector.go:255] Listing and watching *v1.Node from pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167
    I0919 15:12:47.867921       1 agent.go:214] drbd syncer done
    I0919 15:12:47.868245       1 reflector.go:225] Stopping reflector *v1.Node (15m0s) from pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167
    I0919 15:12:47.869509       1 reflector.go:225] Stopping reflector *v1.VolumeAttachment (15m0s) from pkg/mod/k8s.io/client-go@v0.24.1/tools/cache/reflector.go:167
    E0919 15:12:47.869479       1 run.go:74] "command failed" err="failed to parse drbdsetup json: json: cannot unmarshal number 18446744073709551608 into Go struct field DrbdConnection.connections.ap-in-flight of type int"
    Stream closed EOF for piraeus/ha-8682e2af-piraeus-ha-controller-fxp2z (piraeus-ha-controller)

Relevant portion of drbdsetup status --json:

    {
      "peer-node-id": 0,
      "name": "4b",
      "connection-state": "Connecting",
      "congested": false,
      "peer-role": "Unknown",
      "ap-in-flight": 18446744073709551608,
      "rs-in-flight": 0,
      "peer_devices": [
        {
          "volume": 0,
          "replication-state": "Off",
          "peer-disk-state": "DUnknown",
          "peer-client": false,
          "resync-suspended": "no",
          "received": 0,
          "sent": 0,
          "out-of-sync": 0,
          "pending": 0,
          "unacked": 0,
          "has-sync-details": false,
          "has-online-verify-details": false,
          "percent-in-sync": 100.00
        } ]
    } ]
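
For context: 18446744073709551608 is 2^64 - 8, which fits in a uint64 but exceeds the maximum of Go's signed int, so encoding/json refuses to decode it. A minimal sketch reproducing the failure; the struct here is illustrative, not the controller's actual DrbdConnection type:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // Illustrative struct, not the controller's actual type.
    type conn struct {
        ApInFlight int `json:"ap-in-flight"`
    }

    func main() {
        data := []byte(`{"ap-in-flight": 18446744073709551608}`)

        var c conn
        // Fails: the value exceeds math.MaxInt64, so it cannot be
        // decoded into a signed int field.
        fmt.Println(json.Unmarshal(data, &c))

        var u struct {
            ApInFlight uint64 `json:"ap-in-flight"`
        }
        // Succeeds: the same value fits in an unsigned 64-bit field.
        fmt.Println(json.Unmarshal(data, &u), u.ApInFlight)
    }
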
WanzenBug commented 2 years ago

Thanks for the report.

That's what I get for simply copy&pasting the JSON in my IDE :/

Actually, we don't need 90% of those fields; they are just a parsing hazard. I'll open a PR shortly that removes all of them.
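
The idea, roughly: encoding/json silently skips fields the target struct does not declare, so a struct trimmed down to what the controller actually reads never decodes the overflowing counter at all. A sketch with an illustrative field selection, not the exact upstream struct:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // Trimmed-down struct: only the fields actually consumed.
    // Undeclared fields like "ap-in-flight" are skipped entirely.
    type drbdConnection struct {
        PeerNodeID      int    `json:"peer-node-id"`
        Name            string `json:"name"`
        ConnectionState string `json:"connection-state"`
    }

    func main() {
        data := []byte(`{
            "peer-node-id": 0,
            "name": "4b",
            "connection-state": "Connecting",
            "ap-in-flight": 18446744073709551608
        }`)

        var c drbdConnection
        if err := json.Unmarshal(data, &c); err != nil {
            panic(err)
        }
        // Prints {PeerNodeID:0 Name:4b ConnectionState:Connecting};
        // the problematic counter is simply ignored.
        fmt.Printf("%+v\n", c)
    }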

dansomething commented 2 years ago

FWIW, we're receiving the same error. This is what we have in the drbdsetup status --json output.

    {
      "peer-node-id": 1,
      "name": "srv1",
      "connection-state": "Connected", 
      "congested": false,
      "peer-role": "Secondary",
      "ap-in-flight": 18446744073709551584,
      "rs-in-flight": 0,
      "peer_devices": [
        {
          "volume": 0,
          "replication-state": "Established",
          "peer-disk-state": "UpToDate",
          "peer-client": false,
          "resync-suspended": "no",
          "received": 0,
          "sent": 964712,
          "out-of-sync": 0,
          "pending": 0,
          "unacked": 0,
          "has-sync-details": false,
          "has-online-verify-details": false,
          "percent-in-sync": 100.00
        } ]
    }

We also noticed that drbd-utils stores the same field as uint64. https://github.com/LINBIT/drbd-utils/blob/master/user/v9/drbdsetup.c#L2737
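
Both values reported in this thread look like small negative counters wrapped around as unsigned 64-bit integers: 18446744073709551608 is 2^64 - 8 and 18446744073709551584 is 2^64 - 32. A quick check, using nothing beyond standard Go:

    package main

    import "fmt"

    func main() {
        // The two values from the reports in this issue.
        for _, v := range []uint64{18446744073709551608, 18446744073709551584} {
            // Reinterpreting the unsigned bit pattern as signed
            // recovers the small negative counter value.
            fmt.Println(v, "->", int64(v)) // -8 and -32
        }
    }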

We worked around the issue in the short term by disconnecting and reconnecting the secondary node.

WanzenBug commented 2 years ago

Fixed version released as 1.1.1