oxheadalpha / tezos-k8s

Deploy a Tezos Blockchain on Kubernetes
https://tezos-k8s.io/
MIT License

Investigate occasional activate-job errors #18

Closed: harryttd closed this issue 3 years ago

harryttd commented 4 years ago

I'm noticing that the activate-job occasionally errors. Usually the job restarts and the next try works; other times all retries fail.

Examples:

Logs were retrieved with a command similar to `k -n tqtezos1 logs activate-job-2vmh8 -c activate`.

  1. <<<<4: 500 Internal Server Error
    [ { "kind": "temporary", "id": "failure",
      "msg":
        "(Invalid_argument \"Json_encoding.construct: consequence of bad union\")" } ]
    Error:
    (Invalid_argument "Json_encoding.construct: consequence of bad union")
  2. <<<<2: 500 Internal Server Error
    [ { "kind": "permanent", "id": "proto.006-PsCARTHA.context.storage_error",
      "missing_key": [ "rolls", "owner", "current" ], "function": "copy" } ]
    Error:
    Storage error:
    Cannot copy undefined key 'rolls/owner/current'.

Seb sent me some code (from the Tezos source, I believe):

let () =
  register_error_kind
    `Permanent
    ~id:"context.storage_error"
    ~title: "Storage error (fatal internal error)"
    ~description:
      "An error that should never happen unless something \
       has been deleted or corrupted in the database."
    (* the excerpt ends here; the full registration also supplies a
       printer, an encoding, and wrap/unwrap functions *)

3. Sometimes I see this error:

<<<<4: 500 Internal Server Error
  [ { "kind": "temporary", "id": "failure", "msg": "Fitness too low" } ]
Error:
  Fitness too low

Seb says he has seen that one very often when trying to activate a protocol on a chain where one is already activated. It could be that minikube did not remove all of the relevant resources/storage after I deleted the namespace and re-applied the yaml. This could also be related to the second error, which points to deleted and/or corrupted data.
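One way to confirm that state (a sketch only; the service name and RPC port below are assumptions about the deployment, not taken from this repo) is to ask the node which protocol the head block is running:

    # Port-forward the node's RPC (service name "tezos-node-rpc" is an assumption)
    kubectl -n tqtezos1 port-forward svc/tezos-node-rpc 8732:8732 &

    # On a fresh chain the head metadata should still show the genesis protocol;
    # a real protocol hash here means activation already happened on this storage.
    curl -s http://localhost:8732/chains/main/blocks/head/metadata | grep protocol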

harryttd commented 4 years ago

I noticed that even when deleting the tqtezos namespace, the persistent volumes are sometimes not removed. Usually they are deleted along with the rest of the namespace. When the PVs were not removed and I applied the yaml again, I got the "Fitness too low" error. That makes sense, since the job is reusing a volume that already has an activated protocol stored on it.

After manually deleting the PVs and applying the yaml again, the activate-job worked. Then I deleted the namespace, confirmed the PVs were removed, and re-applied the yaml. Now I'm getting the rest of the activate-job errors. Deleting the namespace and re-applying works again.
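For reference, PVs are cluster-scoped objects, so deleting a namespace only removes the PVCs inside it; whether the bound PV is then cleaned up depends on its reclaim policy. A minimal check/cleanup sketch (the PV name is a placeholder):

    # List all PVs; leftovers typically show up as Released after the namespace is gone
    kubectl get pv

    # Manually remove a leftover volume before re-applying the yaml
    kubectl delete pv <pv-name>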

EDIT: I noticed that if I leave the cluster running overnight, close my MacBook, and delete the namespace the next day, the PVs persist. SSH'ing into minikube shows the volumes still exist.
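For anyone reproducing this, the on-disk data can be inspected from inside the minikube VM (the hostPath location below is the provisioner's usual default and may differ in other setups):

    minikube ssh
    # Inside the VM: minikube's dynamic hostPath provisioner keeps volume data here by default
    ls /tmp/hostpath-provisioner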

brandisimus commented 3 years ago

Aryeh, please add documentation for this in Development.md.