Nebula resiliency - GitHub Issues

porscheme commented 1 year ago

Cluster config
metad: 3
Storaged: 3
metad: 5
Each storage node has 2 x 2 TB NVMe SSD disks.

Space config
VID: String (length 20)
Partition number: 200
Replica factor: 3

Our cluster runs in Azure. We enabled auto patching and upgrade (Kubernetes upgrade). Manual intervention is often required when a VM gets stuck during an upgrade.

When this happens, we cannot query the Nebula cluster. Is this expected? Has anyone else seen this?
When the VM comes back up, we sometimes lose the media on the PVC. In this scenario, does Nebula recover the data from the other replicas?
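
For context on the first question, here is a minimal sketch of the availability arithmetic, assuming each partition is replicated with a Raft-style majority quorum and that the replica factor of 3 is spread over the 3 storaged hosts. The function names are illustrative and not part of any Nebula API, and the sketch says nothing about whether a lost disk is rebuilt automatically.

```python
def raft_quorum(replica_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replica_factor // 2 + 1


def partition_available(replica_factor: int, replicas_down: int) -> bool:
    """A partition can keep a leader only while a majority of its replicas is up."""
    return replica_factor - replicas_down >= raft_quorum(replica_factor)


# Replica factor from the space config above. With 3 storaged hosts, every
# host is assumed to hold one replica of each of the 200 partitions.
REPLICA_FACTOR = 3

for down in range(REPLICA_FACTOR + 1):
    state = "available" if partition_available(REPLICA_FACTOR, down) else "unavailable"
    print(f"{down} storaged down -> partitions {state}")
```

Under these assumptions, one stuck storaged host should still leave a majority for every partition, so a full outage would suggest that more than one storaged (or another component such as metad or graphd) was unavailable at the same time.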

MegaByte875 commented 1 year ago

@porscheme I have some questions about your scenario:

How are the graphd, metad, and storaged pods distributed across the K8s nodes?
Have you set a PDB (PodDisruptionBudget) to guarantee service availability?
Do you use NVMe SSD disks for storaged?
Did partition leaders exist on the storage node before the upgrade?

porscheme commented 1 year ago

Thanks @MegaByte875 for the reply. The cluster version we are using is v3.3.0.

@porscheme I have some questions about your scenario:

How are the graphd, metad, and storaged pods distributed across the K8s nodes?

[Porsche] Each component has its own separate K8s node pool and subnet. We are using the Azure Standard_L16as_v3 SKU. Autoscaling is disabled.

Have you set a PDB (PodDisruptionBudget) to guarantee service availability?

[Porsche] No, we are not using a PDB, since the official Nebula docs don't mention it. Should we use one? Can you point me to any Nebula docs on this?
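
For reference, a minimal sketch of what a PDB for the storaged pods could look like, written as a small Python script that prints a `policy/v1` PodDisruptionBudget manifest as JSON (which `kubectl apply -f` accepts). The object name, namespace, and label selector are assumptions for illustration and must be replaced with the labels actually set on the storaged pods.

```python
import json


def build_pdb(name: str, namespace: str, match_labels: dict, max_unavailable: int = 1) -> dict:
    """Build a policy/v1 PodDisruptionBudget manifest as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # At most this many matching pods may be evicted voluntarily at once.
            "maxUnavailable": max_unavailable,
            "selector": {"matchLabels": match_labels},
        },
    }


# Hypothetical name, namespace, and label; check the real pod labels first.
pdb = build_pdb(
    name="nebula-storaged-pdb",
    namespace="nebula",
    match_labels={"app.kubernetes.io/component": "storaged"},
)
print(json.dumps(pdb, indent=2))
```

With `maxUnavailable: 1`, a node drain during an upgrade would be blocked from evicting a second storaged pod while the first is still unavailable. A PDB only covers voluntary evictions, so it does not help when a VM fails outright.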

Do you use NVMe SSD disks for storaged?

[Porsche] Yes, the Azure Standard_L16as_v3 SKU comes with two 2 TB NVMe SSD disks attached to the VM.

Did partition leaders exist on the storage node before the upgrade?

[Porsche] Yes. Before the upgrade, partition leaders exist on the storage node and are balanced with the other storage nodes. During the upgrade, however, the leader count on that storage node drops to zero (as reported by SHOW HOSTS).

porscheme commented 1 year ago

Azure patched our K8s cluster today and the Nebula cluster went down. Our Nebula reliability metrics dropped!

While AKS was patching, we saw a standby VM being brought up, as expected.
We are not sure whether Nebula recognized this new VM.
How do Nebula installations around the world handle this situation?
What options are available?
Does anyone know how TigerGraph and Neo4j handle these situations?

MegaByte875 commented 1 year ago

Here is an implementation plan that I hope will help:

Add new buffer nodes that run the target Kubernetes version to the cluster.
Cordon and drain one of the old nodes to minimize interruptions to running applications.
1. Use a PDB to limit the number of Pods that can be evicted at one time, which bounds how much of the service is unavailable at once.
2. Use a ValidatingAdmissionWebhook so that the cluster performs pre-offline cleanup and preparation work before a Pod's deletion request is accepted.
  1. The operator controller watches for Pod change events.
  2. The operator controller starts to synchronize object states and attempts to delete the Pods that need to be evicted.
  3. The kube-apiserver calls the operator webhook interface.
  4. The webhook server asks the cluster to perform the pre-offline preparation work (the request is idempotent) and checks whether that work is complete. If it is complete, the deletion is allowed. If not, the deletion is denied.
  5. Because of the operator's control loop, the process returns to step 2.
Once an old node is completely drained, its VM image is reset to the new version and it becomes a buffer node for the next node to upgrade.
This process repeats until all nodes in the cluster are upgraded.
At the end of the process, the buffer nodes used for the upgrade are deleted to restore the original node count and regional balance. @porscheme
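
The webhook loop in steps 1 to 5 can be sketched roughly as follows. This is a simulation under stated assumptions, not the operator's actual code: `FakeCluster`, `prepare_offline`, `is_prepared`, and the pod name are hypothetical stand-ins, and the "preparation work" is modeled as moving partition leaders off the pod. Only the shape of the `admission.k8s.io/v1` AdmissionReview response follows the Kubernetes API.

```python
class FakeCluster:
    """Hypothetical stand-in for the graph cluster that the webhook would call."""

    def __init__(self, leaders: dict, work_per_call: int):
        self.leaders = dict(leaders)        # pod name -> partition leaders it hosts
        self.work_per_call = work_per_call  # leaders moved away per request

    def prepare_offline(self, pod: str) -> None:
        """Safe to repeat: each call only moves whatever leaders remain."""
        remaining = self.leaders.get(pod, 0)
        self.leaders[pod] = max(0, remaining - self.work_per_call)

    def is_prepared(self, pod: str) -> bool:
        return self.leaders.get(pod, 0) == 0


def review_delete(request: dict, cluster: FakeCluster) -> dict:
    """Decide on a Pod DELETE request: allow only once preparation is complete."""
    pod = request["name"]
    cluster.prepare_offline(pod)
    allowed = cluster.is_prepared(pod)
    response = {"uid": request["uid"], "allowed": allowed}
    if not allowed:
        response["status"] = {"message": f"{pod} is still being prepared for shutdown"}
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview", "response": response}


# Simulate the operator's control loop retrying the deletion until it is allowed.
cluster = FakeCluster({"nebula-storaged-0": 67}, work_per_call=40)
request = {"uid": "example-uid", "name": "nebula-storaged-0"}
attempts = 0
while True:
    attempts += 1
    review = review_delete(request, cluster)
    if review["response"]["allowed"]:
        break
print(f"deletion allowed after {attempts} attempts")
```

Denying the request instead of blocking inside the webhook keeps each admission call short, and the retry comes for free from the controller's reconcile loop.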

vesoft-inc / nebula

Nebula resiliency #5335