vesoft-inc / nebula

A distributed, fast open-source graph database featuring horizontal scalability and high availability
https://nebula-graph.io
Apache License 2.0
10.74k stars 1.2k forks

Nebula resiliency #5335

Open porscheme opened 1 year ago

porscheme commented 1 year ago

Our cluster runs in Azure with automatic patching and upgrades (Kubernetes upgrades) enabled. Manual intervention is often required when a VM gets stuck during an upgrade.

MegaByte875 commented 1 year ago

@porscheme I have some questions about your scenario:

porscheme commented 1 year ago

Thanks for the reply, @MegaByte875. The cluster version we are using is v3.3.0.

@porscheme I have some questions about your scenario:

  • How are the graphd, metad, and storaged pods distributed across the K8s nodes?

[Porsche] Each component has its own separate K8s node pool and subnet. We are using the Azure Standard_L16as_v3 SKU. Autoscaling is disabled.

  • Do you set a PodDisruptionBudget (PDB) to guarantee service availability?

[Porsche] No, we are not using any PDB, since the official Nebula docs don't mention it. Should we use a PDB? Can you point me to any Nebula docs on this?

  • Do you use NVMe SSDs for storaged?

[Porsche] Yes, the Azure Standard_L16as_v3 SKU comes with two 2 TB NVMe SSDs attached to the VM.

  • Did partition leaders exist on the storage node before the upgrade?

[Porsche] Yes. Before the upgrade, partition leaders existed on the storage node and were balanced with the other storage nodes. During the upgrade, however, the leader count on that storage node drops to zero (as reported by SHOW HOSTS).
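
On the PDB question above: a Kubernetes PodDisruptionBudget limits how many matching pods a voluntary disruption, such as a node drain during an upgrade, may evict at the same time. The following is a minimal sketch, not an official Nebula recommendation. It builds one `policy/v1` PodDisruptionBudget per component and prints them as a JSON list that `kubectl apply -f -` can read. The cluster name `nebula`, the namespace `nebula`, the `build_pdb` helper, and the `app.kubernetes.io/*` label keys are all assumptions for illustration; check the labels actually present on your pods (`kubectl get pods --show-labels`) before using anything like this.

```python
import json


def build_pdb(component: str, cluster: str, namespace: str,
              max_unavailable: int = 1) -> dict:
    """Return a policy/v1 PodDisruptionBudget manifest as a plain dict.

    The label keys in the selector are assumed, not taken from the thread;
    replace them with the labels your pods really carry.
    """
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {
            "name": f"{cluster}-{component}-pdb",
            "namespace": namespace,
        },
        "spec": {
            # Allow at most this many pods of the component to be
            # evicted at once by voluntary disruptions (e.g. a drain).
            "maxUnavailable": max_unavailable,
            "selector": {
                "matchLabels": {
                    "app.kubernetes.io/cluster": cluster,      # assumed label
                    "app.kubernetes.io/component": component,  # assumed label
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical cluster and namespace names.
    items = [
        build_pdb(component, cluster="nebula", namespace="nebula")
        for component in ("graphd", "metad", "storaged")
    ]
    # kubectl accepts JSON as well as YAML, so the output can be piped
    # straight into `kubectl apply -f -`.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

With `maxUnavailable` set to 1, a drain should evict at most one pod per component at a time and wait for it to become ready again before evicting the next. Whether that alone keeps partition leaders available during an upgrade depends on the replica factor and readiness probes, which this sketch does not address.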

porscheme commented 1 year ago

Azure patched our K8s cluster today, and the Nebula cluster went down. Our Nebula reliability metrics dropped!

MegaByte875 commented 1 year ago

Here is an implementation plan that I hope will help you: