nforgeio / neonKUBE

Public NeonKUBE Kubernetes distribution related projects
https://neonkube.io
Apache License 2.0

Implement AzureHostingManager #908

Closed jefflill closed 4 years ago

jefflill commented 4 years ago

Implement neonKUBE setup for Azure:

jefflill commented 4 years ago

Let's start by thinking about how we'll approach Azure before writing a bit of a specification. Here are some good links to get started with:

https://docs.microsoft.com/en-us/azure/availability-zones/az-overview
https://docs.microsoft.com/en-us/azure/virtual-machines/windows/manage-availability
https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-overview
https://docs.microsoft.com/en-us/azure/virtual-machines/linux/scheduled-events
https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-overview

Availability

https://docs.microsoft.com/en-us/azure/virtual-machines/windows/manage-availability

Azure provides mechanisms for deploying multiple VMs such that the chance of all of them becoming unavailable at the same time will (hopefully) be rare. VM availability can be impacted by local hardware failures in the physical rack hosting the VM, a failure of an Azure availability zone, a hardware failure of the machine hosting the VM, or a software update required by the hosting machine.

Production neonKUBE clusters will generally deploy 3 Kubernetes manager nodes. These nodes coordinate to form a highly reliable source of truth for the expected state of the services and pods running on the cluster. Managers rely on a Raft-style leader election mechanism combined with mechanisms for reliably replicating state across the managers. Raft requires a majority of the managers (2 out of 3 in this case) to be available for the cluster to operate properly. This means that cluster deployment must take care to ensure that only one manager at a time is at risk of failing or being updated and rebooted.
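To make the constraint concrete, here's a quick Python sketch of the quorum arithmetic (just an illustration, not neonKUBE code):

```python
# Quorum math for a Raft-style manager cluster: a cluster with N managers
# tolerates the loss of N - quorum members.  For the typical 3-manager
# neonKUBE deployment this means only one manager may be down (or rebooting
# for maintenance) at any given time.
def raft_quorum(manager_count: int) -> int:
    """Returns the minimum number of managers that must remain healthy."""
    return manager_count // 2 + 1

for n in (1, 3, 5):
    print(f"{n} managers -> quorum {raft_quorum(n)}, tolerates {n - raft_quorum(n)} failure(s)")
```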

Most neonKUBE operators will also need this sort of resilience. For example, they may wish to deploy multiple web servers such that some will always be available in the face of failures, or they may be deploying a multi-node database cluster where data is replicated to prevent a single point of failure.

Azure has several related concepts:

Azure, AWS, and Google Compared

Since this is the first cloud deployment for neonKUBE, let's stop and contrast the terminology used by the three major providers so we can choose what terminology we'll expose for neonKUBE. The hope is that we can standardize on a deployment model that can work on all clouds out-of-the-box.

Here's an article contrasting Azure and AWS: link

| neonKUBE | Azure | AWS | Google |
|---|---|---|---|
| region | region | region | region |
| zone | availability zone | availability zone | zone |
| availability group | availability set | placement groups | placement policy |
| -na- | fault/update domains | partitions | -na- |
| VNet | VNet (virtual network) | VPC (virtual private cloud) | VPC (virtual private cloud) |

The region and zone concepts line up pretty well across Azure, AWS, and Google Cloud. Regions map to a collection of datacenters that are close together physically as well as from a network perspective, with each of these datacenters having separate power, cooling, and ingress networks for fault tolerance. The zone concepts for Azure and AWS align very closely by essentially identifying the separate datacenters in a region.

This is a bit different for Google: they do name their zones, but they abstract things such that nodes deployed by different users may actually end up in different datacenters. Google does this so they can better balance load across their infrastructure and to help avoid having a popular zone fill up and prevent new VMs from deploying. This is basically just an implementation detail though. It doesn't really impact neonKUBE.

The three clouds provide somewhat different VM placement options to help avoid single hardware points of failure as well as to optimize network throughput and latency.

Azure has the concept of availability sets: operators can create an arbitrary number of these sets and assign VMs to each. Azure will ensure that the VMs in each availability set are provisioned across distinct groups of physical hardware called fault domains. The number of fault domains is fixed at 3. New VMs added to an availability set will be assigned to fault domains in a round-robin manner. Fault domains help isolate hardware failures.

Azure also explicitly describes update domains. These are used to influence how Azure performs maintenance operations on VMs. The number of update domains is assigned to your Azure account: the minimum is 5 domains, and you can ask for your account to be upgraded to as many as 20. VMs added to an availability set will be assigned to update domains round-robin as well. Azure guarantees that it will perform maintenance on only one update domain at a time and that it will wait 30 minutes after finishing an update domain to allow things to stabilize before moving on to the next.
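As a rough illustration of the round-robin behavior described above (a sketch of the documented spreading behavior, not Azure's actual assignment algorithm), the Nth VM added to an availability set lands in fault domain N mod 3 and update domain N mod the account's update domain count:

```python
# Illustrative round-robin placement for an Azure availability set.  The
# actual assignments are made by Azure; this just shows how VMs spread
# across the (fixed) 3 fault domains and the account's update domains.
FAULT_DOMAINS = 3
UPDATE_DOMAINS = 5   # account default; can be raised up to 20

def place(vm_index: int) -> tuple[int, int]:
    """Returns (fault_domain, update_domain) for the Nth VM added to the set."""
    return vm_index % FAULT_DOMAINS, vm_index % UPDATE_DOMAINS

for i in range(6):
    fd, ud = place(i)
    print(f"vm-{i}: fault domain {fd}, update domain {ud}")
```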

AWS is a bit different. Their concept is placement groups. Whereas Azure availability sets are really just for managing faults and updates, AWS placement groups combine that with tuning network performance by influencing where VMs end up being provisioned physically.

Note that AWS does have restrictions related to placement policies. For example, some instance types are not compatible.

Google has two types of placement policy: spread and compact. Spread policies ensure that each VM runs on a different host and will not be subject to the same power or network problems. Compact policies try to locate VMs as close together as possible, and it looks like they may end up on the same host machines. A maximum of 8 VMs can be assigned to a spread policy and 22 VMs to a compact policy. Note that only a limited number of instance types are supported and live migration is not supported for compact. This probably isn't going to work for us.

It doesn't look like either AWS or Google makes any guarantees about how VM maintenance is performed the way Azure does for update domains. All three platforms try hard to live migrate VMs away from hardware they're going to work on, so I don't think we need to worry about this for the time being.


So the question is how to model this at the neonKUBE level. I really don't want to expose these cloud-specific concepts in the cluster definition because they're pretty arcane and different. But we need to solve these problems:

I'd like the user to be able to define availability groups for their cluster and then assign workers to these groups, the idea being that all of the nodes in a group would be spread out across several fault domains. For example, say a user had a database cluster replicating sharded data across nodes:

| shard | replica nodes |
|---|---|
| shard0 | node-shard0-copy0, node-shard0-copy1 |
| shard1 | node-shard1-copy0, node-shard1-copy1 |
| shard2 | node-shard2-copy0, node-shard2-copy1 |

So we have a database split into three shards, with each shard having copies on two nodes. For better availability, the user would create three availability groups, one per shard, and assign the two nodes hosting each shard's data to the corresponding neonKUBE availability group. This maps well to Azure and AWS but not to Google, where we just don't seem to have enough control.
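Here's a small Python sketch of what that node-to-group assignment could look like; the label name and grouping scheme are hypothetical, not the final cluster definition syntax:

```python
# Hypothetical node -> availability group assignments for the sharded
# database example above (the group names simply reuse the shard names).
from collections import defaultdict

nodes = {
    "node-shard0-copy0": "shard0",
    "node-shard0-copy1": "shard0",
    "node-shard1-copy0": "shard1",
    "node-shard1-copy1": "shard1",
    "node-shard2-copy0": "shard2",
    "node-shard2-copy1": "shard2",
}

# Each neonKUBE availability group would map to one Azure availability set
# (or one AWS partition placement group), keeping a shard's replicas on
# separate fault domains.
groups = defaultdict(list)
for node, group in nodes.items():
    groups[group].append(node)

for group, members in sorted(groups.items()):
    print(f"availability group '{group}': {members}")
```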

Here's how this would work for each cloud platform:

AWS: We'd deploy managers to a dedicated partition placement group with 5 partitions and then spread the managers across these partitions.

For worker nodes, we'd create a partition group for each neonKUBE availability group with 5 partitions and spread the nodes across these.

Google: I don't think we'll be able to support neonKUBE availability groups here. They're just too restrictive and don't really do what we want. I was thinking we could at least use a spread policy for the managers, but even that would restrict the instance types allowed.

We'll just deploy VMs without any restrictions and hope the cloud is reliable enough.


neonKUBE Plan for Azure

For the first cut at deploying neonKUBE to Azure, we're going to make a simplifying assumption:

The 99.95% SLA for availability sets is good enough for any single cluster. Even though the 99.99% SLA for availability zones is a bit better, we believe it's not worth the extra complexity, and we expect that most operators who aspire to even better uptime would be better off deploying a second cluster in a different region. Deploying another cluster in a different availability zone in the same region would be another option.

Azure has basic and standard load balancers. This link describes the differences between them. Essentially, basic load balancers are much less capable and can support only one availability set. Basic supports up to 300 VMs and standard up to 1000 VMs. Basic also lacks several other features. We'll be deploying standard load balancers for neonKUBE. There doesn't seem to be a real downside since this comes for free with modern VM types.

Each cluster will be deployed within a VNet configured using the network subnet defined by the cluster definition. Cluster node IPs will be assigned automatically, or the user can specify them explicitly. Cluster setup will deploy the network and configure the NICs on the nodes. Then we'll deploy a standard load balancer with a public IP address. We'll want a way for users to specify a specific IP address they've already acquired from Azure.
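A minimal sketch of the automatic IP assignment, assuming an example subnet and the usual Azure rule of skipping the first few addresses reserved by the platform:

```python
# Sketch of automatic node IP assignment within the cluster subnet.
# The subnet value is just an example; Azure reserves the first few
# addresses in each subnet (network, gateway, DNS) plus the broadcast
# address, so we skip those when handing out node IPs.
import ipaddress

subnet = ipaddress.ip_network("10.100.0.0/24")   # example cluster subnet
reserved = set(list(subnet)[:4]) | {subnet.broadcast_address}

node_names = ["manager-0", "manager-1", "manager-2", "worker-0", "worker-1"]
available = (ip for ip in subnet.hosts() if ip not in reserved)

assignments = {name: str(next(available)) for name in node_names}
print(assignments)   # e.g. {'manager-0': '10.100.0.4', ...}
```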

One or more cluster nodes must have Istio configured to route ingress network traffic and these will have the node.ingress label assigned. Cluster setup will configure the load balancer to forward any external ports to all of the nodes with the node.ingress label. neonKUBE will also reserve a range of external ports for its own use: ports 47000-47999. We'll use these to set up temporary SSH forwarding rules so cluster setup and other operations can establish SSH connections to specific nodes in the cluster. With 1000 ports available, we'd be able to manage a 1000-node cluster. Eventually, we may also use some of these ports for a VPN.
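To illustrate, here's a sketch of how the reserved range could map node indexes to load balancer frontend ports; the node-indexing scheme is an assumption, not the final design:

```python
# Sketch of per-node SSH forwarding rules using the reserved 47000-47999
# external port range.  Each frontend port forwards to port 22 on one node.
SSH_PORT_FIRST = 47000
SSH_PORT_LAST = 47999

def external_ssh_port(node_index: int) -> int:
    """Returns the load balancer frontend port that forwards to the node's port 22."""
    port = SSH_PORT_FIRST + node_index
    if port > SSH_PORT_LAST:
        raise ValueError("cluster exceeds the reserved SSH port range (1000 nodes)")
    return port

# e.g. node #3 would be reachable via <cluster-public-ip>:47003 -> node:22
print(external_ssh_port(3))
```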

We'll allow VMs with attached Azure drives or just the stock ephemeral drives. Back in the neonHIVE days, I allowed multiple drives to be mounted and we configured these as RAID during cluster setup via node scripts. That was important back then because 1TB was the maximum drive size. It's not such a big deal now, but I don't see any reason why we can't get that working again.

The only other thing we could do is explicitly handle Azure maintenance events so we could shut things down cleanly before Azure reboots the VM. Azure provides a special HTTP metadata endpoint that we'd poll; presumably we'd inform Kubernetes so that it would drain traffic from the node and then shut things down cleanly. It's very possible that Azure's Ubuntu VM image already handles this by initiating a clean shutdown.
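A minimal polling sketch against the scheduled events metadata endpoint described in the doc linked earlier; the api-version value and the drain handling are assumptions, not a worked-out implementation:

```python
# Sketch: poll the Azure Instance Metadata Service scheduled events endpoint
# and react before maintenance happens.  A real implementation would cordon
# and drain the Kubernetes node before the event's NotBefore time.
import time
import requests

ENDPOINT = "http://169.254.169.254/metadata/scheduledevents"
PARAMS = {"api-version": "2020-07-01"}   # assumed current api-version
HEADERS = {"Metadata": "true"}           # required by the metadata service

def poll_scheduled_events():
    response = requests.get(ENDPOINT, params=PARAMS, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.json().get("Events", [])

while True:
    for event in poll_scheduled_events():
        # EventType is something like "Reboot", "Redeploy", or "Freeze".
        print(f"scheduled event: {event.get('EventType')} at {event.get('NotBefore')}")
        # ...drain the node here before acknowledging the event...
    time.sleep(60)
```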

We'll use the standard Azure Ubuntu 20.04 VM image.

I think that's it: way simpler than the neonHIVE scheme where we needed two networks, two load balancers and multiple NICs on the managers.

jefflill commented 4 years ago

Here's another link discussing the MSFT Partner Program. We'd need to do this to be able to publish VM images.

https://docs.microsoft.com/en-us/azure/marketplace/marketplace-faq-publisher-guide

jefflill commented 4 years ago

Implemented for v2.2.0