jefflill closed this issue 4 years ago
Let's start by thinking about how we'll approach Azure before writing a bit of a specification. Here are some good links to get started with:
https://docs.microsoft.com/en-us/azure/availability-zones/az-overview
https://docs.microsoft.com/en-us/azure/virtual-machines/windows/manage-availability
https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-overview
https://docs.microsoft.com/en-us/azure/virtual-machines/linux/scheduled-events
https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-overview
Azure provides mechanisms for deploying multiple VMs such that the chance of all of them becoming unavailable at the same time will (hopefully) be small. VM availability can be impacted by local hardware failures in the physical rack hosting the VM, a failure of an entire Azure availability zone, a hardware failure of the machine hosting the VM, or a software update required by the hosting machine.
Production neonKUBE clusters will generally deploy 3 Kubernetes manager nodes. These nodes coordinate to form a highly reliable source of truth for the expected state of the services and pods running on the cluster. Managers rely on a RAFT-style leader election mechanism combined with mechanisms for reliably replicating state across the managers. RAFT requires a majority of the managers (2 out of 3 in this case) to be available for the cluster to operate properly. This means that cluster deployment must take care to ensure that only one manager at a time is at risk of failing or being updated and rebooted.
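The quorum arithmetic behind this is simple enough to sketch (this is just the standard majority-quorum rule, not neonKUBE code):

```python
def quorum(managers: int) -> int:
    """Minimum number of managers that must be available (a strict majority)."""
    return managers // 2 + 1

def tolerated_failures(managers: int) -> int:
    """How many managers can fail while the cluster keeps operating."""
    return managers - quorum(managers)

# A 3-manager cluster needs 2 managers available and tolerates 1 failure;
# a 5-manager cluster needs 3 available and tolerates 2 failures.
```

This is why 3 and 5 manager deployments are the interesting cases: even counts add cost without improving fault tolerance.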
Most neonKUBE operators will also need this sort of resilience. For example, they may wish to deploy multiple web servers such that some will always be available in the face of failures, or they may be deploying a multi-node database cluster where data is replicated to prevent a single point of failure.
Azure has several related concepts:
Region: A collection of datacenters sited close together with dedicated low-latency networking. Each datacenter is equipped with isolated power, cooling, and networking.
Availability Zone: A unique set of one or more datacenters in a region where customer resources can be hosted such that they are isolated from failures of other availability zones in the region. Microsoft claims a 99.99% SLA for resources properly distributed across zones because they can tolerate the loss of an entire datacenter.
Availability Set: This is a slightly less robust mechanism (as compared to availability zones) for managing reliability. In this case, your VMs are deployed to a specific datacenter in a region, and you place related VMs into an availability set to limit the damage from any one hardware failure. A simplistic way of thinking about this is that a hardware rack kind of maps to a fault domain: Azure ensures that the VMs in the same availability set are spread across multiple racks, so no single rack failure can take them all down. This means that when one rack fails due to a local problem, the VMs on the other racks are still very likely to continue working.
Availability sets live within a single availability zone. Think of a zone as being a datacenter and an availability set as being a group of racks in that datacenter.
Microsoft claims a 99.95% SLA for this approach. This is a bit less than for availability zones because you'll lose all of your VMs if the zone's datacenter fails.
Fault & Update Domain: These are concepts Azure uses to decide where to provision VMs to isolate them from infrastructure failures, as well as to control when VMs may be updated or relocated. These are both availability set concepts. Think of a fault domain as a rack and an update domain as an operational grouping for Azure maintenance.
Any specific availability set will include 3 fault domains and 5 (expandable up to 20) update domains. Azure automatically assigns VMs to fault and update domains round-robin as the VMs are created. So if we deployed a 5-node cluster with the three managers deployed first, each manager would end up in its own fault domain, with the two worker nodes landing in the same fault domains as the first two managers. Because there are 5 update domains, every node will be in its own update domain.
When updates are required, Azure is free to update and reboot all of the VMs within any single update domain at the same time, but it will only update one domain at a time. Azure will also wait at least 30 minutes before moving on to the next update domain to allow services to recover. For the 5-VM cluster mentioned above, say Azure needs to update all of the VMs: since every node is in its own update domain, Azure will update and reboot the nodes one at a time, waiting at least 30 minutes between them.
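The round-robin placement described above can be sketched as follows (a simplified model assuming the default 3 fault domains and 5 update domains, not real Azure behavior):

```python
# Simplified model of how Azure round-robin assigns VMs in an availability
# set to fault and update domains (3 fault / 5 update domains assumed).
FAULT_DOMAINS = 3
UPDATE_DOMAINS = 5

def assign_domains(nodes):
    """Assign each node (in creation order) to a fault and update domain."""
    return {
        name: {"fault_domain": i % FAULT_DOMAINS,
               "update_domain": i % UPDATE_DOMAINS}
        for i, name in enumerate(nodes)
    }

# 5-node cluster with the managers created first:
placement = assign_domains(
    ["manager-0", "manager-1", "manager-2", "worker-0", "worker-1"])
# The managers land in fault domains 0..2; the workers share fault
# domains 0 and 1 with the first two managers, but every node gets its
# own update domain, so Azure reboots them one at a time.
```

This is why creation order matters: provisioning the managers first guarantees they end up in three distinct fault domains.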
Since this is the first cloud deployment for neonKUBE, let's stop and contrast the terminology used by the three major providers so we can choose what terminology we'll expose for neonKUBE. The hope is that we can standardize on a deployment model that can work on all clouds out-of-the-box.
Here's an article contrasting Azure and AWS: link
neonKUBE | Azure | AWS | Google |
---|---|---|---|
region | region | region | region |
zone | availability zone | availability zone | zone |
availability group | availability set | placement groups | placement policy |
-na- | fault/update domains | partitions | -na- |
VNet | VNet (virtual network) | VPC (virtual private cloud) | VPC (virtual private cloud) |
The region and zone concepts line up pretty well across Azure, AWS, and Google Cloud. Regions map to a collection of datacenters that are close together physically as well as from a network perspective, with each datacenter having separate power, cooling, and ingress networking for fault tolerance. The zone concepts for Azure and AWS align very closely, essentially identifying the separate datacenters in a region.
This is a bit different for Google: they do name their zones, but they abstract things such that nodes deployed by different users may actually end up in different datacenters. Google does this so they can balance load across their infrastructure better and to help avoid having a popular zone fill up and prevent new VMs from deploying. This is basically just an implementation detail though; it doesn't really impact neonKUBE.
The three clouds provide somewhat different VM placement options to help avoid single hardware points of failure as well as to optimize network throughput and latency.
Azure has the concept of availability sets, where operators can create an arbitrary number of these sets and assign VMs to each. Azure will ensure that the VMs in each availability set are provisioned across distinct groups of physical hardware called fault domains. The number of fault domains is fixed at 3. New VMs added to an availability set are assigned to fault domains in a round-robin manner. Fault domains help isolate hardware failures.
Azure also explicitly describes update domains. These are used to influence how Azure performs maintenance operations on VMs. The number of update domains available is assigned to your Azure account: the minimum is 5 domains, and you can ask for your account to be upgraded to as many as 20. VMs added to an availability set are assigned to update domains round-robin as well. Azure guarantees that it will perform maintenance operations on only one update domain at a time and that it will wait 30 minutes between domains to allow things to stabilize.
AWS is a bit different. Their concept is placement groups. Whereas Azure availability sets are really just for managing faults and updates, AWS placement groups combine that with tuning network performance by influencing where VMs end up being physically provisioned.
cluster placement groups: Deploys a set of instances within a single availability zone without any fault tolerance guarantees, but this ensures that the nodes will have a low-latency 10 Gbps network between them. They recommend that all VMs be provisioned in the placement group at once and also that they all have the same instance type to avoid failures due to lack of resources.
partition placement groups: This is kind of the inverse of Azure availability sets. It looks like you can define placement groups and then start VMs in each group. Placement groups can be defined to have up to 7 partitions (is this user defined?), with each partition potentially mapping to a rack or other single point of failure. AWS will automatically assign VMs to partitions, but they don't explicitly say this is round-robin. They do allow you to explicitly add VMs to specific partitions (which is cool).
spread placement groups: This just tries to spread VMs out across the cluster as much as possible and this can happen across availability zones. I don't think this makes sense for neonKUBE.
Note that AWS does have restrictions related to placement policies. For example, some instance types are not compatible.
Google has two types of placement policy: spread and compact. Spread policies ensure that each VM runs on a different host and will not be subject to the same power or network problems. Compact tries to locate VMs as close together as possible, and it looks like they may end up on the same host machines. A maximum of 8 VMs can be assigned to a spread policy and 22 VMs to a compact policy. Note that only a limited number of instance types are supported and live migration is not supported for compact. This probably isn't going to work for us.
It doesn't look like either AWS or Google make any guarantees about how VM maintenance is performed like Azure does for update domains. All three of the platforms try hard to live migrate VMs away from hardware they're going to work on, so I don't think we need to worry about this for the time being.
So the question is how to model this at the neonKUBE level. I really don't want to expose these cloud-specific concepts in the cluster definition because they're pretty arcane and different. But we need to solve these problems:
Managers must be automatically deployed to optimize fault tolerance.
User-defined special nodes (e.g. those forming a database cluster), where the user needs more control to ensure that replicated data nodes don't end up in the same fault domain.
I'd like the user to be able to define availability groups for their cluster and then assign workers to these groups, the idea being that all of the nodes in a group would be spread out across several fault domains. For example, say a user had a database cluster replicating sharded data across nodes:
shard | replica nodes |
---|---|
shard0 | node-shard0-copy0, node-shard0-copy1 |
shard1 | node-shard1-copy0, node-shard1-copy1 |
shard2 | node-shard2-copy0, node-shard2-copy1 |
So we have a database split into three shards, with each shard having copies on two nodes. For better availability, the user would create three availability groups, one per shard, and assign the two nodes hosting each shard's data to the corresponding neonKUBE availability group. This maps well to Azure and AWS but not to Google, where we just don't seem to have enough control.
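The shard-to-group mapping above can be sketched like this (names and the `availability_groups` helper are illustrative only, not a real neonKUBE API):

```python
# Hypothetical sketch: one neonKUBE availability group per shard, so the
# two replicas of each shard end up on separate hardware failure units.
shards = {
    "shard0": ["node-shard0-copy0", "node-shard0-copy1"],
    "shard1": ["node-shard1-copy0", "node-shard1-copy1"],
    "shard2": ["node-shard2-copy0", "node-shard2-copy1"],
}

def availability_groups(shards):
    """Map each shard to an availability group holding its replica nodes."""
    return {f"group-{shard}": nodes for shard, nodes in shards.items()}

groups = availability_groups(shards)
# Each group would map to an Azure availability set (or an AWS partition
# placement group), which spreads its members across fault domains or
# partitions, so no single hardware failure takes out both replicas.
```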
Here's how this would work for each cloud platform:
Azure: We'd create a separate availability set for the managers. This will work great for up to three managers because the availability set includes 3 fault domains and 5 update domains. This means that all managers will be on different hardware and Azure will update the managers one at a time, so we'll always have a quorum. This will also work fine for 5-manager clusters: every manager gets its own update domain, and although there are only 3 fault domains, meaning that two pairs of managers will share a fault domain, losing any single fault domain still leaves a quorum of 3 out of the 5 managers. So life will be good.
For worker nodes, we'd create an availability set for each neonKUBE availability group defined by the cluster and we'd then assign group nodes to the availability set.
AWS: We'd deploy managers to a dedicated partition group with 5 partitions and then spread the managers across these partitions.
For worker nodes, we'd create a partition group for each neonKUBE availability group with 5 partitions and spread the nodes across these.
Google: I don't think we'll be able to support neonKUBE availability groups here. The placement policies are just too restrictive and don't really do what we want. I was thinking we could at least use a spread policy for the managers, but even that would restrict the instance types allowed.
We'll just deploy VMs without any restrictions and hope the cloud is reliable enough.
For the first cut at deploying neonKUBE to Azure, we're going to make a simplifying assumption:
The 99.95% SLA for availability sets is good enough for any single cluster. Even though the 99.99% SLA for availability zones is a bit better, we believe it's not worth the extra complexity, and we expect that most operators aspiring to even better uptime would be better off deploying a second cluster in a different region. Deploying another cluster in a different availability zone in the same region would be another option.
Azure has basic and standard load balancers. This link describes the differences between them. Essentially, basic load balancers are much less capable and can support only one availability set. Basic supports up to 300 VMs and standard up to 1000 VMs. Basic also lacks several other features. We'll be deploying standard load balancers for neonKUBE; there doesn't seem to be a real downside since this comes for free with modern VM types.
Each cluster will be deployed within a VNet configured using the network subnet defined by the cluster definition. Cluster node IPs will be assigned automatically or the user can specify these as well. Cluster setup will deploy the network and configure the NICs on the nodes. Then we'll deploy a standard load balancer with a public IP address. We'll want a way for users to specify a specific IP address they've already acquired from Azure.
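Automatic node IP assignment within the VNet subnet could work something like this sketch (using Python's standard `ipaddress` module; the assumption that Azure reserves the first few subnet addresses should be verified against the Azure VNet docs):

```python
import ipaddress

def assign_node_ips(subnet: str, nodes: list[str]) -> dict[str, str]:
    """Assign sequential addresses from the cluster subnet to nodes.

    Azure reserves the first few addresses in each subnet (gateway, DNS),
    so we skip x.x.x.1 through x.x.x.3 (assumption to verify).
    """
    net = ipaddress.ip_network(subnet)
    hosts = list(net.hosts())[3:]  # skip the Azure-reserved addresses
    if len(nodes) > len(hosts):
        raise ValueError("subnet too small for cluster")
    return {node: str(ip) for node, ip in zip(nodes, hosts)}

ips = assign_node_ips("10.100.0.0/24", ["manager-0", "manager-1", "worker-0"])
# First usable address is 10.100.0.4, assigned in node order.
```

User-specified IPs would simply override entries in this map after validating they fall inside the subnet and outside the reserved range.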
One or more cluster nodes must have Istio configured to route ingress network traffic, and these will have the node.ingress label assigned. Cluster setup will configure the load balancer to forward any external ports to all of the nodes with the node.ingress label. neonKUBE will also reserve a range of external ports for its own use: we'll reserve ports 47000-47999. We'll use these to set up temporary SSH forwarding rules so cluster setup and other operations can establish SSH connections to specific nodes in the cluster. With 1000 ports available, we'd be able to manage a 1000-node cluster. Eventually, we may also use some of these ports for a VPN.
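The reserved port scheme amounts to a one-to-one mapping from node index to external load balancer port (a sketch of the idea described above, not actual neonKUBE code):

```python
# Reserved external port range for temporary SSH forwarding rules.
SSH_PORT_FIRST = 47000
SSH_PORT_LAST = 47999

def ssh_forward_port(node_index: int) -> int:
    """External load balancer port forwarding to the node's SSH port (22)."""
    port = SSH_PORT_FIRST + node_index
    if not SSH_PORT_FIRST <= port <= SSH_PORT_LAST:
        raise ValueError("cluster exceeds the reserved 1000-port range")
    return port

# node 0 -> 47000, node 999 -> 47999; a 1001st node would be rejected.
```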
We'll allow VMs with attached Azure drives or just the stock ephemeral drives. Back in the neonHIVE days, I allowed multiple drives to be mounted, and we configured these as RAID during cluster setup via node scripts. That was important back then because 1 TB was the maximum drive size. It's not such a big deal now, but I don't see any reason why we can't get that working again.
The only other thing we could do is explicitly handle Azure maintenance events so we could shut things down cleanly before Azure reboots the VM. Azure provides a special HTTP metadata endpoint that we'd poll, and presumably we'd inform Kubernetes so that it would drain traffic from the node and then shut things down cleanly. It's very possible that Azure's Ubuntu VM image already handles this by initiating a clean shutdown.
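Handling those maintenance events might look like the sketch below. The metadata endpoint and payload shape follow the Azure Scheduled Events docs linked earlier, but the exact fields and API version should be treated as assumptions to verify:

```python
import json

# Azure Scheduled Events metadata endpoint (per the scheduled-events docs;
# verify the api-version before relying on this).
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
)

def impactful_events(payload: str, node: str) -> list[dict]:
    """Return events that would reboot or remove this node, so we can drain it."""
    doc = json.loads(payload)
    return [
        e for e in doc.get("Events", [])
        if node in e.get("Resources", [])
        and e.get("EventType") in ("Reboot", "Redeploy", "Terminate", "Preempt")
    ]

# Sample payload in the documented shape (illustrative values):
sample = """{
  "DocumentIncarnation": 1,
  "Events": [{
    "EventId": "602d9444-d2cd-49c7-8624-8643e7171297",
    "EventType": "Reboot",
    "ResourceType": "VirtualMachine",
    "Resources": ["node-0"],
    "EventStatus": "Scheduled",
    "NotBefore": "Mon, 19 Sep 2016 18:29:47 GMT"
  }]
}"""
```

A daemon on each node would poll the endpoint (with a `Metadata: true` header), call something like `kubectl drain` when an impactful event appears, and then acknowledge the event so Azure can proceed.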
We'll use the standard Azure Ubuntu 20.04 VM image.
I think that's it: way simpler than the neonHIVE scheme where we needed two networks, two load balancers and multiple NICs on the managers.
Here's another link discussing the MSFT Partner Program. We'd need to do this to be able to publish VM images.
https://docs.microsoft.com/en-us/azure/marketplace/marketplace-faq-publisher-guide
Implemented for v2.2.0
Implement neonKUBE setup for Azure:
- [ ] Provision `AzureHostingManager`
- [ ] Improved verification
- [x] The Canonical Ubuntu 20.04 images don't appear to include the Azure Agent: https://docs.microsoft.com/en-us/azure/virtual-machines/extensions/agent-linux#installation
JEFF UPDATE: Never mind, the agent is running, it just takes a while for that to be reported on the Azure portal.
Here's some info on using images: https://docs.microsoft.com/en-us/cli/azure/vm/image?view=azure-cli-latest
More info on accepting an image offer: https://docs.microsoft.com/en-us/cli/azure/vm/image/terms?view=azure-cli-latest
`/dev/sda` and also `/dev/sdc`. The link below discusses this: https://docs.microsoft.com/en-us/azure/virtual-machines/troubleshooting/troubleshoot-device-names-problems