kubernetes-retired / kube-aws

[EOL] A command-line tool to declaratively manage Kubernetes clusters on AWS
Apache License 2.0

Destroy does not clean up everything #59

Closed aholbreich closed 7 years ago

aholbreich commented 7 years ago

It looks like on destroy not all resources are cleaned up. What I've seen left over is:


IAM > Roles > kubernetes-master
IAM > Roles > kubernetes-minion

and I believe security groups too. But they have been deleted in the meantime, so I cannot provide names.

mumoshu commented 7 years ago

Hi @aholbreich! Excuse me, but the roles you've mentioned don't seem to be ones created by kube-aws.

Have you tried creating k8s clusters with kube-up.sh before? kube-up.sh seems to create roles like those. https://github.com/kubernetes/kubernetes/tree/master/cluster/aws/templates/iam

mumoshu commented 7 years ago

I'm closing this issue for the reason I've explained above, but please feel free to reopen if that's not correct.

aholbreich commented 7 years ago

Hmm... Maybe you're right... I've tried supergiant.io as well. But anyway, the CloudFormation resource was not deleted and it could not delete the VPC for some reason...

mumoshu commented 7 years ago

Hi @aholbreich! If you encounter that next time, would you mind sharing the error messages coming from CloudFormation? I guess those can be seen in the stack events.
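If it helps, here's a rough sketch of how those failures could be pulled out with the AWS CLI (my-kube-aws-cluster is just a placeholder for your cluster/stack name):

    # list resources whose deletion failed, together with the reason CloudFormation reported
    aws cloudformation describe-stack-events --stack-name my-kube-aws-cluster \
      --query 'StackEvents[?ResourceStatus==`DELETE_FAILED`].[LogicalResourceId,ResourceStatusReason]' \
      --output table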

Also note that if you've modified resources created by kube-aws's CloudFormation stack by hand, or with another script unrelated to kube-aws, resource deletion could have failed because of that.

Examples:

aholbreich commented 7 years ago

@mumoshu will do. But for the moment I have no plans to use it again. Looking forward to better documentation and more configuration flexibility.

mumoshu commented 7 years ago

@aholbreich Thanks for the feedback! Any specific feature/documentation request about that would be welcomed 👍

The point is, I'm not quite sure what people don't know.

If you've seen which company I belong to, you'll probably notice I'm just a super-active kube-aws user (just a user like you!) and a primary maintainer, not CoreOS staff.

I'm personally familiar with the well-known gotchas related to AWS, CloudFormation, etc., therefore I don't feel the inconvenience of the kube-aws documentation myself, i.e. I don't know enough about our users. So GitHub issues with specific feature and/or documentation requests are welcome, to allow me to learn more about our users :bow:

aholbreich commented 7 years ago

@mumoshu I'm not experienced with CloudFormation at all so far. But we have been using AWS for a couple of months. Unfortunately with ECS so far... I'm also not an expert in Kubernetes or CoreOS, and at the moment I want to avoid having to learn all the installation details of those tools - that's my reason for using kube-aws.

We need a well-understood, self-describing and reasonably well-documented starting point for Kubernetes, not only for me but also for the "duty guys" who may have to scale out a cluster even after I've moved on to the next project. I also don't want to go into too much detail at the intermediate level.

The perfect CLI in my eyes is capable of:

  • Provisioning a new cluster with all the parameters I've defined
  • letting everything I haven't specified fall back to well-documented defaults.
  • Allowing me to reuse all/important custom predefined AWS resources
  • Allowing management of several clusters in a convenient way
  • Being able to query the status of my clusters and list the resources that are used.
  • Giving me controlled verbosity; warning and guiding me, with documentation built in. Self-explanatory.

I'm not sure if those are the goals of this project, and I'm not sure if it's possible with CloudFormation at all... But you've asked ;))

Maybe this helps you too: Feedback to kube-aws

pieterlange commented 7 years ago

Hi @aholbreich, thanks for your detailed feedback!

I agree there are still some real UX issues with the project as-is. We need to strike a balance between usability, maintainability and the need for people to make their own custom adjustments to the deployment. As far as I understand your comments here, I think the goals are aligned. One of the difficulties is that there are a lot of components to this system that take a while to grasp, especially for users who are new to both AWS and Kubernetes (me, a year ago :laughing:). I don't know how to make this easier for users except for really "one size fits all" defaults and maybe a few different walkthroughs for deployment scenarios.

The truth is most of the Kubernetes community itself is still learning how to properly deploy and maintain these complex systems. Things are improving though, and I'm really looking forward to cool stuff on the horizon (self-hosted Kubernetes clusters, etcd operators, better UX). Feedback like this helps. Writing articles helps. Reporting bugs helps! So, thanks! :+1:

As for your comments in the article:

Didn't use my existing VPC. (Should be possible by now, but I didn't find out how.)

This is possible now with the vpcId parameter: https://github.com/coreos/kube-aws/blob/master/config/templates/cluster.yaml#L107

I miss a list of the resources being created by CloudFormation. It would be nice to see them all as a list.

I don't like reinventing the wheel. You can use the AWS console or the CLI command aws cloudformation describe-stack-resources --stack-name {{YourClusterName}} to list all CloudFormation resources.
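For instance, something like this prints a compact table of everything in the stack ({{YourClusterName}} is a placeholder for your actual cluster name, as above):

    # show the type, logical ID and physical ID of every resource in the kube-aws stack
    aws cloudformation describe-stack-resources --stack-name {{YourClusterName}} \
      --query 'StackResources[].[ResourceType,LogicalResourceId,PhysicalResourceId]' \
      --output table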

Documentation is not at the level I'd expect it to be for production readiness. It is kind of a walkthrough only. I didn't find the supported configuration options.

The configuration options are documented inside the configuration file itself (as is standard for most tools). I can understand that some options seem a bit daunting, though I'm not sure how to fix that.

Didn't see any advice on what to do in typical use cases: scale up/down, emergencies, typical trouble...

This is true. I think articles from users could really help here. We could manage a list of articles written by users so they're easier to find?

The feature set is kind of limited so far. E.g. Auto Scaling is not used?

All nodes except the etcd cluster are deployed using (fixed) autoscaling groups. Scaling the cluster is a manual task currently (set workerCount in cluster.yaml), but once we have node pools we can use the Kubernetes autoscaler, which automatically grows and shrinks the cluster on demand.
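As a rough sketch of that manual step (assuming you run it from the directory holding your cluster assets and your kube-aws version has the update subcommand):

    # raise the desired number of workers in cluster.yaml, e.g. from 1 to 3
    sed -i 's/^workerCount: .*/workerCount: 3/' cluster.yaml

    # push the changed parameters to the existing CloudFormation stack
    kube-aws update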

While destroying the cluster, some resources could not be deleted, including security groups and the VPC.

This only happens when you have manually referenced the security groups in other security groups, and it is a CloudFormation limitation. The other option is to forcibly remove the resources, but I think it's better to fail safely in these instances.
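When it does fail, a sketch like the following can help find the offending references (sg-0123456789abcdef0 is a placeholder for the group that refused to delete):

    # list every security group whose ingress rules reference the stuck group
    aws ec2 describe-security-groups \
      --filters Name=ip-permission.group-id,Values=sg-0123456789abcdef0 \
      --query 'SecurityGroups[].[GroupId,GroupName]' --output table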

Hard to reuse pre-configured resources, not possible or not documented.

Can you be more specific here? What kind of resources would you like to have reused?

mumoshu commented 7 years ago

Hey @aholbreich, I greatly appreciate your detailed feedback, including the nice blog post! In particular, the well-written blog post will really help newcomers!

We need a well-understood, self-describing and reasonably well-documented starting point for Kubernetes, not only for me but also for the "duty guys" who may have to scale out a cluster even after I've moved on to the next project.

That's exactly what I need, too. I wish I could work on https://github.com/coreos/kube-aws/issues/61 sooner.

I also don't want to go into too much detail at the intermediate level.

I understand we should eventually pave the way to that.

The perfect CLI in my eyes is capable of:

  • Provisioning a new cluster with all the parameters I've defined
  • letting everything I haven't specified fall back to well-documented defaults.
  • Allowing me to reuse all/important custom predefined AWS resources
  • Allowing management of several clusters in a convenient way
  • Being able to query the status of my clusters and list the resources that are used.
  • Giving me controlled verbosity; warning and guiding me, with documentation built in. Self-explanatory.

Hey, all of those are what I'd personally like to have in kube-aws 😆

I'm not sure if those are the goals of this project, and I'm not sure if it's possible with CloudFormation at all...

It's OK because, technically, all of those things are achievable 👍

But you've asked ;))

Yes. I appreciate this kind of feedback to shape our long-term goals and plans correctly.

Our current status and the actions that could be taken immediately are as @pieterlange kindly explained in the comment above.

I'm also wondering if we could add something like a GOALS.md with the "The perfect CLI in my eyes is capable of:" list written into it?

Anyways, thank you again for your great feedback @aholbreich :bow:

aholbreich commented 7 years ago

@pieterlange @mumoshu glad you liked my feedback and that you are dealing with it professionally, even if parts of my feedback are not that accurate. Also sorry for not answering for quite a long time.

Thank you both for clarifying some details; I'll come back to this if I get the chance to spend more time on kube-aws again. At the moment I'm working on Ansible-based provisioning, but maybe I'll have to go back and squeeze everything out of CloudFormation, so I can contribute here with better quality.

cmcconnell1 commented 7 years ago

I've been watching the email threads and was planning to try to summarize everything I've needed to do to get to the point where we are comfortable with kube-aws working in EC2 (at least in DEV at the moment).

First of all, thanks again to @mumoshu for his and fellow contributors' feedback. They were very helpful in getting us going with kube-aws.

There are a lot of disparate docs, issues, etc. out there, but it's sometimes hard to put them all together. A limiting factor when working with kube-aws for me was the lack of an IRC presence that I'm aware of. For example, if you run into problems with either deis or helm, I can jump on their respective IRC/Slack channels, and within a few minutes I can get a developer providing crucial feedback, workarounds, known bugs/issues, etc. I have tried to find folks on the kubernetes IRC channels (kubernetes-users, sig-aws, etc.) who have experience/knowledge of kube-aws, kops, etc., but unfortunately, as soon as I mention kube-aws, kops, etc., I'll get crickets or "I don't know anything about those. . ." It would be great to have a dedicated IRC channel for kube-aws where everyone from experts to noobs could lurk about.

Regarding cloud ENVs, the important thing to keep in mind is that every user's cloud provider, network, security groups, and ACLs will be different, and if you think about it, how would you handle all the testing, use cases, etc.? For me/us, kube-aws is the best I found, and this was after evaluating/testing Terraform, Cloudify, kube-up, kops, etc.

There are many complicated components and requirements that are involved with any tool/framework, but if you think about what you get with kube-aws, and after looking through the 1,000+ line long CloudFormation stack templates, you get a hint of and can appreciate what these developers are having to wrangle and maintain.

With that said, what I found was that with the previous RC.4 release, kube-aws did not seem to have issues with my integration of pre-existing security groups, etc. and was able to successfully destroy the stack. That changed for me when we went from kube-aws-v0.9.1-rc.4 to kube-aws-v0.9.1. For now I've just been using the CloudFormation API and manually deleting the stack (after kube-aws destroy fails to clear what I think were the kube policies and roles, and perhaps load balancer components). This is an extra step but only takes a few minutes at most.
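For anyone else in the same spot, that manual cleanup is essentially just the following (deis-kube1 is my cluster/stack name from the cluster.yaml below; any stuck resources the stack events point at still have to be removed by hand):

    # delete the stack directly via CloudFormation and wait until it is gone
    aws cloudformation delete-stack --stack-name deis-kube1
    aws cloudformation wait stack-delete-complete --stack-name deis-kube1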

Off the top of my head (there are more details, but here is a quick summary of what I needed to do in order to get kube-aws working in my cloud ENV) post kube-aws init (note: YMMV! This works for me in my ENV; your ENVs will be different): modify cluster.yaml to support the desired features, including an existing VPC, DNS, tags, CIDR ranges, etc. Note that this is all documented here: https://coreos.com/kubernetes/docs/latest/kubernetes-on-aws-render.html#customizations-to-your-cluster The below works for me; try to modify it as needed for your environment (MODIFY TO SUIT YOUR CIDR RANGE, vpcId, route table, etc.; see your cloud/network admin). Examples like this would have been very helpful for me when trying to figure all this out:

clusterName: deis-kube1
externalDNSName: deis-kube1.dev.foo
releaseChannel: stable
createRecordSet: true
recordSetTTL: 300
hostedZoneId: "XASDASDASD" # DEV private only route53 zone
keyName: my-ssh-keypair
region: us-west-1
availabilityZone: us-west-1a
kmsKeyArn: "arn:aws:kms:us-west-1:0w123456789:key/d345fcd1-c77c-4fca-acdc-asdasdf3234232"
controllerCount: 1
controllerInstanceType: m3.medium
controllerRootVolumeSize: 30
controllerRootVolumeType: gp2
workerCount: 1
workerInstanceType: m3.medium
workerRootVolumeSize: 30
workerRootVolumeType: gp2
etcdCount: 1
etcdInstanceType: m3.medium
etcdRootVolumeSize: 30
etcdDataVolumeSize: 30
vpcId: vpc-xxxccc45
routeTableId: "rtb-aaabbb12" # main external no NAT uses internet gateway
vpcCIDR: "10.1.0.0/16"
instanceCIDR: "10.1.10.0/24"
serviceCIDR: "10.3.0.0/24"
podCIDR: "10.2.0.0/16"
dnsServiceIP: 10.3.0.10
stackTags:
  Name: "deis-kube1"
  ENV: "DEV"

There were also required hacks to userdata/cloud-config/etcd to force the usage of AWS DNS (not internal), which look something like this (replace '%H' with your etcd master's hostname/IP that you get from your cluster.yaml). There are issues on GitHub which document this, and again thanks to @mumoshu for helping to direct me to them:

            Environment=ETCD_NAME=ip-10-1-10-5.us-west-1.compute.internal
            Environment=ETCD_LISTEN_CLIENT_URLS=https://ip-10-1-10-5.us-west-1.compute.internal:2379
            Environment=ETCD_ADVERTISE_CLIENT_URLS=https://ip-10-1-10-5.us-west-1.compute.internal:2379
            Environment=ETCD_LISTEN_PEER_URLS=https://ip-10-1-10-5.us-west-1.compute.internal:2380
            Environment=ETCD_INITIAL_ADVERTISE_PEER_URLS=https://ip-10-1-10-5.us-west-1.compute.internal:2380
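If it helps, the per-node hostnames/IPs above can be looked up via the EC2 API; a sketch, assuming your instances carry the KubernetesCluster tag (as in the stack-template snippet further down) and using my cluster name:

    # list private IPs and the matching AWS-provided private DNS names for the cluster's instances
    aws ec2 describe-instances \
      --filters "Name=tag:KubernetesCluster,Values=deis-kube1" \
      --query 'Reservations[].Instances[].[PrivateIpAddress,PrivateDnsName]' \
      --output table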

I also needed to modify my stack-template.json file to allow access to port 443 for each Kubernetes security group, otherwise my kube cluster's etcd nodes, etc. could never obtain the required external resources. Again, YMMV; this is what worked for me. Also, AFAIR, EC2 SGs are stateful, so if you allow something in it will be allowed out and vice versa, but I have it specified explicitly just so I can see it and keep it straight in my head and in our SCM (git):

    "SecurityGroupTdKubernetes" : {
        "Type" : "AWS::EC2::SecurityGroup",
        "Properties" : {
            "GroupDescription" : "Required Kubernetes Access",
            "VpcId" : { "Ref" : "VpcId" },
            "SecurityGroupEgress": [
              {
                "CidrIp": "0.0.0.0/0",
                "FromPort": 443,
                "IpProtocol": "tcp",
                "ToPort": 443
              }
            ],
            "SecurityGroupIngress" : [
              {
                "CidrIp": "10.1.10.0/24",
                "FromPort": 443,
                "IpProtocol": "tcp",
                "ToPort": 443
              }
         ],
        "Tags": [
          {
            "Key": "Name",
            "Value": "{{$.ClusterName}}-sg-tdkubernetes"
          },
          {
            "Key": "KubernetesCluster",
            "Value": "{{.ClusterName}}"
          }
        ],
        "VpcId": {{.VPCRef}}
      },
      "Type": "AWS::EC2::SecurityGroup"
    },

And then I referenced my custom SG in each of the above-mentioned kube-aws security groups like this:

          {
            "SourceSecurityGroupId" : { "Ref" : "SecurityGroupTdKubernetes" },
            "FromPort": 443,
            "IpProtocol": "tcp",
            "ToPort": 443
          }

There are additional modifications I need to make for other features, but hopefully this might help you or others get up and working with kube-aws within your EC2 VPC ENV. I had issues when trying to get my clusters working inside an internal subnet with NAT, so for me, I am using kube-aws on an external subnet with direct access to the internet, with SG modifications to only allow ssh, etc. from our internal CIDR range. This works well enough for us for now and we can refine as we go. Hope this helps others.
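As an illustration of the kind of SG tweak I mean (the group ID is a placeholder, and the CIDR here is just our VPC range from cluster.yaml), restricting ssh looks roughly like:

    # allow ssh into the node security group only from the internal CIDR range
    aws ec2 authorize-security-group-ingress \
      --group-id sg-0123456789abcdef0 \
      --protocol tcp --port 22 --cidr 10.1.0.0/16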

cmcconnell1 commented 7 years ago

Just an update/note in case anyone else runs across similar issues: the customized cloud-init hacks I did above (specifically userdata/cloud-config/etcd) obviously won't work for multi-etcd/master infrastructure environments; the issue noted below documents what we needed to do for those deployments.

Unfortunately, with multiple etcd masters and multiple AZs, we're now having problems deploying into an AWS VPC that has a custom DHCP option set. The symptom we're seeing is that the kube-aws-provided hostnames are not resolvable in DNS, which creates Kubernetes (etcd) deployment problems, etc., as documented in the issue with the workaround: https://github.com/coreos/kube-aws/issues/189.
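A quick way to check whether a node is affected (run on the instance itself; 169.254.169.254 is the standard EC2 metadata endpoint) is something like:

    # the private hostname EC2 hands out for this instance...
    curl -s http://169.254.169.254/latest/meta-data/local-hostname
    # ...versus what actually resolves under the VPC's custom DHCP option set
    nslookup "$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)"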

cknowles commented 7 years ago

Just wanted to add something related to this point - "While destroying the cluster, some resources could not be deleted, including security groups and the VPC." I've often found that if the k8s cluster has LoadBalancer services deployed, it fails to delete, even via the AWS console/CLI. That's due to the ELBs attaching themselves to the external network interfaces; they cannot be removed automatically as they are not part of the stack created by kube-aws. One solution may be to delete all k8s namespaces prior to CF stack deletion, although I'm not so sure how we'd listen for a completion signal. A manual sweep like the sketch below works in the meantime.
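Until there is something automatic, a manual pre-destroy sweep could look roughly like this (a sketch only; which services/namespaces to delete obviously depends on what's deployed):

    # list every Service of type LoadBalancer so the backing ELBs get released first
    kubectl get svc --all-namespaces \
      -o jsonpath='{range .items[?(@.spec.type=="LoadBalancer")]}{.metadata.namespace}{"/"}{.metadata.name}{"\n"}{end}'

    # delete those services (or their namespaces), wait for the ELBs to disappear, then
    kube-aws destroy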