
Provide (optional) ability to use Kubernetes as the runtime engine instead of Docker #1815

Closed JeanMertz closed 5 years ago

JeanMertz commented 8 years ago

I was wondering, has there been a consideration to move drone up the stack, and have it rely on Kubernetes features to function?

It could potentially ease the burden of a lot of things Drone currently has to manage on top of Docker, and with minikube, running drone locally could become as simple as minikube create && drone deploy.

I know this is an extreme oversimplification of things, and it would obviously mean giving up some freedom (dependent on a container scheduler, instead of only Docker), but there are obviously also a lot of upsides to this.

bradrydzewski commented 8 years ago

@JeanMertz right now there are no plans for this sort of deep integration with kubernetes. I'm certainly keeping an eye on the project and where it goes in the future, but I'm not sure this would be the right decision for drone at this time.

bradrydzewski commented 8 years ago

I should point out however, that the engine used to run builds (docker) is exposed as an interface so in theory it could be swapped with a different implementation. This is the interface that is defined for running builds: https://github.com/drone/drone/blob/master/build/engine.go

And this is the docker implementation: https://github.com/drone/drone/tree/master/build/docker

I would certainly encourage a community effort to create a kubernetes implementation. I know @tboerger expressed some interest here. I've discussed with @gtaylor as well. Bottom line is I think making drone kubernetes-only would not be a wise decision for the project, but supporting multiple runtime engines is certainly of interest.

This sort of thing, of course, depends on community engagement since I cannot volunteer to take on all these tasks. So while it isn't something I would work on, I would certainly make myself available to provide technical guidance to individuals interested in contributing an implementation.

JeanMertz commented 8 years ago

That engine abstraction looks interesting.

Kubernetes has a very good first-class Golang API, and the feature set required here (Start/Stop/Remove/Wait/Logs) seems really limited, so it wouldn't be too hard to implement that on top of Kubernetes.

Maybe I'll give it a stab some time in the near future, if I ever manage to put more than 24 hours in a day.

bradrydzewski commented 8 years ago

cool, if you end up looking into an implementation give a shout in the drone developer channel at https://gitter.im/drone/drone-dev . I'm sure you could find some others interested in lending a hand :)

bradrydzewski commented 8 years ago

Let's re-open this but with a slightly adjusted scope of adding experimental support for an alternate kubernetes engine, alongside the existing docker engine. I would love to hear what @gtaylor thinks about this and what might be possible.

tboerger commented 8 years ago

When I'm more familiar with the k8s codebase I would really like to give building a real k8s agent a try. That's something my team lead also asked for.

gtaylor commented 8 years ago

I've got a crappy custom scheduler that can parse a .drone.yml and fire up pods with some Drone plugins working. This isn't useful for Drone itself, but served as a nice exercise to get a feel for what this would look like. A few notes about what I did:

Things I haven't got around to:

bradrydzewski commented 8 years ago

@gtaylor thanks for the detailed reply. Regarding using kubernetes secret store, I'm wondering how well that would work with drone. Thoughts on https://github.com/drone/drone/issues/1808#issuecomment-252933474 ?

gtaylor commented 8 years ago

@bradrydzewski I think you create multiple Kubernetes Secret resources, perhaps one per pipeline step (a container within the build Pod). You can pull these secrets into each container (step) individually. Drone would determine which secrets go to which steps and stuff those secrets in the step's respective Secret object.

Also note that each step container can use multiple Secret objects to pull env vars (or mount as files) from. That may or may not be useful to you.

While what I am describing above doesn't lend itself to a ton of value over what Drone has, the values won't be viewable in kubectl describe pod, unlike straight env vars. It'd be very important not to show secret values in pod descriptions.
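To make that concrete, here is a rough sketch (all names hypothetical) of a per-step Secret and a step container consuming it; Drone would create these per build and clean them up afterwards:

apiVersion: v1
kind: Secret
metadata:
  name: build-42-publish          # hypothetical: one Secret per build step
type: Opaque
stringData:
  DOCKER_PASSWORD: not-a-real-password
---
apiVersion: v1
kind: Pod
metadata:
  name: build-42
spec:
  restartPolicy: Never
  containers:
    - name: publish               # the pipeline step that needs the secret
      image: plugins/docker
      env:
        - name: DOCKER_PASSWORD
          valueFrom:
            secretKeyRef:
              name: build-42-publish
              key: DOCKER_PASSWORD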

gtaylor commented 8 years ago

Also, I don't think it'd be worth heavily pursuing deep Kubernetes integration until Kubernetes 1.5 lands. The Job system is still shaking out in major ways. Secrets are going to be seeing lots of expansion soon, as are the various network/resource ACLs. The jump from 1.3 to the recent 1.4 saw a cron-like Job scheduling system become available in alpha, so that's still super raw as well.

It'd certainly be worth tinkering with and building some familiarity, but this is going to take a good bit of thought and care to do well. We'd need to get kind of hacky with build pods and plugins to make it work well right now.

Things look super bright in the not-so-distant future.

derekperkins commented 7 years ago

@gtaylor Now that 1.5 is out, do you feel more confident about tackling this?

bradrydzewski commented 7 years ago

FWIW I am also interested, at some point, in trying to figure out what a "serverless" drone would look like. I think the concept of a build queue and pending builds could be eliminated by using the on-demand capabilities of services like hyper.sh and rackspace carina. I'm sure other vendors will launch similar on-demand capabilities as well.

I'm not sure how Kubernetes fits into the picture here, but am interested in the overall concept.

gtaylor commented 7 years ago

@derekperkins It's definitely more possible now. It would still be a whole lot of work to do really well, in that the perfect situation is that we're scattering the work out across multiple Pods. Failing to achieve that means that we're not any better off than we currently are.

It's one of those things where this could be really awesome if done right, but it could also be thoroughly underwhelming and a black mark otherwise. We'd have to provide something more compelling and capable than all of these Jenkins + Kubernetes whitepapers (a well-trodden path at this point), at the very minimum.

FWIW I am also interested, at some point, trying to figure out what a "serverless" drone would look like.

It could be neat, but is there any money in that? At what point do you just run Circle CI/Travis/CodeShip/Shippable or one of the infinite other hosted solutions that are effectively "serverless" from the customer's perspective? Can't imagine the bigger money on-prem orgs using those services with their metal.

If you really don't want to maintain servers, fire up a Google Container Engine (hosted Kubernetes) cluster and install Drone. They maintain the VMs and it's cheap ($5/month at the lowest level). You can still get your fingers in if you want to have the cluster auto-scale up/down as jobs pile up, and you can mix in their equivalent of Spot instances (pre-emptible VMs). If and when you eventually want to take more direct control with your own cluster, Container Engine runs the same Kubernetes that is found in the open source project's repo.

I'm not sure how Kubernetes fits into the picture here

It's still probably a little early for Drone and Kubernetes to go down this road too much yet, but it fits into the picture in that it's not a proprietary, closed-source option like hyper.sh and Rackspace Carina :) It also now has far more adoption and mindshare than those two relatively niche services.

bradrydzewski commented 7 years ago

I'm going to list some challenges based on my conversation with @gtaylor. I'm not a kubernetes expert, so my apologies if I misinterpreted the discussion.

  1. kubernetes has no notion of sequential or chained jobs, so this will need to be simulated
  2. drone uses a single shared network for all containers in the build process, which means service containers are available at localhost. This could pose a challenge depending on how we implement kubernetes support; using multiple pods would break this
  3. drone supports single-machine fan-in / fan-out, meaning build steps can run in parallel while sharing access to the underlying build workspace (where the code is cloned). In kubernetes, if using multiple pods, this could be difficult since we can only mount the volume to a single pod at a time.

These are some of the main challenges that we will face with native kubernetes support, as I understand it.

We could definitely create a very basic prototype implementation that showcases drone using kubernetes as the backend, but it would have some initial limitations:

  1. steps run sequentially, without any parallelism
  2. service containers would use custom hostnames, and not localhost

Perhaps with a basic implementation in place, we could engage the kubernetes community and use it as a starting point and figure out how to fill in the remaining gaps.
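To illustrate points 2 and 3, here is a minimal sketch (names hypothetical) of the single-pod approach: containers in one pod share a network namespace, so a service container is still reachable at localhost, and an emptyDir volume can stand in for the shared workspace. Point 1 (sequencing the steps) is the part drone would still have to simulate, since kubernetes starts all containers in a pod concurrently.

apiVersion: v1
kind: Pod
metadata:
  name: build-42                  # hypothetical: one pod per build
spec:
  restartPolicy: Never
  volumes:
    - name: workspace             # shared clone/workspace for all build steps
      emptyDir: {}
  containers:
    - name: redis                 # service container, reachable from the steps at localhost:6379
      image: redis:latest
    - name: backend               # build step; ordering relative to other steps is not enforced by kubernetes
      image: golang
      workingDir: /drone/src
      command: ["/bin/sh", "-c", "go build && go test"]
      volumeMounts:
        - name: workspace
          mountPath: /drone/src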

JeanMertz commented 7 years ago

As a reference point: we are currently using Jenkins on top of Kubernetes, together with some plugins (one of them being the kubernetes-plugin), to simulate what I'd like Drone to do/represent.

Jenkins comes with a lot of baggage (mostly good, some bad, some ugly), but the current set-up looks something like this:

bradrydzewski commented 7 years ago

@JeanMertz is this something you would be willing to help implement? I have no real world experience with Kubernetes and have quite a lot on my plate. Perhaps if this were a community effort it would have more of a chance of succeeding. What do you think?

jmn commented 7 years ago

Hi,

kubernetes has no notion of sequential or chained jobs. This will need to be simulated

I am not sure if this is what is meant but there are Init Containers:

An init container is exactly like a regular container, except that it always runs to completion and each init container must complete successfully before the next one is started.
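A minimal sketch of that behavior using the initContainers field (step names hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: sequential-steps
spec:
  restartPolicy: Never
  initContainers:                 # run strictly one after another; each must exit 0 before the next starts
    - name: step-build
      image: alpine:3.4
      command: ["sh", "-c", "echo build"]
    - name: step-test
      image: alpine:3.4
      command: ["sh", "-c", "echo test"]
  containers:                     # regular containers only start after all init containers succeed
    - name: step-notify
      image: alpine:3.4
      command: ["sh", "-c", "echo notify"]

Note, though, that init containers on their own cannot express linked service containers or conditional step execution, so they only cover part of the problem.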

bradrydzewski commented 7 years ago

@jmn I think perhaps a better way of describing the issue is that kubernetes does not easily map to the drone yaml at this time. The drone yaml executes batch steps, with linked services, and needs to evaluate at runtime whether or not a step should be executed based on the results of prior steps.

Consider this configuration:

pipeline:
  backend:
    image: golang
    commands:
      - go build
      - go test
  frontend:
    image: node
    commands:
      - npm install
      - npm run build
  publish:
    image: plugins/docker
    repo: foo/bar
    when:
      event: push
  deploy:
    image: plugins/ssh
    shell:
      - docker pull foo/bar
      - docker stop foo/bar
      - docker run foo/bar
    when:
      event: deployment
      branch: master
  notify:
    image: plugins/slack
    channel: dev
    when:
      status: [ success, failure ]

services:
  redis:
    image: redis:latest
  mysql:
    image: mysql:latest

This doesn't mean it is impossible, though. The suggestion by @JeanMertz is really interesting: each step would be its own pod, with its own set of services, and Drone would handle orchestrating sequential pod execution to emulate build steps.

Unfortunately I do not have any experience with kubernetes outside of reading a few blog posts, so it is not something I will be able to implement at this time. Community contributions very welcome :)

bradrydzewski commented 7 years ago

I should point out that I'm also not connected to the kubernetes community. If there are individuals in the kubernetes community that you think might be interested in helping implement a native-kubernetes CI system, please help them get in touch @ gitter.im/bradrydzewski

webwurst commented 7 years ago

I would like to help out where I can. We have been using a small/cheap Kubernetes cluster for some time for some open-data projects: https://github.com/codeformuenster/kubernetes-deployment

And we used Drone to create Docker images for ARM a while ago: https://github.com/armhf-drone-plugins

Haven't played with Drone 0.5 yet unfortunately. And the constraint would be time, as always ;)


derekperkins commented 7 years ago

@clintberry just got drone working on kubernetes and is working on a helm install package

benschumacher commented 7 years ago

I've used a slightly modified version of this repo from @vallard to deploy Drone to Kubernetes:

https://github.com/vallard/drone-kubernetes

Works well enough, though it does bypass the Kubernetes scheduler, and connects to the docker daemon on the host directly. Would definitely be interested in a solution that could work within that context.

One note: I think trying to map some of Drone's concepts to Kubernetes directly isn't going to help make this happen. In general, I think the design @gtaylor suggested above has some merit, though I'm not sold that having a new namespace-per-build is necessary. Working within the context of a pod isn't too far off from what Drone is doing right now to run builds; it would just require some Kubernetes-specific logic to ensure that the various "components" within a Pod are a) started in the right order, b) executed serially, etc. Keep in mind that the main benefit of a Pod within Kubernetes is to link together dependent containers w/o relying on cluster-wide functions like service discovery, services, ingress controllers, etc.

I think starting w/ a set of requirements around what is expected from the build scheduler, including some of the newer features around shared volumes, matrix builds, fan-out, fan-in, could help clarify what is required. The effort should start small, too: just solve a simple use case that 1) clones a repo, 2) executes a build step, 3) collects logs from these steps. I'd be happy to carve out some time to look into this deeper, but I doubt I'd have much time to contribute much in the way of coding in the near future.

bradrydzewski commented 7 years ago

@benschumacher my ultimate goal is to create a compiler of sorts (note that it is not entirely vaporware, as I do have certain pieces working). The compiler would consist of frontends which would take different configuration formats (drone.yml, travis.yml, bitbucket-pipelines.yml) and compile them down to an intermediate representation. I have an implementation of this that works with the .drone.yml and bitbucket-pipelines.yml.

I am working on a formal specification for the intermediate representation: https://github.com/cncd/cncd/blob/master/content/spec/ir.md

Ideally this intermediate representation would work with multiple backends, where a backend is a container engine (such as Docker) or an orchestration engine (such as Kubernetes). I have a working backend for Docker and am confident this could work for LXD and Rocket. I am not sure what changes would be required for this to work with Kubernetes, however, I am optimistic that this is a solvable problem.

I think starting w/ a set of requirements around what is expected from the build scheduler, including some of the newer features around shared volumes, matrix builds, fan-out, fan-in, could help clarify what is required

I think the specification for the IR gets us closer to this goal. It might still be too docker specific though, so would love to hear your feedback, and would love to have you as part of the working group (that invite extends to anyone in this thread as well).

I'd be happy to carve out some time to look into this deeper, but I doubt I'd have much time to contribute much in the way of coding in the near future.

At this stage if you can participate in more of an architectural (non-coding) capacity it would still be tremendously helpful. The backend implementations tend to be quite small, so if we get the IR specification in a good place the implementation should hopefully be pretty straightforward.

clintberry commented 7 years ago

Forgive my ignorance, but I don't understand what you are all trying to accomplish here. Kubernetes just uses docker under the hood, and Drone runs great in kubernetes right now. I installed the drone server as a deployment, and drone agents as a deployment. I can scale the agents now with 2 clicks as needed. I don't think adding a tighter integration to kubernetes gives you anything special, but this is where my ignorance comes in. What features are you looking for with a deeper kubernetes integration?

kop commented 7 years ago

Forgive my ignorance, but I don't understand what you are all trying to accomplish here. Kubernetes just uses docker under the hood, and Drone runs great in kubernetes right now. I installed the drone server as a deployment, and drone agents as a deployment. I can scale the agents now with 2 clicks as needed. I don't think adding a tighter integration to kubernetes gives you anything special, but this is where my ignorance comes in. What features are you looking for with a deeper kubernetes integration?

@clintberry, such a setup is very limited. With the Kubernetes scheduler, the following features can be achieved:

clintberry commented 7 years ago

Avoid Docker In Docker usage

I don't know if I am comfortable letting my CI engine spin up Kubernetes pods in production. I think I would rather keep the Docker-in-Docker methods for isolation/security. I'm sure you could still get some sort of isolation/security with drone-created pods, but is it worth the hassle?

"Real" scaling...

I am new to drone, so maybe I was wrong in assuming this, but I assumed I would be able to run only one concurrent job/build per agent that is connected to my drone server. For me, this is ideal because I can control the amount of resources that my build system uses. I don't want infinite scaling of my build system using precious production resources.

When you run a DIND setup, any services you create in your drone.yml will be deployed to the same host your agent is running on. This is bad; they should deploy to the least busy host instead.

I can see why you would like that. Especially if you have large services you need to spin up. I still don't want anything running outside of my agent docker, but I totally could see why you would want this. But at the same time, each agent gets distributed to kube workers according to load, so you get at least some distribution of resources, but certainly not to the level you are suggesting here.

I understand I am probably being too narrow-minded on this. I apologize if I am coming across confrontational. I am just trying to understand your use cases a bit more.

kop commented 7 years ago

I don't know if I am comfortable letting my CI engine spin up Kubernetes pods in production. I think I would rather keep the Docker-in-Docker methods for isolation/security. I'm sure you could still get some sort of isolation/security with drone-created pods, but is it worth the hassle?

I don't think that's really the case. K8S offers way better isolation than DIND. Since the DIND instances run in privileged mode, they can easily escalate to the host, and you probably don't want this.

I am new to drone, so maybe I was wrong in assuming this, but I assumed I would be able to run only one concurrent job/build per agent that is connected to my drone server. For me, this is ideal because I can control the amount of resources that my build system uses. I don't want infinite scaling of my build system using precious production resources.

I don't really follow Drone development right now, and use Jenkins instead only because it has K8S integration. How it works in Jenkins: build nodes are created dynamically when a build is queued. I can control the limit of how many agents I allow to run concurrently. When my build queue is empty, I run no nodes and don't pay for anything. When I have a bunch of builds to run, nodes are created automatically to allow fast throughput of my builds. For me this is ideal.

I understand I am probably being too narrow-minded on this. I apologize if I am coming across confrontational. I am just trying to understand your use cases a bit more.

We are here to discuss and find the best solutions, aren't we? :)

JeanMertz commented 7 years ago

I don't think it's the case really. K8S offers way better isolation than DIND. Since the DIND instances run in privileged mode, they can easily escalate to the host, and you probably don’t want this.

Indeed. It's far safer to let Kubernetes manage the scheduling of the containers for you (since that's what it is, a container scheduler) than to give free rein over your Docker daemon without any checks and balances on how it will behave.

Furthermore, if you cut out the scheduler in this case, then Kubernetes no longer has any accurate information on freely available system resources, since it doesn't manage them anymore, and thus your system will become more unstable when going the DinD route.

clintberry commented 7 years ago

I thought it was running docker inside of my drone agent pod? My drone agent is controlled by the scheduler and k8s knows its resources. I mean, you do mount the folder for the docker runtime, but you mount it inside your container and it executes inside the container. Perhaps I am missing something technically here.

JeanMertz commented 7 years ago

@clintberry yes, apologies. You are correct. I was confusing the DinD setup with the shared docker daemon setup.

In your case, you run a new Docker daemon inside a Docker container.

Still though, as @kop mentioned, you still need to run that initial (Kubernetes-managed) container in privileged mode, which increases the risk of problems.

Also, there's this blogpost from the original Docker in Docker creator warning about using such a system specifically for CI usage.

There, the "bind-mount docker socket" solution is also mentioned:

[this] can be avoided by bind-mounting the Docker socket into your Jenkins container instead

But this is precisely what you don't want to do in a Kubernetes managed environment, for all the reasons I mentioned above.

clintberry commented 7 years ago

Okay, that all makes sense... I guess for now I will make a specific kube worker for CI and mount the host docker socket. But I see why you are all doing what you are doing now. Thanks for the explanations :-)

bradrydzewski commented 7 years ago

I can provide a bit more context. This is going to be a bit long winded, so I apologize in advance ...

We are spending a lot of time building a world-class scheduler for Drone. This scheduler needs to distribute builds across many servers, taking resource usage, architecture (windows vs linux) and user-defined parameters into account. The scheduler needs to be efficient, fault tolerant and stable. This piece of software is incredibly tedious to write, and it is easy to introduce bugs that are hard to reproduce and debug. The scheduler represents a large chunk of the Drone codebase, and takes up a significant amount of my time.

Kubernetes and Swarm and Hyper and ECS and others also have advanced scheduling and orchestration capabilities. They also have auto-scaling. There is a clear overlap in functionality that has me asking myself if I should continue to invest so much of my time in Drone's internal scheduling and orchestration capabilities, especially given the pace of development and number of man hours being invested in solutions like Kubernetes.

I think the smart move is for Drone to have a very basic embedded scheduler for small installations, but to use Kubernetes / Swarm / etc for larger scale and more elastic installations, with more advanced resource scheduling needs.

I also think we should really re-consider the build queue. Right now every CI system out there (Jenkins, Travis, Circle, Drone and others) has a build queue. You see builds sitting in a queue waiting for an available server. With high volume you could see your build sitting in the queue for 30 minutes or more.

In a world where we have elastic compute and are pushing the idea of serverless, the build queue feels so 2003. I would like to remove the build queue from Drone and rely on orchestration systems like Kubernetes / Swarm / etc to handle elasticity so that a build is never pending ever again. If a cluster lacks capacity, it is automatically increased to execute your build with little or no wait.

Imagine one day you download and configure Drone with access to your Kubernetes environment and you automatically have the ability to elastically scale from 1 to 1000 build servers. Or you hook it up to Hyper with a seemingly infinite pool of bare metal servers and per-second billing, so that you no longer pay for idle servers and unused compute time. This is the future I want.

derekperkins commented 7 years ago

Imagine one day you download and configure Drone with access to your Kubernetes environment and you automatically have the ability to elastically scale from 1 to 1000 build servers. Or you hook it up to Hyper with a seemingly infinite pool of bare metal servers and per-second billing, so that you no longer pay for idle servers and unused compute time. This is the future I want.

So much winning. :) I think it's a very smart move to outsource the scheduling to purpose built scheduling systems, and focus your dev time on true differentiators.

I would like to remove the build queue from Drone and rely on orchestration systems like Kubernetes / Swarm / etc to handle elasticity so that a build is never pending ever again. If a cluster lacks capacity, it is automatically increased to execute your build with little or no wait.

I'm not sure that the queue should disappear. While waiting for builds is annoying, that will always be an option people will need/want. If you're running kubernetes on your own bare metal, you're not necessarily going to be able to auto-scale. Same goes if you're under strict budget restrictions. In those cases, you are going to be ok waiting until spare capacity is available.

For people who do have access to autoscaling, however, drone could export metrics like "jobsInQueue" that you could use as a scaling metric for drone workers, making it possible to run as many builds in parallel as you want.
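As a rough sketch of how that could look, assuming drone exported such a metric and a metrics adapter made it available to the cluster (both assumptions, including the metric name), a HorizontalPodAutoscaler could scale the agent deployment on it:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: drone-agent
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: drone-agent               # hypothetical agent deployment
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: drone_jobs_in_queue # hypothetical metric exported by drone
        target:
          type: AverageValue
          averageValue: "1"         # aim for roughly one queued job per agent replica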

gtaylor commented 7 years ago

If you're running kubernetes on your own bare metal, you're not necessarily going to be able to auto-scale

This doesn't really matter, since you're firing off Kubernetes Jobs. You end up with queue-like behavior. Whether you choose to auto-scale it doesn't matter.

clintberry commented 7 years ago

This doesn't really matter, since you're firing off Kubernetes Jobs. You end up with queue-like behavior. Whether you choose to auto-scale it doesn't matter.

Yes, if kubernetes doesn't have enough resources, it will wait until it does. But I may not want to max out my k8s cluster. I, for one, would like to put limits on how many resources my build process can use and queue up anything beyond that, without sucking up all the ram on my cluster.

@bradrydzewski - I agree with your vision. Drone is awesome and my company will help support you with whatever direction you take.

bradrydzewski commented 7 years ago

Yes, if kubernetes doesn't have enough resources, it will wait until it does. But I may not want to max out my k8s cluster.

I believe this is a problem that can be solved. Kubernetes has the ability to assign pods to a subset of nodes in your cluster. This should be conceptually no different than installing the agent on a subset of nodes in your cluster which in turn limits builds to these nodes.
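For example (label name hypothetical), build pods could carry a nodeSelector so they only land on nodes you have labeled for CI work, e.g. with kubectl label node <name> drone-builds=true:

apiVersion: v1
kind: Pod
metadata:
  name: build-42
spec:
  restartPolicy: Never
  nodeSelector:
    drone-builds: "true"            # only schedule onto nodes carrying this label
  containers:
    - name: build
      image: alpine:3.4
      command: ["sh", "-c", "echo build"]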

... but I think we are getting ahead of ourselves, which is my fault for injecting my sometimes far-fetched ideas into the thread 😄 . I think the first step is to create a specification and proof of concept that schedules builds using Kubernetes as a backend. Right now I'm most interested in proving whether or not it is possible. It is absolutely possible we may find a Kubernetes backend doesn't make sense.

So if and when we have a working proof of concept, we can present it to the community and collect feedback, potential concerns, and edge cases.

bradrydzewski commented 7 years ago

I agree with your vision. Drone is awesome and my company will help support you with whatever direction you take.

and thanks! If you are interested in helping shape how Drone runs on Kubernetes I would definitely encourage you to join the (soon to be formed) working group.

derekperkins commented 7 years ago

This doesn't really matter, since you're firing off Kubernetes Jobs. You end up with queue-like behavior. Whether you choose to auto-scale it doesn't matter.

For sure, that makes lots of sense. I wasn't sure how deeply integrated you were thinking.

Yes, if kubernetes doesn't have enough resources, it will wait until it does. But I may not want to max out my k8s cluster. I, for one, would like to put limits on how many resources my build process can use and queue up anything beyond that, without sucking up all the ram on my cluster.

The best part about this proposal is that drone shouldn't have to worry about your resource consumption, but instead, rely on kubernetes to expose resource restrictions for jobs.

gtaylor commented 7 years ago

But I may not want to max out my k8s cluster. I, for one, would like to put limits on how many resources my build process can use and queue up anything beyond that, without sucking up all the ram on my cluster.

@clintberry Then put resource limits on your build namespace. Kubernetes offers all kinds of ways for you to throttle your builds.
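For example, a ResourceQuota on a dedicated build namespace (name and numbers hypothetical) caps the total CPU, memory, and pod count that builds can consume. One caveat: pods created beyond the quota are rejected rather than queued, so drone would still need a thin retry/queue layer on its side.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: build-quota
  namespace: drone-builds           # hypothetical namespace where build pods are created
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "20"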

so0k commented 7 years ago

In a world where we have elastic compute and are pushing the idea of serverless, the build queue feels so 2003.

GitLab CI with the Multi Runner and docker-machine provisioner kind of does this (albeit in an ugly way): https://github.com/honestbee/gitlab-infra#registering-multi-runner

@MacTynow might be interested as he set up our drone Helm charts

ekozan commented 7 years ago

I'm really interested in this one, include me in the working group :p

bradrydzewski commented 7 years ago

Another option worth considering is to create a version of the drone agent that only runs a single build, and is bundled with dind for running build steps. The workflow would look something like this:

  1. drone receives github hook
  2. drone schedules special one-time agent to run on kubernetes cluster
  3. kubernetes starts agent container (which is bundled with dind)
  4. agent executes build pipeline using embedded dind
  5. agent sends results back to drone
  6. agent exits
  7. kubernetes deletes agent container

I realize this isn't the most optimal solution, and isn't as cool as scheduling per-step pods, but keep in mind that I am thinking of baby steps here. It is an interim solution ... you know ... better a diamond with a flaw than a pebble without
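As a rough sketch of what steps 2-4 could look like (the Job name is hypothetical, and the "run one build, then exit" agent mode is exactly the piece that would need to be built), the dind + agent pairing could be wrapped in a Kubernetes Job instead of a long-running deployment:

apiVersion: batch/v1
kind: Job
metadata:
  name: drone-build-42              # hypothetical: one Job per incoming hook
spec:
  backoffLimit: 0                   # do not retry a failed build automatically
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: dind                # embedded docker daemon used to run the build steps
          image: docker:1.12.6-dind
          securityContext:
            privileged: true
        - name: agent               # would run exactly one build, report back, then exit
          image: drone/drone:0.5
          command: ["/drone", "agent"]
          env:
            - name: DOCKER_HOST
              value: tcp://localhost:2375
            # DRONE_SERVER / DRONE_SECRET omitted; they would be injected the
            # same way as in a normal agent deployment

One wrinkle with this sketch: the dind sidecar has to be shut down when the agent exits, otherwise the pod (and therefore the Job) never completes.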

I also think the google folks might be doing something similar, but not sure https://github.com/GoogleCloudPlatform/jenkernetes#note

pros to this approach:

  1. uses the kubernetes scheduler!
  2. does not require a major overhaul to how drone works!!! (major bonus points)
  3. does not mutate the host
  4. does not schedule containers on the k8s host that k8s is not aware of

cons to this approach:

  1. inability to distribute a single build across multiple machines
  2. dind may not work in some environments
  3. ephemeral build environments lose ability to cache docker image downloads. Mounting the cache is only an option when running 1 build per host

technical considerations:

  1. the drone-docker plugin uses dind. Can we run dind in dind?

And in case someone is considering posting a link to this blog post keep in mind that it is quite old and docker has seen much progress since. I personally support using dind where it makes sense, and do not view it as an absolute evil :)

tonglil commented 7 years ago

drone schedules special one-time agent to run on kubernetes cluster

  1. To clarify, would this be a Job resource? Because I think that really fits the bill here.

inability to distribute a single build across multiple machines

  1. Why/where is this a benefit? Sorry I didn't find anything concrete on this anywhere earlier on.

dind may not work in some environments

and

the drone-docker plugin uses dind. Can we run dind in dind?

  1. I am not sure if you want to run dind completely, run drone-docker (and other whitelisted drone plugins) as a sibling and the build in dind, or run all the containers as siblings (probably not ideal for the drone-docker plugin's isolation levels, though we have experienced the desire to be able to use layer caching for large builds. One way maybe via "trusted repos"?).

  2. On another note, I am not sure what the consequences of having Drone run "sibling" containers next to K8s managed containers are (K8s cleans up containers and images when disk space/inodes get low, but not sure if it would touch the ones Drone created).

bradrydzewski commented 7 years ago

To clarify, would this be a Job resource? Because I think that really fits the bill here.

Yes probably, I am still learning this stuff 😄

Why/where is this a benefit? Sorry I didn't find anything concrete on this anywhere earlier on.

There is another github issue #1814 that discusses a fan-out to run build steps in parallel on multiple machines. This probably needs to remain out of scope. So ignore what I said :smile:

I am not sure if you want to run dind completely, run drone-docker (and other whitelisted drone plugins) as a sibling and the build in dind, or run all the containers as siblings

If you are running drone:0.5 agents on Kubernetes, you are mounting the docker socket into the agent container, and the agent is launching sibling containers. The complaints I hear are that people don't like agents bypassing Kubernetes and interacting directly with the host machine's docker daemon.
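For reference, this is roughly the pattern being complained about (a trimmed sketch; the usual DRONE_SERVER / DRONE_SECRET configuration is omitted): the host docker socket is bind-mounted into the agent, so every build container the agent starts is a sibling that Kubernetes knows nothing about.

apiVersion: v1
kind: Pod
metadata:
  name: drone-agent-socket          # illustration only; normally part of a Deployment
spec:
  containers:
    - name: drone-agent
      image: drone/drone:0.5
      command: ["/drone", "agent"]
      volumeMounts:
        - name: docker-socket
          mountPath: /var/run/docker.sock   # agent talks straight to the node's docker daemon
  volumes:
    - name: docker-socket
      hostPath:
        path: /var/run/docker.sock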

I think having some sort of roadmap would definitely help here, since I think this will take multiple iterations before we get to the ideal implementation. Should the initial implementation focus on using the Kubernetes scheduler and eliminating agents? If yes, the initial implementation should probably continue using sibling containers like you mentioned. If the initial focus should be on avoiding direct interaction with the host machine docker daemon, then we probably need to look at other solutions.

On another note, I am not sure what the consequences of having Drone run "sibling" containers next to K8s managed containers are (K8s cleans up containers and images when disk space/inodes get low, but not sure if it would touch the ones Drone created).

This has not been raised as an issue, but perhaps the teams using drone:0.5 on Kubernetes can provide more insight.

jmccann commented 7 years ago

the drone-docker plugin uses dind. Can we run dind in dind?

Yes you can. And I was just playing with gitea/drone inside minikube and did such a setup with an agent

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: drone-agent
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: drone-agent
    spec:
      containers:
        - image: docker:1.12.6-dind
          name: dind
          ports:
            - containerPort: 2375
              protocol: TCP
          securityContext:
            privileged: true
        - image: drone/drone:0.5
          imagePullPolicy: Always
          name: drone-agent
          command:
            - "/drone"
            - "agent"
          env:
            - name: I_UNDERSTAND_I_AM_USING_AN_UNSTABLE_VERSION
              value: "true"
            - name: I_AGREE_TO_FIX_BUGS_AND_NOT_FILE_BUGS
              value: "true"

            - name: DRONE_DEBUG
              valueFrom:
                configMapKeyRef:
                  name: drone-config
                  key: agent.debug.is.enabled
            - name: DRONE_SERVER
              valueFrom:
                configMapKeyRef:
                  name: drone-config
                  key: agent.drone.server.url
            - name: DRONE_SECRET
              valueFrom:
                secretKeyRef:
                  name: drone-secrets
                  key: server.secret

            - name: DOCKER_HOST
              value: tcp://localhost:2375
          resources:
            requests:
              cpu: 1
              memory: 1G

And it seemed to work just fine with the following .drone.yml

pipeline:
  build:
    image: alpine:3.4
    commands:
      - echo 'hi'

  docker:
    image: plugins/docker:latest
    repo: jmccann/test
    dry_run: true

Also, it seems pretty simple to launch/run/delete singular containers in kubernetes even:

$ kubectl run busybox --image=busybox sleep 5  --restart='Never'
pod "busybox" created
$ kubectl get pods --show-all
NAME                            READY     STATUS    RESTARTS   AGE
busybox                         1/1       Running   0          9m
$ kubectl get pods --show-all
NAME                            READY     STATUS    RESTARTS   AGE
busybox                         1/1       Running   0          9m
$ kubectl get pods --show-all
NAME                            READY     STATUS      RESTARTS   AGE
busybox                         0/1       Completed   0          9m
$ kubectl delete pods busybox
pod "busybox" deleted
$ kubectl get pods --show-all
NAME                            READY     STATUS    RESTARTS   AGE

Seems really similar to the docker run command.

I'm working on cleaning up my gitea/drone minikube config with instructions to maybe help people get started playing with Drone in kubernetes locally.

https://github.com/jmccann/drone-in-minikube

paultiplady commented 7 years ago

Interested in following along with this too, I'm new to Drone but familiar with Kubernetes. @gtaylor you mentioned an early pod-based approach a while ago in this thread; is that code available somewhere to look at?

gtaylor commented 7 years ago

@paultiplady That was me doing some tinkering outside of Drone with the Kubernetes API. I couldn't figure out anything that I thought was much of an improvement on the current state of things (running agents).

ekozan commented 7 years ago

I really love @jmccann's solution, it just needs to be tested more :P

derekperkins commented 7 years ago

@bradrydzewski Did you and @kelseyhightower end up working on this at NEXT?

bradrydzewski commented 7 years ago

@derekperkins we did not, but I think I'm pretty clear on the design he proposed. I'm planning to take a shot at a basic implementation in the coming weeks. I'll post updates as I have them. Was good meeting you at Next!