lossyrob opened 9 years ago
@mojodna, what are your thoughts on using something like Troposphere? We use it at Azavea with success to ease some of the pain of working with CloudFormation configuration. An example of our deployment setup is here: https://github.com/geotrellis/geotrellis-ec2-cluster/tree/develop/deployment. We have a Makefile that makes launching stacks a command like `make leader-stack`, which uses environment variables to launch cloud-config commands via CloudFormation configurations constructed by Troposphere, with the user data separated out into its own file.
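For concreteness, here's a minimal sketch of the Troposphere approach (resource names and values are illustrative, not taken from the geotrellis-ec2-cluster repo): parameters and resources are plain Python objects, and the CloudFormation JSON is generated at the end.

```python
from troposphere import Parameter, Ref, Template
from troposphere.ec2 import Instance

t = Template()

# Parameters are declared up front as Python objects...
keyname = t.add_parameter(Parameter(
    "KeyName",
    Type="AWS::EC2::KeyPair::KeyName",
    Description="EC2 key pair for SSH access",
))

# ...and referenced from resources with Ref, as in raw CloudFormation.
t.add_resource(Instance(
    "LeaderInstance",
    ImageId="ami-12345678",  # placeholder AMI ID
    InstanceType="m3.medium",
    KeyName=Ref(keyname),
))

print(t.to_json())  # emit the CloudFormation JSON
```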
@hectcastro is the wizard behind this wizardry, but I've been using it for a bit, and I think moving towards some deploy structure like this would make a change like the one in this issue easier. What do you think?
Just wanted to note that the `Makefile` approach will soon be replaced by a small CLI tool in https://github.com/geotrellis/geotrellis-ec2-cluster/pull/26, but the Troposphere stuff will remain.
To add some more color, CloudFormation JSON has been relatively painful to deal with, which is why we looked for alternatives, and ultimately settled on Troposphere. Since Troposphere definitions are written in Python, we read a separate UserData YAML file at Troposphere -> CloudFormation JSON compile time.
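Roughly, the pattern looks like this (file names here are assumptions, not the actual repo layout): the cloud-config lives in its own YAML file and gets read in when the Troposphere definition is compiled down to CloudFormation JSON.

```python
from troposphere import Base64, Template
from troposphere.ec2 import Instance

# The UserData is maintained as a standalone cloud-config YAML file and
# read verbatim at compile time.
with open("cloud-config/leader.yml") as f:
    user_data = f.read()

t = Template()
t.add_resource(Instance(
    "LeaderInstance",
    ImageId="ami-12345678",  # placeholder
    InstanceType="m3.medium",
    UserData=Base64(user_data),  # CloudFormation wants UserData Base64-encoded
))

print(t.to_json())
```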
Another alternative that has matured a bit since we made the Troposphere decision is Terraform. Might be worth taking a look at that if you're interested in side-stepping CloudFormation altogether.
:+1:
I've found CF template creation and testing quite error-prone, so I'm totally open to an abstraction that makes it harder to screw up.
I'll take a look at Troposphere and Terraform.
(This got a bit meta, sorry.)
Ugh. I've yet to see provisioning tools that I like. Docker comes close (and I almost investigated Aminator, though I'm too impatient to wait for AMIs to build), as it comes closest to producing immutable build artifacts that are the app (I feel fortunate to have personally bypassed Chef and Puppet after having witnessed huge amounts of time spent managing them; as a Docker provisioner, ansible would be my first choice after sequential shell scripts). Managing state changes and anticipating conditions sucks and seems like it can be avoided for components with no persistent state by replacing them with CI build artifacts.
However, orchestrating the rest of the mess remains problematic, hence my attraction to CloudFormation vs. anything else (particularly when CF templates can be created from existing AWS resources and then customized). Until recently, it had been working great. (Docker Hub switched the repository for nginx and `pull` commands seem to be timing out, so watercolor tile rendering is currently disabled until I have a chance to debug that--I don't think CF is to blame though, even if rollbacks don't work (because the old versions don't work either any longer).)
I'm the core of our ops team (in addition to my other duties), so I don't have a whole lot of patience for components that require care and feeding. The flip side of this is that I'll often write off incredibly powerful tools if I don't think I need their power (or acquiring it takes too long).
I don't understand the point of Troposphere vs. having something that will validate CF JSON (using JSON schema or similar). Getting runtime errors is nice (but equivalent to linting / validation errors), but it's a thin, possibly incomplete abstraction over CF JSON.
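For reference, the closest existing thing to "just validate the JSON" is CloudFormation's own ValidateTemplate API (shown here via boto3; a sketch, not part of this project). The catch is that it only checks syntax, so invalid property values still surface as runtime failures at stack creation time, which is part of the case for a typed layer.

```python
import boto3

cfn = boto3.client("cloudformation")

# Ask CloudFormation itself to syntax-check the template.
with open("aws/cloudformation.json") as f:
    result = cfn.validate_template(TemplateBody=f.read())

print(result.get("Parameters", []))  # the declared parameters, as parsed by AWS
```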
Terraform would be appealing if I were targeting anything other than AWS, in the same way that Fog abstracts cloud provider APIs. However, this project already assumes SQS, so AWS-only seems like a valid assumption (tools like elasticmq notwithstanding).
Anyway, I think the best solution to this issue is to parameterize the npm package to install (defaulting to vapor-clock on GitHub, at least until a release) within the existing CF template, similar to how the SQS queue name is provided.
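A hedged sketch of that proposal (expressed in Troposphere only for compactness; the same two additions, a parameter plus a reference in UserData, can be made directly in the existing JSON template; "NpmPackage" and the default are illustrative):

```python
from troposphere import Base64, Join, Parameter, Ref, Template
from troposphere.ec2 import Instance

t = Template()

# The package to install becomes a stack parameter, like the queue name.
package = t.add_parameter(Parameter(
    "NpmPackage",
    Type="String",
    Default="stamen/vapor-clock",  # GitHub shorthand, until an npm release exists
    Description="npm package (or GitHub repo) to install",
))

t.add_resource(Instance(
    "Worker",
    ImageId="ami-12345678",  # placeholder
    InstanceType="m3.medium",
    UserData=Base64(Join("", [
        "#!/bin/bash\n",
        "npm install -g ", Ref(package), "\n",
    ])),
))

print(t.to_json())
```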
For what I've been using it for, Troposphere has been a lot clearer to me about how things are configured. Defining configuration params up front is clearer for me to grok and edit than having params buried in JSON like https://github.com/stamen/vapor-clock/blob/master/aws/cloudformation.json#L246. So besides acting as a linter, it's a more intuitive way for me to work with CloudFormation. It also fits into the deployment process we've been iterating on at Azavea for a while, so that's why it's attractive to me.
I get the "writing off powerful tools" thing. I do that as well, especially when it adds complexity where I don't need it. And I image as the core dev ops guy, it'd be another thing to manage/maintain. In my scenario, we have a dedicated dev ops in Hector, and I usually end up just listening to what he says (and that's worked out for me really well). So that's where I'm coming from in trying to get more advanced deployment tooling in.
There are also some things I'd like to do with vapor-clock, or something similar, that will probably require an environment beyond what user data should be pumping into a bare image on instance startup. For instance, if I wanted to call out not just to GDAL shell commands but to some sort of Scala or Python setup with scripts installed, that would add a whole lot of roughness to the CloudFormation file and make it tough to keep a development environment in lockstep with what CloudFormation would be throwing up on EC2.

A way around that is to bake AMIs with exactly what you need, using a provisioning system that can also bake Vagrant boxes, so your dev environment matches the deploy. What we do now at Azavea in a couple of settings is run Ansible against Vagrant boxes for dev, and then use Packer to turn those into AMIs. I'm interested in exploring how Docker could make that process better, but so far I've eschewed using the possibly better tool that I'd have to learn how to use (and am sort of waiting for Hector to get around to it, as he's been wanting to do as well). But the stuff I'm looking to do with vapor-clock for OAM, pre-processing for GeoTrellis ingests, and some other projects seems like a prime target for experimenting with a "provision a container, run the container in dev, spin the container up with CloudFormation" sort of process.
This is all to say, I hear you about not bringing in heavy tooling when it's not needed, but it might end up being the case that it'll be needed, at least for what we're trying to do. If not, then it's good to talk out the examples, but if so I'll probably hack something together and propose something explicit.
First of all, just wanted to say that I took a closer look at what's going on in this project's CloudFormation file, and it seems pretty cool.
> Ugh. I've yet to see provisioning tools that I like. Docker comes close (and I almost investigated Aminator, though I'm too impatient to wait for AMIs to build), as it comes closest to producing immutable build artifacts that are the app (I feel fortunate to have personally bypassed Chef and Puppet after having witnessed huge amounts of time spent managing them; as a Docker provisioner, ansible would be my first choice after sequential shell scripts). Managing state changes and anticipating conditions sucks and seems like it can be avoided for components with no persistent state by replacing them with CI build artifacts.
Right now we use Packer to build AMIs. It's flexible about how the instance gets provisioned: for us, most of the provisioning process happens with Ansible, but support for other CM tools exists (as well as a plain shell provisioner).
It does take a while, but you can also fire off AMI-building jobs in parallel pretty easily, and there is also a `chroot` AMI builder that I haven't played with yet, but it is on my ever-growing list of tools to evaluate.
Once we have an AMI, its ID becomes a parameter we feed to CFN. Usually it gets associated with an ASG launch configuration, and we spin up as many instances in the ASG as we need.
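A minimal sketch of that hand-off (names and sizes are illustrative): the Packer-built AMI ID comes in as a stack parameter and is wired into the launch configuration.

```python
from troposphere import Parameter, Ref, Template
from troposphere.autoscaling import AutoScalingGroup, LaunchConfiguration

t = Template()

# The AMI that Packer just built, supplied at stack creation time.
ami = t.add_parameter(Parameter(
    "AmiId",
    Type="String",
    Description="AMI ID produced by the Packer build",
))

lc = t.add_resource(LaunchConfiguration(
    "WorkerLaunchConfig",
    ImageId=Ref(ami),
    InstanceType="m3.medium",
))

t.add_resource(AutoScalingGroup(
    "WorkerGroup",
    LaunchConfigurationName=Ref(lc),
    MinSize="1",
    MaxSize="4",
    AvailabilityZones=["us-east-1a"],  # illustrative
))

print(t.to_json())
```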
Docker could be a good middle ground for this specific scenario because most of the work to set things up would go into the container image, and then you only really need to have the Docker daemon installed on the EC2 instance, and probably the GPU setup.
> However, orchestrating the rest of the mess remains problematic, hence my attraction to CloudFormation vs. anything else (particularly when CF templates can be created from existing AWS resources and then customized). Until recently, it had been working great. (Docker Hub switched the repository for nginx and `pull` commands seem to be timing out, so watercolor tile rendering is currently disabled until I have a chance to debug that--I don't think CF is to blame though, even if rollbacks don't work (because the old versions don't work either any longer).)
> I'm the core of our ops team (in addition to my other duties), so I don't have a whole lot of patience for components that require care and feeding. The flip side of this is that I'll often write off incredibly powerful tools if I don't think I need their power (or acquiring it takes too long).
> I don't understand the point of Troposphere vs. having something that will validate CF JSON (using JSON schema or similar). Getting runtime errors is nice (but equivalent to linting / validation errors), but it's a thin, possibly incomplete abstraction over CF JSON.
I actually felt the same way the first time I came across Troposphere. A few of the reasons why that changed:
- So far, I haven't hit a missing resource in Troposphere that exists in the CFN documentation. It almost happened once with a new RDS feature (I forget what it was now), but it was added and released the day I tried to use it.
> Terraform would be appealing if I were targeting anything other than AWS, in the same way that Fog abstracts cloud provider APIs. However, this project already assumes SQS, so AWS-only seems like a valid assumption (tools like elasticmq notwithstanding).
There is support for DigitalOcean and GCE, but it isn't masked behind a single API the way Fog tries to do. Personally, I'm OK with that. The promise of Fog's abstraction layer is great, but in practice I've found Fog confusing to get working and difficult to debug (compared to just using an AWS SDK).
With Terraform though, your point against Troposphere potentially being incomplete is definitely true. For example, there is no SQS resource. I think this will change fairly quickly though due to the most recent release of Amazon's Go SDK. Terraform `master` is making use of it, which should allow them to support more resources than they were able to support with the unofficial Go SDK for Amazon, `goamz`.
> Anyway, I think the best solution to this issue is to parameterize the npm package to install (defaulting to vapor-clock on GitHub, at least until a release) within the existing CF template, similar to how the SQS queue name is provided.
@lossyrob @hectcastro thank you both for your detailed responses and for understanding the gist of my objections rather than just writing me off as a crotchety old man ;-) (To retread old ground and re-summarize my circumstances: it's quite common for me to configure stacks and not return to them for months, vs. having a daily or weekly relationship with them beyond the setup stage.)
Speaking of gists, here's the CF template for Toner that uses CoreOS + Docker: https://gist.github.com/mojodna/327ed929a31a4eb978a4
Assuming that this solves a problem for you (hopefully on an ongoing basis), I'm totally willing to ignore my fear of change and trust you guys to make good decisions about how best to handle provisioning (provided you're available to answer questions a few months down the line when I find the need to make a change).
If you're sold on Troposphere, that's good enough for me (here, anyway; it could grow on me for other things).
> Docker could be a good middle ground for this specific scenario because most of the work to set things up would go into the container image, and then you only really need to have the Docker daemon installed on the EC2 instance, and probably the GPU setup.
Yes. The annoying bit is the speed at which Docker images are fetched, but I'll trade that for not having to regularly rebuild and manage AMIs (which is probably not as bad as I'm making it out to be).
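For a sense of scale, here's roughly what the instance-side setup shrinks to in the Docker scenario, assuming an AMI that already ships the Docker daemon (e.g. CoreOS, as in the Toner gist above); the image name is illustrative, and the `docker pull` is the slow step being traded for AMI rebuilds.

```python
from troposphere import Base64, Join

# All app setup lives in the container image; UserData just fetches and runs it.
user_data = Base64(Join("", [
    "#!/bin/bash\n",
    "docker pull example/vapor-clock:latest\n",    # the slow part at boot
    "docker run -d example/vapor-clock:latest\n",
]))
```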
The CloudFormation script hard-codes pulling down vapor-clock master, which makes it hard to test/run custom operations that aren't in that repo's master branch. This feature would allow the repo/branch pulled down to be parameterized, so stacks could be spun up that run against a different set of operations than what master contains.