taskcluster / ec2-manager

Mozilla Public License 2.0

Many improvements #56

Closed jhford closed 6 years ago

jhford commented 6 years ago

Here are three major commits to ec2-manager.

Polling DB for Terminations

This lets us know why instances shut down. We're already able to track a couple of things about instance terminations, mainly that they happen. This patch adds a simple poller that checks every 100s (I think) for up to 200 terminations and updates them in the database with their code and reason.
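For illustration, a single poll pass could be sketched roughly as below. This is not the patch's code (ec2-manager itself is JavaScript); `poll_once`, `run_poller`, the `describe` callable and the in-memory list standing in for the terminations table are all hypothetical names, and the 100-second interval and 200-row batch size are simply the figures quoted above.

```python
import time

POLL_INTERVAL_S = 100  # interval quoted in the description (stated with some uncertainty)
BATCH_SIZE = 200       # upper bound on terminations handled per pass


def poll_once(terminations, describe, batch_size=BATCH_SIZE):
    """Fill in code/reason for up to batch_size terminations that lack them.

    terminations: list of dicts standing in for rows of the terminations table.
    describe: hypothetical callable mapping a list of instance ids to
              {id: (code, reason)}; a real service would ask EC2 here.
    Returns the number of rows updated.
    """
    pending = [row for row in terminations if row.get("code") is None][:batch_size]
    if not pending:
        return 0
    found = describe([row["id"] for row in pending])
    updated = 0
    for row in pending:
        if row["id"] in found:
            row["code"], row["reason"] = found[row["id"]]
            updated += 1
    return updated


def run_poller(terminations, describe, passes, sleep=time.sleep):
    """Run a fixed number of passes, sleeping POLL_INTERVAL_S between them."""
    total = 0
    for i in range(passes):
        total += poll_once(terminations, describe)
        if i < passes - 1:
            sleep(POLL_INTERVAL_S)
    return total
```

Capping each pass at a fixed batch size keeps a backlog of unexplained terminations from turning into one very large lookup; the backlog just drains over several passes.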

These are the error codes: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html#api-error-codes-table-server

They'll let us see why instances that were running ended. Since we're tracking instances that terminate in good and bad states, as well as some metadata, we can do things like see whether a specific region/zone/type triplet is getting killed by Amazon a lot. We could also see things like the average running time of a specific worker type. Lots of opportunity for inspection!

Samples from my local testing:

testing=> select * from instances;
         id          | workerType |  region   |     az     | instanceType |     state     |   imageId    |        launched        |       lastEvent        |            touched            
---------------------+------------+-----------+------------+--------------+---------------+--------------+------------------------+------------------------+-------------------------------
 i-0e54d792b6eb7ab41 | tutorial   | us-west-1 | us-west-1c | m3.large     | shutting-down | ami-f85c1e98 | 2018-02-05 22:38:30+01 | 2018-02-05 22:40:19+01 | 2018-02-05 22:40:24.936512+01
(1 row)

testing=> select * from terminations;
 id | workerType | region | az | instanceType | imageId | code | reason | launched | terminated | lastEvent | touched 
----+------------+--------+----+--------------+---------+------+--------+----------+------------+-----------+---------
(0 rows)

testing=> select * from terminations;
         id          | workerType |  region   |     az     | instanceType |   imageId    | code | reason |        launched        |       terminated       |       lastEvent        |            touched            
---------------------+------------+-----------+------------+--------------+--------------+------+--------+------------------------+------------------------+------------------------+-------------------------------
 i-0e54d792b6eb7ab41 | tutorial   | us-west-1 | us-west-1c | m3.large     | ami-f85c1e98 |      |        | 2018-02-05 22:38:30+01 | 2018-02-05 22:40:53+01 | 2018-02-05 22:40:53+01 | 2018-02-05 22:40:54.715294+01
(1 row)

testing=> select * from terminations;
         id          | workerType |  region   |     az     | instanceType |   imageId    |             code             |         reason          |        launched        |       terminated       |         lastEvent          |            touched            
---------------------+------------+-----------+------------+--------------+--------------+------------------------------+-------------------------+------------------------+------------------------+----------------------------+-------------------------------
 i-0e54d792b6eb7ab41 | tutorial   | us-west-1 | us-west-1c | m3.large     | ami-f85c1e98 | Client.UserInitiatedShutdown | User initiated shutdown | 2018-02-05 22:38:30+01 | 2018-02-05 22:40:53+01 | 2018-02-05 22:41:25.379+01 | 2018-02-05 22:41:25.380666+01
(1 row)

testing=> 
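As a sketch of the kind of inspection this enables, the snippet below groups terminations by region/zone/type triplet and termination code and computes an average lifetime. It uses an in-memory SQLite table as a stand-in for the real Postgres table; the function name `termination_summary` is hypothetical, only a subset of the columns is recreated, and the timestamps are copied from the sample row without their zone offset.

```python
import sqlite3


def make_sample_db():
    """Create an in-memory stand-in for the terminations table (subset of columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE terminations (
               id TEXT PRIMARY KEY, workerType TEXT, region TEXT, az TEXT,
               instanceType TEXT, code TEXT, launched TEXT, terminated TEXT)"""
    )
    conn.execute(
        "INSERT INTO terminations VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        ("i-0e54d792b6eb7ab41", "tutorial", "us-west-1", "us-west-1c", "m3.large",
         "Client.UserInitiatedShutdown", "2018-02-05 22:38:30", "2018-02-05 22:40:53"),
    )
    return conn


def termination_summary(conn):
    """Count terminations and average lifetime (seconds) per triplet and code."""
    query = """
        SELECT region, az, instanceType, code, COUNT(*) AS n,
               AVG(strftime('%s', terminated) - strftime('%s', launched)) AS avg_lifetime_s
        FROM terminations
        GROUP BY region, az, instanceType, code
        ORDER BY n DESC
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in termination_summary(make_sample_db()):
        print(row)
```

Against the real database the same GROUP BY should carry over, though Postgres would need the camelCase column names double-quoted and its own interval arithmetic in place of `strftime`.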

Remove key prefix concept from ec2-manager

Now that we've got atomic tag-on-create support for instances, we have no need for the ec2-manager to understand the janky key prefix system we used to use. Yay! NOTE that a corresponding provisioner patch will need to be written to change how SSH key creation is handled. We might also wish to make the SSH KeyPair name a user-configurable value in the worker type definition, with a provisioner-wide default. Since it has been more than 96h since we deployed tag-on-create, this should have no impact on running instances, as they should all have been created with accurate tags.

Storing of AWS Request outcomes

This is mainly intended for runInstances, but it was the same amount of work to log all requests. There's tons of information that's super handy for figuring out the state of Amazon. We can use this information to make better choices about which region/zone/type triplets are in good shape and which are to be avoided. Since I'm tracking the workerType for special-cased requests (basically runInstances), we'll now be able to see things like the rate at which invalid parameters are submitted to the EC2 API. We should expose this information.

Samples from my local testing:

testing=> select * from awsrequests where method = 'runInstances';
  region   |              requestId               |    duration     |    method    | service | error |           called           |         code          |                  message                   | workerType |     az     | instanceType |   imageId    
-----------+--------------------------------------+-----------------+--------------+---------+-------+----------------------------+-----------------------+--------------------------------------------+------------+------------+--------------+--------------
 us-west-1 | 0d051dd5-3931-4b70-8fa6-bb40ea8f4bbc | 00:00:01.433223 | runInstances | ec2     | t     | 2018-02-05 22:28:43.905+01 | InvalidParameterValue | Invalid value 'z3.large' for InstanceType. | tutorial   | us-west-1c | z3.large     | ami-f85c1e98
 us-west-1 | 61da2107-f7ac-4407-8800-f3eb26bb09c5 | 00:00:02.140156 | runInstances | ec2     | f     | 2018-02-05 22:30:55.146+01 |                       |                                            | tutorial   | us-west-1c | m3.large     | ami-f85c1e98
 us-west-1 | dbc51c67-93eb-446e-bbc3-8d744efaddb0 | 00:00:01.887676 | runInstances | ec2     | f     | 2018-02-05 22:38:29.005+01 |                       |                                            | tutorial   | us-west-1c | m3.large     | ami-f85c1e98
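To make the "which triplets are in good shape" idea concrete, here is a minimal sketch that computes a runInstances error rate per region/zone/type triplet. The function name `error_rates` and the dict-per-row representation are assumptions for illustration; the three rows mirror the samples above, keeping only the columns the calculation needs.

```python
from collections import defaultdict


def error_rates(requests):
    """Return {(region, az, instanceType): fraction of calls that errored}."""
    totals = defaultdict(lambda: [0, 0])  # key -> [errors, calls]
    for req in requests:
        key = (req["region"], req["az"], req["instanceType"])
        totals[key][1] += 1
        if req["error"]:
            totals[key][0] += 1
    return {key: errors / calls for key, (errors, calls) in totals.items()}


# Rows mirroring the awsrequests samples (method = 'runInstances').
SAMPLE_REQUESTS = [
    {"region": "us-west-1", "az": "us-west-1c", "instanceType": "z3.large",
     "error": True, "code": "InvalidParameterValue"},
    {"region": "us-west-1", "az": "us-west-1c", "instanceType": "m3.large",
     "error": False, "code": None},
    {"region": "us-west-1", "az": "us-west-1c", "instanceType": "m3.large",
     "error": False, "code": None},
]

if __name__ == "__main__":
    for triplet, rate in error_rates(SAMPLE_REQUESTS).items():
        print(triplet, rate)
```

A real version would probably also split by error code, since an InvalidParameterValue points at a bad worker type definition rather than at the health of the region or zone.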

The aws-request stuff might look weird, because it is. @jonasfj I submit this particular patch as an example of how annoying the AWS-SDK library is. See the hack I need to be able to have a promise interface where there's a functional timeout and logging of request ids :)

Flagging @imbstack and @walac for review. CC @djmitche for thoughts on the SSH KeyPair configurability.

jhford commented 6 years ago

This was pushed to the master branch as commit 3f8ea66f0a2afb353df873ce2d83db5660d4b4f0.