hackoregon / devops-17

deployment tools for Hack Oregon projects

Project not updating in Integration Environment #47

Closed BrianHGrant closed 7 years ago

BrianHGrant commented 7 years ago

Emergency Response team had a successful build pushed to the AWS integration environment at least 6 hours ago (https://travis-ci.org/hackoregon/emergency-response-backend/builds/216236594). Changes made in the Django application are not showing up when loading the page in a web browser. I attempted to open it in an incognito browser window to ensure it was not a cache issue. Here is one example of a change on my local machine:

[screenshot: local machine, 2017-03-29 at 6:46:32 am]

And not showing up in integration:

[screenshot: integration environment, 2017-03-29 at 6:47:39 am]
MikeTheCanuck commented 7 years ago

Hi @BrianHGrant - looks to me like that build has deployed to the ECS service target. I can't explain why the Implementation Notes aren't visible.

Evidence for successful deploy

Here's the ecs-deploy.sh log from that build:

Running ecs-deploy.sh script...
Using image name: 845828040396.dkr.ecr.us-west-2.amazonaws.com/integration/emergency-service:latest
Current task definition: arn:aws:ecs:us-west-2:845828040396:task-definition/emergency-service:24
New task definition: arn:aws:ecs:us-west-2:845828040396:task-definition/emergency-service:25
Service updated successfully, new task definition running.

(The task definition "version" tag updated from 24 to 25)

And here's the current state of the service according to the AWS console:

[screenshot: ECS service state in the AWS console]

(The "25" version of that image is RUNNING for the pair of redundant containers)

Is some code making it through?

What's weird to me - it appears that new parameters have been deployed since you last took a screenshot of the AWS-deployed endpoint:

[screenshot: AWS-deployed endpoint]

(I'm seeing "lat" and "long" that weren't there in your screenshot - were they there for you as well? If not, this implies that some code is making it through the deploy pipeline, even if not the Implementation Notes for that endpoint.)

Which code did the build use?

On line 252 of that build, we see $ git checkout -qf 19a62c56cfd6f7a09792eb2b45f8184fec383d74.

Looking at that commit, I definitely see the docstring you added in data/urls.py that matches the text of the Implementation Notes.

Puzzling.

MikeTheCanuck commented 7 years ago

The only theory I have left is this: the "25" containers never quite get to a state that ALB considers "healthy", and so they're never serving traffic in response to requests from you or me.

Instead, the "13" container/task is there chugging away all along, receiving all the traffic from ALB, and the "25" containers start, run for a few minutes, are deemed "unhealthy" and another pair are fired up from the "25" image, then the "unhealthy" pair are terminated.

That's what the ECS service Event Log is implying...

0ffc877e-ea29-4664-aac7-9e3b835a5b42 2017-03-29 12:26:18 -0700 service hacko-integration-EmerreponseService-1LC4181KR6KN5-Service-1WR6VWC6KKIEP has started 1 tasks: task 46e7869d-fcf7-4841-bf3a-2052c186515f.
ada5de70-56a9-4502-bbde-2e3b5eac43c9 2017-03-29 12:25:53 -0700 service hacko-integration-EmerreponseService-1LC4181KR6KN5-Service-1WR6VWC6KKIEP has started 1 tasks: task 1eac303a-25b8-45fa-bdef-48a9e2bfffaf.
1ac27ae6-10da-47ae-8208-c9c8c5554f55 2017-03-29 12:25:41 -0700 service hacko-integration-EmerreponseService-1LC4181KR6KN5-Service-1WR6VWC6KKIEP deregistered 1 targets in target-group hacko-Targe-1V7HIUSN1UML6
10cbb7b3-5133-4df6-bc0d-40b75ab06be8 2017-03-29 12:25:41 -0700 service hacko-integration-EmerreponseService-1LC4181KR6KN5-Service-1WR6VWC6KKIEP has stopped 2 running tasks: task 45813b9d-d3e6-4130-841b-ca6cbea36617 task 1d40e1e3-ad4f-4896-9267-622044ad3de1.
94bc4dab-aab6-4b1f-a299-166902dec702 2017-03-29 12:25:41 -0700 service hacko-integration-EmerreponseService-1LC4181KR6KN5-Service-1WR6VWC6KKIEP deregistered 2 targets in target-group hacko-Targe-1V7HIUSN1UML6
a1de0e25-3a4f-47a5-8600-13b99b6ea5e0 2017-03-29 12:25:41 -0700 service hacko-integration-EmerreponseService-1LC4181KR6KN5-Service-1WR6VWC6KKIEP (instance i-04e4ff1f307addd28) (port 42538) is unhealthy in target-group hacko-Targe-1V7HIUSN1UML6 due to (reason Health checks failed)
877d8742-0df7-4e8e-981f-29499704bb46 2017-03-29 12:25:41 -0700 service hacko-integration-EmerreponseService-1LC4181KR6KN5-Service-1WR6VWC6KKIEP (instance i-04e4ff1f307addd28) (port 42537) is unhealthy in target-group hacko-Targe-1V7HIUSN1UML6 due to (reason Health checks failed)
772ff2a8-826d-45f3-800a-a473b23ce3e0 2017-03-29 12:22:16 -0700 service hacko-integration-EmerreponseService-1LC4181KR6KN5-Service-1WR6VWC6KKIEP registered 2 targets in target-group hacko-Targe-1V7HIUSN1UML6
2a543c9e-2cf0-4531-8d2f-72d18342d01c 2017-03-29 12:21:15 -0700 service hacko-integration-EmerreponseService-1LC4181KR6KN5-Service-1WR6VWC6KKIEP has started 2 tasks: task 1d40e1e3-ad4f-4896-9267-622044ad3de1 task 45813b9d-d3e6-4130-841b-ca6cbea36617.

BrianHGrant commented 7 years ago

@MikeTheCanuck, I did not include the query params in the above screenshot but they were not changed.

I included the Implementation Notes as a single example, but basically none of the changes have been showing up. Another example is the removal of the alarm levels detail endpoint.

On my local machine:

[screenshot: local machine, 2017-03-29 at 12:40:00 pm]

Still showing in Hack Oregon integration:

[screenshot: integration environment, 2017-03-29 at 12:35:24 pm]

The GET /emergency/alarmlevels/{alarmlevel_id}/ endpoint should no longer be in the project.

It is hard to confirm at this point, but I do not believe this started today - basically no updates since the original push last Thursday @ 11pm have appeared on the front end.

MikeTheCanuck commented 7 years ago

That's bad. This is something we will find a way to fix. I have a feeling that some of our other projects are secretly in a similar boat, but we haven't been paying as much attention to the lack of API updates.

I am going to go into some forensic-level detail here to drive to root cause. Apologies in advance.

Builds that appear to have deployed without a problem

This is the tail end of build 43, where Travis appears to have incremented the ECS task definition without error:

Running ecs-deploy.sh script...
Using image name: 845828040396.dkr.ecr.us-west-2.amazonaws.com/integration/emergency-service:latest
Current task definition: arn:aws:ecs:us-west-2:845828040396:task-definition/emergency-service:1
New task definition: arn:aws:ecs:us-west-2:845828040396:task-definition/emergency-service:2
Service updated successfully, new task definition running.

Here are additional builds whose deploy was similarly error free:

Builds that appear to have had a deploy problem

This is the tail end of build 58, where Travis couldn't confirm the newly-deployed container was running:

Running ecs-deploy.sh script...
Using image name: 845828040396.dkr.ecr.us-west-2.amazonaws.com/integration/emergency-service:latest
Current task definition: arn:aws:ecs:us-west-2:845828040396:task-definition/emergency-service:8
New task definition: arn:aws:ecs:us-west-2:845828040396:task-definition/emergency-service:9
ERROR: New task definition not running within 90 seconds

Here are additional builds whose deploy was similarly troubled:

What can we infer from this?

These are weak and unproven theories, but they're worth considering and trying to (dis)prove.

  1. Builds that have a deploy timeout don't always result in an unhealthy container (e.g. build 64 seems to have left behind a stable container in the cluster)
  2. Builds that have no deploy timeout don't always result in a healthy container (e.g. builds 87, 90, 97, 100, 105 don't seem to have supplanted build 64's one stable container)
  3. There seems to be little rhyme or reason to whether a deploy leads to containers that the ALB will consider "healthy" and leave running.
MikeTheCanuck commented 7 years ago

Further evidence that the only "stable" container/task is the "13" one:

I dug into its detail, and ECS reports that it was Created On 2017-03-25 18:16:06 -0700.

It's still mystifying how one of the deploys (task definition updates) that claimed "new task definition not running within 90 seconds" would be the one that stays alive despite all the other thrash among containers.

BrianHGrant commented 7 years ago

Sorry, closed issue in error earlier.

I experimented with renaming the settings dir back to emerresponseAPI. The merge was successful and deployed service:26, however it is still not updating.

Looking at the diff between deploy 13 and 14, I see that I did add back in the extra gunicorn workers and the timeout setting; removing those and trying again.

BrianHGrant commented 7 years ago

No change with these removed.

I also added the sniffer package back in, in case this was somehow being used by the health check. No change.

MikeTheCanuck commented 7 years ago

Hey @BrianHGrant - some good news and some bad news.

Good news

After the latest deploy your container is no longer experiencing [CRITICAL] WORKER TIMEOUT. I just crawled an hour's worth of CloudWatch logs for your service, and every single task that gets fired up only has these log entries:

22:59:13  ##############################
22:59:13  CONFIG SETTINGS
22:59:13  ##############################
22:59:13  PROJ_SETTINGS_DIR emerresponseAPI
22:59:13  DEPLOY_TARGET integration
22:59:13  CONFIG_BUCKET hacko-emerresponse-config
22:59:13  ########################################
22:59:13  USING integration CONFIG
22:59:13  USING THE hacko-emerresponse-config CONFIG BUCKET
22:59:13  ########################################
23:02:08  Completed 704 Bytes/704 Bytes (205 Bytes/s) with 1 file(s) remaining download: s3://hacko-emerresponse-config/integration/project_config.py to emerresponseAPI/project_config.py
23:02:19  #### CONFIG COPY COMPLETE###

In all other unhealthy containers, we would see at least half of the instances report a handful of [CRITICAL] WORKER TIMEOUT errors before getting shut down and replaced.

Bad news

This hasn't made the container healthy enough for AWS to leave it running and bring it into service. The containers on version "29" are still getting replaced and shut down as rapidly as in previous deploys.

Current theory

My current theory is that CloudFormation has kept around the single container with version "13" while it tries to build later-versioned stable containers, and until it has stable replacements, it's leaving the single "good" container in place. It appears to be keeping at least one stable container to satisfy the "minimum healthy percent" of 100%, and will keep firing up container replacements until it can get a better one in place.

I bet if you looked closely (if you haven't already), you'd notice that the code responding from the AWS Integration environment's /emergency/ endpoint has exactly the same functionality as we see in the code corresponding to the "13"th task definition deployed, i.e. build 64, or this commit.

That is, until we get a deployed container image that ALB considers "healthy", you're stuck on the old code because the old code is what's running in the only stable container in the /emergency/ service.
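
One way to confirm this theory from the outside is to ask ECS which task definition revisions its deployments are actually keeping running versus merely starting. A hypothetical boto3 sketch (the cluster name is a guess; the service name is copied from the event log above):

# check_deployments.py - hypothetical sketch for comparing the task definition
# revisions ECS is trying to run against what is actually staying up.
import boto3

ecs = boto3.client("ecs", region_name="us-west-2")

response = ecs.describe_services(
    cluster="hacko-integration",  # assumed cluster name - not confirmed in this thread
    services=["hacko-integration-EmerreponseService-1LC4181KR6KN5-Service-1WR6VWC6KKIEP"],
)

for service in response["services"]:
    for deployment in service["deployments"]:
        # status is PRIMARY for the newest deployment, ACTIVE for older ones
        # still holding running tasks (e.g. the stable "13" revision).
        print(
            deployment["status"],
            deployment["taskDefinition"],
            "running", deployment["runningCount"],
            "of", deployment["desiredCount"],
        )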

MikeTheCanuck commented 7 years ago

Status of container health: still not healthy enough to schedule into service.

Latest Travis build (126) reports a promising sign:

Running ecs-deploy.sh script...
Using image name: 845828040396.dkr.ecr.us-west-2.amazonaws.com/integration/emergency-service:latest
Current task definition: arn:aws:ecs:us-west-2:845828040396:task-definition/emergency-service:31
New task definition: arn:aws:ecs:us-west-2:845828040396:task-definition/emergency-service:32
Service updated successfully, new task definition running.

That is, the "new task definition running" is better than the 90-second timeout we are all seeing across the projects.

However, the ECS Event log for the EmerreponseService is still reporting problems that prevent the ALB from sending real requests to the newly-deployed containers:

service hacko-integration-EmerreponseService-1LC4181KR6KN5-Service-1WR6VWC6KKIEP (instance i-04e4ff1f307addd28) (port 47698) is unhealthy in target-group hacko-Targe-1V7HIUSN1UML6 due to (reason Health checks failed)

Dan advanced a hypothesis yesterday that for most of the projects with fat data in the DB, the Django app startup time is getting bogged down because all the models are pulling in their DB data at startup, and that for some projects (e.g. Budget), the init methods are configured to pull in .all() of the data from the backing table.

I wonder if you've tried to reduce the init load for your app?

Here are the details I logged for this as a Budget issue.

BrianHGrant commented 7 years ago

Continuing to work through this issue: some attempts have been made that did not resolve this particular problem, however they have produced some improvements that will be kept:

  1. As per https://github.com/hackoregon/devops-17/issues/49 - we have switched to async workers using gevent, and the monkey-patch method in the wsgi.py file (see the sketch after this list). Since this change there has been a marked reduction in the critical worker timeouts, and everything I have read says this is the better architecture. We also reduced the number of workers to 3, in line with recommendations and the number of cores.

  2. A makemigrations command was added to the docker-entrypoint.sh file. This is to ensure the Django models are in sync with the db. This has eliminated the "migrations exist that are not reflected in the models" error appearing on startup, both locally and on Travis.

  3. Some other unused modules and imports were eliminated from the project. This will be an avenue to continue - refactoring is good. It is still to be determined whether they were a cause of the startup problems.
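
For reference, the monkey-patch approach in item 1 boils down to a few lines in the WSGI entry point. A minimal sketch, assuming the settings module follows the PROJ_SETTINGS_DIR (emerresponseAPI) shown in the CloudWatch output above:

# wsgi.py - sketch of the gevent monkey-patch approach from item 1.
# The patch must run before Django (or anything touching socket/ssl) is
# imported, otherwise the async workers buy us nothing.
from gevent import monkey
monkey.patch_all()

import os
from django.core.wsgi import get_wsgi_application

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "emerresponseAPI.settings")

application = get_wsgi_application()

Gunicorn is then started with --worker-class gevent --workers 3 so the worker count matches the core-count recommendation.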

New theory: do we need to preload? Workers may be starting up prior to the app, and dying before the app finishes loading. This is based on two readings (a sketch of the preload experiment follows at the end of this comment).

  1. Returning to Gunicorn docs and parameters I noticed:

http://docs.gunicorn.org/en/latest/settings.html#preload-app

Description reads:

"Load application code before the worker processes are forked.

By preloading an application you can save some RAM resources as well as speed up server boot times. Although, if you defer application loading to each worker process, you can reload your application code easily by restarting workers."

Hmm, save RAM and speed up server boot time?

Heroku does a very good job of explaining this a bit more:

https://devcenter.heroku.com/articles/python-gunicorn#advanced-configuration
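
If we decide to try it, the preload experiment is a small change to the gunicorn configuration. A hypothetical sketch (the file name and values are assumptions, not what is deployed today):

# gunicorn.conf.py - hypothetical sketch for testing preload alongside the
# gevent workers described above.
# Start with: gunicorn emerresponseAPI.wsgi -c gunicorn.conf.py
bind = "0.0.0.0:8000"
worker_class = "gevent"   # async workers per devops-17#49
workers = 3               # the reduced worker count
preload_app = True        # load the Django app in the master before forking workers

One thing to verify: with preload_app the monkey patch in wsgi.py runs in the master process before forking, so it is worth confirming gevent still behaves as expected in the worker children.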

MikeTheCanuck commented 7 years ago

Preload is an interesting idea, but the docs don't make it clear what difference this makes to the app's init behaviour. I'd love to see a flow diagram of what exactly differs between the two scenarios. Why does loading app code before forking make enough of a startup difference vs load after fork? Does this make more difference in single-core setting than multi-core? Does this make a non-trivial difference only with > 3 workers, or is it worth 5-10 seconds of reduced startup time even for a 3-worker scenario?

The more interesting possibility that I haven't explored yet is trying to reduce the models load-up that is allegedly occurring by default. Apparently for Budget, the Django app loads our four models at startup, including the History model that has around 61 MB of data in its corresponding table (~80K rows).

Apparently loading this data from the "Internet" DB, then forming it into JSON, then spewing it out to the browser - all that can be very expensive, and if it's a blocking operation that precedes gunicorn's ability to answer a /budget/ request, then perhaps the model load step needs to be refactored heavily.
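
To make the concern concrete, the difference would look roughly like this - purely illustrative, with History standing in for the ~80K-row Budget model rather than quoting the actual code:

# views.py - illustrative sketch only; History is a stand-in model.
from django.http import JsonResponse

from .models import History  # hypothetical app-local import

# Eager: the queryset is evaluated once at import time, while gunicorn is
# still booting the worker - the ALB health check has to wait behind it.
ALL_HISTORY = list(History.objects.values())

def history_list_eager(request):
    return JsonResponse({"results": ALL_HISTORY})

# Lazy: the queryset is evaluated only when a request arrives, so worker
# startup stays fast and the health check can be answered immediately.
def history_list_lazy(request):
    return JsonResponse({"results": list(History.objects.values())})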

BrianHGrant commented 7 years ago

So I implemented a logging script as discussed here: https://github.com/hackoregon/team-budget/issues/96 . When I run the makemigrations and migrate commands I do see a string of SQL queries. These seem to be merely polling datatypes as opposed to actual content, and the query times are in milliseconds:

emergency-service_1  | #### CONFIG COPY COMPLETE###
emergency-service_1  | (0.083) 
emergency-service_1  |             SELECT c.relname, c.relkind
emergency-service_1  |             FROM pg_catalog.pg_class c
emergency-service_1  |             LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
emergency-service_1  |             WHERE c.relkind IN ('r', 'v')
emergency-service_1  |                 AND n.nspname NOT IN ('pg_catalog', 'pg_toast')
emergency-service_1  |                 AND pg_catalog.pg_table_is_visible(c.oid); args=None
emergency-service_1  | (0.024) SELECT "django_migrations"."app", "django_migrations"."name" FROM "django_migrations"; args=()
emergency-service_1  | No changes detected
emergency-service_1  | (0.065) CREATE EXTENSION IF NOT EXISTS postgis; args=None
emergency-service_1  | (0.066) 
emergency-service_1  |             SELECT c.relname, c.relkind
emergency-service_1  |             FROM pg_catalog.pg_class c
emergency-service_1  |             LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
emergency-service_1  |             WHERE c.relkind IN ('r', 'v')
emergency-service_1  |                 AND n.nspname NOT IN ('pg_catalog', 'pg_toast')
emergency-service_1  |                 AND pg_catalog.pg_table_is_visible(c.oid); args=None
emergency-service_1  | (0.026) SELECT "django_migrations"."app", "django_migrations"."name" FROM "django_migrations"; args=()
emergency-service_1  | (0.072) 
emergency-service_1  |             SELECT c.relname, c.relkind
emergency-service_1  |             FROM pg_catalog.pg_class c
emergency-service_1  |             LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
emergency-service_1  |             WHERE c.relkind IN ('r', 'v')
emergency-service_1  |                 AND n.nspname NOT IN ('pg_catalog', 'pg_toast')
emergency-service_1  |                 AND pg_catalog.pg_table_is_visible(c.oid); args=None
emergency-service_1  | (0.024) SELECT "django_migrations"."app", "django_migrations"."name" FROM "django_migrations"; args=()
emergency-service_1  | Operations to perform:
emergency-service_1  |   Apply all migrations: auth, contenttypes, data, sessions
emergency-service_1  | Running migrations:
emergency-service_1  |   No migrations to apply.
emergency-service_1  | (0.061) 
emergency-service_1  |             SELECT c.relname, c.relkind
emergency-service_1  |             FROM pg_catalog.pg_class c
emergency-service_1  |             LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
emergency-service_1  |             WHERE c.relkind IN ('r', 'v')
emergency-service_1  |                 AND n.nspname NOT IN ('pg_catalog', 'pg_toast')
emergency-service_1  |                 AND pg_catalog.pg_table_is_visible(c.oid); args=None
emergency-service_1  | (0.031) SELECT "django_migrations"."app", "django_migrations"."name" FROM "django_migrations"; args=()
emergency-service_1  | (0.023) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'group' AND "django_content_type"."app_label" = 'auth'); args=('group', 'auth')
emergency-service_1  | (0.023) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'user' AND "django_content_type"."app_label" = 'auth'); args=('user', 'auth')
emergency-service_1  | (0.023) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'permission' AND "django_content_type"."app_label" = 'auth'); args=('permission', 'auth')
emergency-service_1  | (0.026) SELECT "auth_permission"."content_type_id", "auth_permission"."codename" FROM "auth_permission" INNER JOIN "django_content_type" ON ("auth_permission"."content_type_id" = "django_content_type"."id") WHERE "auth_permission"."content_type_id" IN (2, 3, 4) ORDER BY "django_content_type"."app_label" ASC, "django_content_type"."model" ASC, "auth_permission"."codename" ASC; args=(2, 3, 4)
emergency-service_1  | (0.025) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE "django_content_type"."app_label" = 'auth'; args=('auth',)
emergency-service_1  | (0.025) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'contenttype' AND "django_content_type"."app_label" = 'contenttypes'); args=('contenttype', 'contenttypes')
emergency-service_1  | (0.032) SELECT "auth_permission"."content_type_id", "auth_permission"."codename" FROM "auth_permission" INNER JOIN "django_content_type" ON ("auth_permission"."content_type_id" = "django_content_type"."id") WHERE "auth_permission"."content_type_id" IN (5) ORDER BY "django_content_type"."app_label" ASC, "django_content_type"."model" ASC, "auth_permission"."codename" ASC; args=(5,)
emergency-service_1  | (0.026) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE "django_content_type"."app_label" = 'contenttypes'; args=('contenttypes',)
emergency-service_1  | (0.023) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'session' AND "django_content_type"."app_label" = 'sessions'); args=('session', 'sessions')
emergency-service_1  | (0.043) SELECT "auth_permission"."content_type_id", "auth_permission"."codename" FROM "auth_permission" INNER JOIN "django_content_type" ON ("auth_permission"."content_type_id" = "django_content_type"."id") WHERE "auth_permission"."content_type_id" IN (6) ORDER BY "django_content_type"."app_label" ASC, "django_content_type"."model" ASC, "auth_permission"."codename" ASC; args=(6,)
emergency-service_1  | (0.023) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE "django_content_type"."app_label" = 'sessions'; args=('sessions',)
emergency-service_1  | (0.026) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'typenaturecode' AND "django_content_type"."app_label" = 'data'); args=('typenaturecode', 'data')
emergency-service_1  | (0.027) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'fmastats' AND "django_content_type"."app_label" = 'data'); args=('fmastats', 'data')
emergency-service_1  | (0.029) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'responderunit' AND "django_content_type"."app_label" = 'data'); args=('responderunit', 'data')
emergency-service_1  | (0.026) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'timedesc' AND "django_content_type"."app_label" = 'data'); args=('timedesc', 'data')
emergency-service_1  | (0.041) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'incsitfoundsub' AND "django_content_type"."app_label" = 'data'); args=('incsitfoundsub', 'data')
emergency-service_1  | (0.034) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'agency' AND "django_content_type"."app_label" = 'data'); args=('agency', 'data')
emergency-service_1  | (0.025) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'station' AND "django_content_type"."app_label" = 'data'); args=('station', 'data')
emergency-service_1  | (0.022) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'fma' AND "django_content_type"."app_label" = 'data'); args=('fma', 'data')
emergency-service_1  | (0.039) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'responder' AND "django_content_type"."app_label" = 'data'); args=('responder', 'data')
emergency-service_1  | (0.024) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'mutualaid' AND "django_content_type"."app_label" = 'data'); args=('mutualaid', 'data')
emergency-service_1  | (0.024) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'alarmlevel' AND "django_content_type"."app_label" = 'data'); args=('alarmlevel', 'data')
emergency-service_1  | (0.031) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'situationfound' AND "django_content_type"."app_label" = 'data'); args=('situationfound', 'data')
emergency-service_1  | (0.023) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'incsitfound' AND "django_content_type"."app_label" = 'data'); args=('incsitfound', 'data')
emergency-service_1  | (0.022) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'incident' AND "django_content_type"."app_label" = 'data'); args=('incident', 'data')
emergency-service_1  | (0.022) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'incidenttimes' AND "django_content_type"."app_label" = 'data'); args=('incidenttimes', 'data')
emergency-service_1  | (0.032) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'fireblock' AND "django_content_type"."app_label" = 'data'); args=('fireblock', 'data')
emergency-service_1  | (0.027) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE ("django_content_type"."model" = 'incsitfoundclass' AND "django_content_type"."app_label" = 'data'); args=('incsitfoundclass', 'data')
emergency-service_1  | (0.070) SELECT "auth_permission"."content_type_id", "auth_permission"."codename" FROM "auth_permission" INNER JOIN "django_content_type" ON ("auth_permission"."content_type_id" = "django_content_type"."id") WHERE "auth_permission"."content_type_id" IN (7, 40, 8, 9, 41, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 24) ORDER BY "django_content_type"."app_label" ASC, "django_content_type"."model" ASC, "auth_permission"."codename" ASC; args=(7, 40, 8, 9, 41, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 24)
emergency-service_1  | (0.062) SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE "django_content_type"."app_label" = 'data'; args=('data',)
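
For anyone reproducing this, the (0.083)-style query timings above come from Django's SQL logger; a settings sketch along these lines produces them (this is the general approach from team-budget#96, not necessarily the exact script):

# settings.py excerpt - sketch of SQL query logging. django.db.backends only
# emits queries at DEBUG level, and only while settings.DEBUG is True.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        "django.db.backends": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
    },
}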

The idea of creating JSON keeps making me think this could be related to "schema generation" - the effective mapping of the API structure that is served to Swagger or other front-end/browsable views of the API. It's still a concept I am wrapping my head around, as is how the schema fits into the application more broadly. Behind the scenes I believe it is CoreAPI that is doing the heavy lifting:

http://www.coreapi.org/

BrianHGrant commented 7 years ago

Another note: I noticed most projects, mine included, did not include the ALB address 'hacko-integration-658279555.us-west-2.elb.amazonaws.com' in allowed hosts. While adding it did not solve the problem, I think it might be necessary - it was included in the last successful deploy of the Emergency Response container (lucky number 13).
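
For clarity, the change amounts to one extra entry in the Django settings (a sketch; only the ALB hostname comes from the integration environment, the rest of the list is assumed):

# settings.py excerpt - sketch of including the integration ALB in ALLOWED_HOSTS.
# Requests whose Host header is not listed here get a 400 from Django, which is
# enough to fail a health check that expects a 200.
ALLOWED_HOSTS = [
    "localhost",
    "127.0.0.1",
    "hacko-integration-658279555.us-west-2.elb.amazonaws.com",  # integration ALB
]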

I added the verbose flag to the ecs-deploy call in docker-push.sh to get a read on the startup process in Travis. It looks like it takes about 190 seconds for the service to start up and return "RUNNING", but it did not produce much else that was useful:

https://travis-ci.org/hackoregon/emergency-response-backend/builds/217838674

I also implemented the New Relic APM module in my local deploy, but it doesn't seem to catch the startup tasks.

Stepping away for the day.

MikeTheCanuck commented 7 years ago

I tried the same --verbose and --timeout 300 settings in the ecs-deploy.sh script run, and I got the same "service running" result at the end of the Travis log:

+echo 'Service updated successfully, new task definition running.'
Service updated successfully, new task definition running.
+[[ 0 -gt 0 ]]
+exit 0

And yet I had no luck actually making it past the ALB Health Check for those deployed container instances:

service hacko-integration-BudgetService-16MVULLFXXIDZ-Service-1BKKDDHBU8RU4 (instance i-04e4ff1f307addd28) (port 49740) is unhealthy in target-group hacko-Targe-OPDM70EA36WQ due to (reason Request timed out)

And looking at the associated CloudWatch logs, this is the output we get from inside the container instance:

Running docker-entrypoint.sh...
[2017-04-03 00:08:02 +0000] [5] [INFO] Starting gunicorn 19.7.1
[2017-04-03 00:08:02 +0000] [5] [INFO] Listening at: http://0.0.0.0:8000 (5)
[2017-04-03 00:08:02 +0000] [5] [INFO] Using worker: sync
[2017-04-03 00:08:02 +0000] [8] [INFO] Booting worker with pid: 8
[2017-04-03 00:08:32 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:8)
[2017-04-03 00:08:32 +0000] [8] [INFO] Worker exiting (pid: 8)
[2017-04-03 00:08:35 +0000] [10] [INFO] Booting worker with pid: 10
[2017-04-03 00:09:05 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:10)
[2017-04-03 00:09:05 +0000] [10] [INFO] Worker exiting (pid: 10)
[2017-04-03 00:09:06 +0000] [12] [INFO] Booting worker with pid: 12
[2017-04-03 00:09:37 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:12)
[2017-04-03 00:09:40 +0000] [14] [INFO] Booting worker with pid: 14
[2017-04-03 00:10:10 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:14)
[2017-04-03 00:10:10 +0000] [14] [INFO] Worker exiting (pid: 14)
[2017-04-03 00:10:11 +0000] [16] [INFO] Booting worker with pid: 16
[2017-04-03 00:10:42 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:16)
[2017-04-03 00:10:42 +0000] [16] [INFO] Worker exiting (pid: 16)
[2017-04-03 00:10:43 +0000] [18] [INFO] Booting worker with pid: 18
BrianHGrant commented 7 years ago

We are up, with new code deploying and a validated schema. It was the same error as the Budget team: lack of the ELB address in allowed hosts.

MikeTheCanuck commented 7 years ago

Nicely done Brian! I am 80% sure Budget has a stable deploy now too. The deploy landed and started last night, but it was experiencing ALLOWED_HOSTS-related timeouts from the ALB, which meant the containers weren't passing the health check.

But somehow that got worked out on its own by the time I awoke today, and last I checked it was still operational (though the underlying app needs some heavy optimisation).

BrianHGrant commented 7 years ago

Thanks, glad to be moving forward on the projects. These links should explain the white page that appears instead of the Swagger frontend when Debug=True (what's being hosted at http://hacko-integration-658279555.us-west-2.elb.amazonaws.com/budget/):

https://docs.djangoproject.com/en/1.10/howto/static-files/

https://devcenter.heroku.com/articles/django-assets

http://whitenoise.evans.io/en/stable/
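
A minimal WhiteNoise setup for serving those static assets (the Swagger UI CSS/JS) would look roughly like this - a sketch based on the whitenoise docs, not what is currently committed:

# settings.py excerpt - sketch of serving static files through WhiteNoise so the
# Swagger UI assets load under gunicorn without relying on DEBUG static serving.
import os

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

MIDDLEWARE = [
    "django.middleware.security.SecurityMiddleware",
    "whitenoise.middleware.WhiteNoiseMiddleware",  # directly after SecurityMiddleware
    # ...the rest of the existing middleware...
]

STATIC_URL = "/static/"
STATIC_ROOT = os.path.join(BASE_DIR, "staticfiles")
STATICFILES_STORAGE = "whitenoise.storage.CompressedManifestStaticFilesStorage"

This also needs a "python manage.py collectstatic --noinput" step in the Docker build or entrypoint so the collected assets exist inside the image.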