hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform
MIT License

Application monitoring needed to catch when APIs are down (low cost, alerting) #213

Open MikeTheCanuck opened 6 years ago

MikeTheCanuck commented 6 years ago

Turns out that we took out the Housing-Affordability APIs during an infrastructure change. Shame on me for missing that - it was a subtle mistake, and after so many changes that had no destructive impact, I just didn't thoroughly evaluate the API health.

This is normal, and there's no way any distributed system like ours should have to rely on humans to remember to validate every piece of the stack every time a change rolls out.

We need to find a low- or no-cost monitoring solution for production assets that lets us achieve the following:

nam20485 commented 6 years ago

Just off the top of my head I can see at least three options:

1.) Pay $ to set up and run AWS's own built-in monitoring and alert notification functionality. ($ being the primary barrier here, obviously.)

2.) Provision another ECS instance and install/administer our own services to do monitoring/notification.

3.) Paid commercial offerings like New Relic or similar.

Actually it looks like most options have $ as an obstacle so it might be more of a decision about which is most cost-effective.

Alerting @znmeb because I believe he has professional experience in this area.

On Thu, Jul 19, 2018, 9:01 AM Mike Lonergan wrote:

We need to find a low- or no-cost monitoring solution of production assets that lets us achieve the following:

  • validate the health of each endpoint similarly to our smoke tests - e.g. do we get a 200 from each container (aka is the web server running)? Do we receive compliant JSON from each endpoint (aka is the Django app answering with something it got from the database)?
  • canary queries - e.g. is there a specific query for each endpoint that will remain mostly stable, and will demonstrate that the database is returning expected data?
  • validate the health of the React apps - e.g. do we get a 200 from each React app? Do we receive a reasonable "HTML response" (or some other lightweight way to show the React app is sending valid data to the requesting browser)?
  • validate the database listener - do we get a response on 5432? Is there a way to show that each database is up and responding (without having to hard-code creds in our testing harness)?
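A rough sketch of what the first and last checks in that list might look like, using only the Python standard library. The function names and the `opener` parameter are made up for illustration, and a canary query would still need a known-stable URL and expected value per endpoint, which this does not attempt:

```python
import json
import socket
import urllib.error
import urllib.request


def check_json_endpoint(url, timeout=10, opener=urllib.request.urlopen):
    """Return (ok, detail); ok is True only for a 200 response with parseable JSON."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
            body = response.read()
    except urllib.error.HTTPError as exc:
        # urlopen raises for 4xx/5xx responses, so report the status code.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        return False, f"request failed: {exc}"
    if status != 200:
        return False, f"HTTP {status}"
    try:
        json.loads(body)
    except ValueError:
        return False, "body is not valid JSON"
    return True, "ok"


def check_tcp_port(host, port=5432, timeout=5):
    """Return True if a TCP connection to host:port succeeds (no credentials involved)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

The port check only shows that something is listening on 5432; it says nothing about whether each database is answering queries, so that part of the question stays open.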


bhgrant8 commented 6 years ago

Similar thoughts to what I expressed in this issue: https://github.com/hackoregon/civic-devops/issues/211 with regard to creating an acceptable testing framework.

As mentioned previously, I believe this is a very good use case for Lambda functions, which could be triggered manually as well as attached to a CloudWatch scheduled event to run automatically, say once a day or at whatever acceptable level of service we define.

The system works as follows:

  1. Create a Lambda function which makes simple Python requests to the services/websites. We should be able to check the responses of API requests as well as website requests.

  2. Deploy the Lambdas into our VPC and hook them up to our existing NAT/networking infrastructure to allow them to make outgoing requests and receive responses.

  3. Lambda supports KMS encryption of environment variables for any secure settings that need to be passed in for database testing.

  4. Based on the status code/response of the requests and the alerting parameters we define, the Lambda can then pass an alerting payload on to the alerting system we choose.
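To make steps 1 and 4 a bit more concrete, here is a minimal sketch of such a handler. It is not the system I run at work: the `MONITOR_URLS` environment variable, the `probe_fn` parameter and the payload shape are all assumptions for illustration, and the hand-off to the alerting system is left out.

```python
import os
import urllib.error
import urllib.request


def probe(url, timeout=10):
    """Return the HTTP status code for url, or None if the request failed outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except (urllib.error.URLError, OSError):
        return None


def build_alert(results):
    """Turn {url: status} into an alert payload, or None when everything returned 200."""
    failures = {url: status for url, status in results.items() if status != 200}
    if not failures:
        return None
    lines = [
        f"{url}: {status if status is not None else 'no response'}"
        for url, status in sorted(failures.items())
    ]
    return {"subject": f"{len(failures)} endpoint(s) failing", "message": "\n".join(lines)}


def lambda_handler(event, context, probe_fn=probe):
    """Probe every URL listed in MONITOR_URLS (hypothetical, comma-separated)."""
    raw = os.environ.get("MONITOR_URLS", "")
    urls = [u.strip() for u in raw.split(",") if u.strip()]
    results = {url: probe_fn(url) for url in urls}
    alert = build_alert(results)
    # Passing `alert` to whichever alerting system we choose would happen here.
    return {"checked": len(results), "alert": alert}
```

Keeping `build_alert` separate from the network call means the alerting parameters can be tested without hitting any real endpoint.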

I have a similar system built at my job for monitoring our services. These Lambdas run every 1-5 minutes, 24 hours a day, so quite a bit more than we would need to run on our system, but current costs are:

  • $0.12 a month for Lambda requests over the free limit
  • $1.90 a month for KMS requests over the free limit
  • $9.15 a month for an ECS instance to host a manually configured NAT (we would not need this if we can connect to our existing infrastructure)

So essentially, we would be looking at potentially free to a few bucks a month for the monitoring portion.
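For anyone who wants the arithmetic spelled out, the figures quoted above add up as follows (the dictionary keys are just labels I made up):

```python
from decimal import Decimal

# Monthly figures quoted above, in USD.
costs = {
    "lambda_requests": Decimal("0.12"),
    "kms_requests": Decimal("1.90"),
    "nat_instance": Decimal("9.15"),
}

total_with_nat = sum(costs.values())
total_without_nat = total_with_nat - costs["nat_instance"]

print(f"with a dedicated NAT instance: ${total_with_nat}/month")   # $11.17/month
print(f"reusing existing networking:   ${total_without_nat}/month")  # $2.02/month
```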

In terms of alerting, two free/low-cost options:

  1. Maybe we set up a separate Slack instance for deploy alerts, so as not to affect integration/messaging limits on the main Slack?

  2. Set up some other notification service, such as Amazon SNS, or possibly a free account with my company (under 1000 users is free, and would have full access to the services we would need).
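As an illustration of the Slack option, a sketch of posting an alert to a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names and the `opener` parameter are hypothetical, and the webhook URL would have to come from whichever Slack workspace we end up using:

```python
import json
import urllib.request


def build_slack_request(webhook_url, subject, message):
    """Build the POST request for a Slack incoming webhook."""
    payload = {"text": f"*{subject}*\n{message}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(webhook_url, subject, message, opener=urllib.request.urlopen):
    """Send the alert; returns True if the webhook answered with HTTP 200."""
    request = build_slack_request(webhook_url, subject, message)
    with opener(request, timeout=10) as response:
        return response.status == 200
```

An SNS version would presumably swap the HTTP call for a publish to a topic, with the same subject/message split.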

Just some thoughts to get ideas going. Here are a few blog posts where the authors have used Node to accomplish a similar thing:

https://hackernoon.com/creating-a-website-monitoring-service-in-half-an-hour-using-lambdas-4f64fb199df3

http://marcelog.github.io/articles/aws_lambda_check_website_http_online.html

bhgrant8 commented 6 years ago

@nam20485's comments also get to the idea of real-time monitoring of our system vs. having a certain level of regression tests/ping-type monitoring, somewhat related to https://github.com/hackoregon/civic-devops/issues/196.

If we look at option 2, which has the highest human capital cost but potentially low monetary cost (though probably at least another EC2 instance + traffic costs):

Some open source options, probably used in conjunction:

https://grafana.com/

https://prometheus.io/

And then cAdvisor is useful for looking at the container level:

https://github.com/google/cadvisor

Without an SLA on uptime, we should at least understand our own thoughts on acceptance rates/downtime vs. development time + human + monetary cost.

Alert fatigue is hard enough when you get paid.

bhgrant8 commented 6 years ago

Going along with the Prometheus/Grafana option, I found the experience in this blog post fairly enlightening:

https://runnable.com/blog/how-we-saved-98-on-infrastructure-monitoring-costs

MikeTheCanuck commented 6 years ago

Thanks for this research and these recommendations, @bhgrant8 - it's helpful to know what "low-cost" and "self-supported" options look like these days.

Without deeply investigating/testing out each of these ideas, my instincts tell me that, for a setup that barely has part-time coverage during the active parts of the season, building and maintaining another software platform (one used to monitor the health and performance of the other software platforms) is not a great trade-off.

Given the completely volunteer and unreliable nature of our devops contributors (myself included), we need a solution that's hands-free (costs next to no time or cognitive burden to build, configure, evolve, support). We have a hard enough time remembering to factor in all the pieces of our infrastructure as it is when changes are made; another piece of software that we have to keep alive by our own hands (not something run by others) is hard to swallow. It's not that it isn't fun to try out new things - but having to "own" another thing from the OS layer up is a measurable burden.

Agree with you on the alert fatigue - I'm an OCD bastard with my email notifications, and I still can't keep up with what pours in every day. (Clearly - took me a few days to get back to this hot topic.)

Agree with you @nam20485 on the cost burden - but without even access to anything resembling a budget (let alone insight into what the organization's monthly payments are like right now), I do not in any way feel empowered to add more financial burden to the organization. Nor do I have any expectation that I'll know more on this front any time soon.

bhgrant8 commented 6 years ago

@MikeTheCanuck, @nam20485, @DingoEatingFuzz, so while I am on board with the idea that we should minimize cost as well as work done by a few isolated folks in the org, I also don't think putting off the idea of monitoring as well as testing is necessarily the best plan, especially with a few months of off-season ahead, as well as some potential commits from Hacktoberfest. Let's be proactive, pay down some of the technical debt, and stop troubleshooting, while 100+ folks are trying to deploy code, issues that should have been figured out beforehand.

Maybe instead of starting with the tech side, which I am all too ready to do, we ask ourselves what, from a user perspective, we should be monitoring, and build out tooling from that point. For example: "as a user, I would expect I could pull data at least once in a 24-hour period" vs. "I need realtime access to all the datas". Then build out some service level objectives for our organization.
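One possible reading of the "at least once in a 24 hour period" objective, as a sketch: no stretch longer than 24 hours should pass without a successful check. This assumes we would be recording timestamps of successful checks somewhere; the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def meets_daily_objective(successes, window_start, window_end, max_gap=timedelta(hours=24)):
    """True if no stretch longer than max_gap passed without a successful check."""
    in_window = sorted(t for t in successes if window_start <= t <= window_end)
    # Treat the window edges as boundaries so gaps at the start and end count too.
    points = [window_start] + in_window + [window_end]
    return all(later - earlier <= max_gap for earlier, later in zip(points, points[1:]))


if __name__ == "__main__":
    start, end = datetime(2018, 7, 1), datetime(2018, 7, 4)
    checks = [datetime(2018, 7, 1, 12), datetime(2018, 7, 2, 10), datetime(2018, 7, 3, 9)]
    print(meets_daily_objective(checks, start, end))
```

A stricter objective (say, hourly) would only change `max_gap`, which is the point of agreeing on the objective before picking tooling.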

Additionally, creating some processes/objectives around transparency and decision making as an organization may be a good thing. I think many of us in the organization share the feeling in your last few paragraphs; having some processes around change and review would benefit us all.

Frankly, I'd be more willing to start building solutions vs. opening GitHub issues if we had some of this in place, and I'm more than willing to help build this process.