[Spike] As a Tech Lead, I want to get alerts when there is a backend or frontend error that affects an STT user

alexsoble commented 3 years ago

Notes

Research:

Should likely have sub-tickets for frontend and backend errors
Security Control AU-02 relates to logging errors, this relates to alerting
Also relates to our DevOps practice
This will become more relevant as we move to a wider userbase, as opposed to closely-observed user testing

alexsoble commented 3 years ago

Unlikely to be needed for ATO, possibly an issue for Tribal MVP

amilash commented 3 years ago

@abottoms-coder What kind of errors would you need for user submissions in V1, V2, and/or V3? We're not sure where to put this. It needs scoping.

andrew-jameson commented 3 years ago

@amilash Indeed, this needs scoping. At the present moment, I don't think we can accurately capture which errors/alerts are relevant given that we probably haven't even written the code that would error out.

Above and beyond specifics, I don't believe we have an alerting system in place. We'd need to implement an e-mail relay within our buildpack setups to send these alerts from something like system@tdp.cloud.gov. Ideally, I think the e-mail system should e-mail anyone with role sysadmin. Beyond these minor thoughts, I think we'd need a brainstorming session or two to come up with the overall paradigm, much less go error-by-error.

For v1, I jotted down some thoughts as follows:

file not uploaded correctly (either actual upload failure or backend linking failure)
user logged in without associated STT
user account locked out due to repeated login failures (does login.gov even do this?)
any uncaught exceptions/stacktraces
any logging line with syslog level of ERROR and FATAL (we can scale this back as time goes on if it becomes spammy)
- If it becomes spammy, we should probably resolve or downgrade such error messages first
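
A minimal sketch of the ERROR/FATAL idea above, assuming Python's standard `logging` module. `AlertHandler`, `build_logger`, the logger name, and the `notify` callback are hypothetical names for illustration; a real setup would replace `notify` with whatever actually sends the e-mail to sysadmins.

```python
import logging


class AlertHandler(logging.Handler):
    """Forward records at ERROR or above to a notify callback (hypothetical)."""

    def __init__(self, notify):
        # level=ERROR means DEBUG/INFO/WARNING records never reach emit().
        super().__init__(level=logging.ERROR)
        self.notify = notify

    def emit(self, record):
        try:
            self.notify(record.levelname, self.format(record))
        except Exception:
            # A failing alert should not break the code that logged the error.
            self.handleError(record)


def build_logger(notify, name="tdp.alerts.example"):
    """Return a logger whose only handler is the alerting one."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    logger.addHandler(AlertHandler(notify))
    return logger


# Stand-in for the e-mail relay: collect alerts in a list.
sent = []
log = build_logger(lambda level, message: sent.append((level, message)))

log.info("file uploaded")                      # below ERROR: no alert
log.error("file upload failed")                # alert
log.critical("user logged in without an STT")  # alert; FATAL is an alias of CRITICAL
```

Filtering by level in the handler keeps the "scale this back later" option cheap: raising the handler's level (or downgrading individual messages) changes what alerts without touching the call sites.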

amilash commented 3 years ago

I'm going to slate this for V3 since that's when we will have users.

ADPennington commented 3 years ago

re: users --

users currently receive a popup when they attempt to upload a file that doesn't have the correct format, but no alert when they attempt to upload a blank file or a file that does not pass security inspection. the submit button doesn't work in these scenarios, which seems relevant to tribal mvp.

amilash commented 3 years ago

I think this can be considered as part of the larger epic we are planning for, which is "TDP Automated Communications/Notifications". I'll link it to our board.

andrew-jameson commented 2 years ago

Solutions:

Sentry
Logstash -- integrated often with Elasticsearch
Prometheus
Kibana -- already in use
New Relic
Datadog

Please note we have a Kibana dashboard already; it just might need configuration we can't do: https://logs.fr.cloud.gov/

jtimpe commented 8 months ago

https://cloud.gov/docs/ops/repos/#repositories

In this document, Cloud.gov lists 'New Relic' as one of the supported 'BOSH releases'

https://github.com/cloudfoundry-community/newrelic-boshrelease

'Monitoring' is also listed under 'Deployment pipelines', which links to this Prometheus deployment

https://github.com/cloud-gov/cg-deploy-prometheus

andrew-jameson commented 8 months ago

Per @stevenino, we should just home in on the PLG stack and pivot if issues arise.

jtimpe commented 8 months ago

from cloud.gov re: monitoring service instances

Currently cloud.gov customers do not have direct access to the logs for their service instances (RDS, ElasticSearch, etc) however we understand this is a requested customer feature that is on our roadmap. The current route for customers to obtain access to their service instance logs, if logs are enabled for that specific service instance (in this case your elasticsearch instance), is to send a request to support@cloud.gov for the logs for your specific service instance.

I asked for clarification on this point

Also, to clarify, “currently, cloud.gov customers do not have direct access to the logs for their service instances” – would this include if we configured a monitoring service, like Prometheus, inside the deployment space?

Response:

Currently customer service instance logs (RDS, Elasticsearch, etc) are not exposed to customers or the customer deployment space, as such any monitoring service would not have access to your service instance logs.

seems we would be limited to only application logs.

raftmsohani commented 6 months ago

DD docker example for python web app: https://github.com/DataDog/docker-compose-example
Sentry starter: https://docs.sentry.io/product/relay/getting-started/#sending-a-test-event
Logstash: https://hub.docker.com/_/logstash

robgendron commented 6 months ago

Nearing completion - will provide team with documentation and table top to showcase discovery (5/29).

robgendron commented 6 months ago

Waiting on Datadog for presentation.

raftmsohani commented 6 months ago

Sentry self-hosted requirements are listed here: https://develop.sentry.dev/self-hosted/

It mentions:

2 CPU cores, 4 GB RAM

raftmsohani commented 6 months ago

For Prometheus, I used this installation manual: https://github.com/korfuri/django-prometheus?tab=readme-ov-file. We had to install it locally. One difference from Logstash and Sentry: Prometheus pulls the data from the server, instead of the server pushing data to Prometheus. This might need more attention on the security side, since we will have to open up a port on the monitored app for Prometheus to be able to reach the endpoint it scrapes.
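
To illustrate the pull model described above, here is a rough sketch using only the Python standard library. It is not how django-prometheus is implemented; `COUNTS`, `render_metrics`, `MetricsHandler`, and `scrape` are hypothetical names, and the counter is made up. The app exposes a `/metrics` endpoint, and the monitoring side fetches it with a plain HTTP GET.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter; a real exporter maintains these for you.
COUNTS = {"upload_errors_total": 0}


def render_metrics():
    """Render the counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(COUNTS.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """The monitored app's side: answer GET /metrics, nothing else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def scrape(port):
    """The monitoring side: what a scrape amounts to, an HTTP GET."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/metrics")
        return conn.getresponse().read().decode("utf-8")
    finally:
        conn.close()


# Port 0 asks the OS for any free port; the server binds to localhost only.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

COUNTS["upload_errors_total"] += 1  # the app records an error
scraped = scrape(server.server_address[1])

server.shutdown()
server.server_close()
```

The security concern follows from the shape of this: the app has to listen on something the scraper can reach, so that listener should be restricted to the scraper (here it is bound to localhost) rather than exposed publicly.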

andrew-jameson commented 6 months ago

Will wrap up next week w/ Datadog demo Tuesday. Mo also has done great work on multiple proofs of concept. With all these in, we will discuss the path forward as a team, with pros/cons, etc., during office hours or a one-off meeting.

robgendron commented 5 months ago

Datadog meeting is now Thursday.

raftmsohani commented 5 months ago

For comparison see this

robgendron commented 5 months ago

Work is complete, need to decide course of action for the future.

raftmsohani commented 5 months ago

A nice video explaining Sentry capabilities: https://youtu.be/4djseRVSan8?si=KlElkQQN_7zwoaEj

robgendron commented 5 months ago

Deemed closed; spin-off tickets are being generated.

raft-tech / TANF-app

[Spike] As a Tech Lead, I want to get alerts when there is a backend or frontend error that affects an STT user #831
