ministryofjustice / analytical-platform

Analytical Platform • This repository is defined and managed in Terraform
https://docs.analytical-platform.service.justice.gov.uk
MIT License

SPIKE - Investigation on Mapping all logging to Grafana Dashboard #2034

Closed PriyaBasker23 closed 10 months ago

PriyaBasker23 commented 1 year ago

User Story

As a data producer / controller, I want to observe the entire data flow, from registration through to the point where the data becomes accessible to users.

Value / Purpose

Every component in the data-as-a-product pipeline writes logs to CloudWatch, and these logs are connected to Grafana. Using these logs, we can configure a dashboard that gives us a visual representation of the data flow.

Hypothesis

For logging and monitoring (MVP), we will use CloudWatch Logs and Grafana.

Checklist

Definition of Done

MatMoore commented 1 year ago

Reference for grafana visualisations: https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/

Reference for cloudwatch data source: https://grafana.com/docs/grafana/latest/datasources/aws-cloudwatch/

Cloudwatch query syntax reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html

I'm imagining something like this:

[Image: mockup of the proposed dashboard]

We need to discuss how the CloudWatch Logs query would work before we can move it into Grafana. As far as I can see, we don't have the data product as a field we can filter on at the moment, since we're logging unstructured data.

This is the list of fields CloudWatch Logs Insights discovers at the moment: [Image: discovered fields]

This guide recommends using structlog to output structured data instead: https://docs.aws.amazon.com/lambda/latest/operatorguide/parse-logs.html
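As a rough sketch of what that could look like (the handler, event name, and field names below are made up for illustration, not what our lambdas currently emit), structlog can render each log line as JSON so that Logs Insights discovers the fields automatically:

import structlog

structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),  # one JSON object per log line
    ]
)

logger = structlog.get_logger()

def handler(event, context):
    # "data_product" is a hypothetical field name; once it is logged as JSON,
    # Insights would let us filter on it, e.g.
    #   filter data_product = "hmpps_use_of_force"
    logger.info("object_received", data_product="hmpps_use_of_force", zone="landing")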

I'll do some more experimenting in AWS console tomorrow.

MatMoore commented 1 year ago

Corresponding apps-and-tools ticket: https://github.com/ministryofjustice/data-platform/issues/1888

MatMoore commented 1 year ago

Parked this ticket until the second half of the sprint, by which point we hope to have a Grafana instance to play with.

Some other notes for later:

MatMoore commented 1 year ago

We want to be able to see:

Could we show multiple datasets as series on the same timeline?

Is it possible to click a point in the graph and drill down to logs for that dataset/lambda?

Events we could potentially visualise per ingestion:

We can start with the S3 events, perhaps.

MatMoore commented 1 year ago

https://grafana.com/grafana/plugins/grafana-athena-datasource/#:~:text=and%20variables.-,Annotations,-Annotations%20allow%20you

We might be able to show PutObject events as annotations on an empty time series.

MatMoore commented 1 year ago

Still blocked on this until https://github.com/ministryofjustice/data-platform/issues/2145 is done, but in the meantime I'm having a look at the Grafana docs and Athena queries.

I think the state timeline may work for us. The state would be something like {landing, raw, curated}, and there would be a series for each data product version uploaded in the time range.

I need to try this with actual data though.
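As a starting point, something like the following Athena query might produce those series. The cloudtrail_logs table, column names, and output location are all assumptions about how the trail is mapped into Athena, and boto3 is only used here to make the sketch self-contained (in Grafana, the SQL would go straight into the Athena data source panel):

import boto3

# Derive the zone (landing/raw/curated) and the data product name from the
# S3 key recorded in each PutObject event.
QUERY = """
SELECT eventtime,
       split_part(json_extract_scalar(requestparameters, '$.key'), '/', 1) AS zone,
       split_part(json_extract_scalar(requestparameters, '$.key'), '/', 2) AS data_product
FROM cloudtrail_logs
WHERE eventname = 'PutObject'
ORDER BY eventtime
"""

athena = boto3.client("athena", region_name="eu-west-2")
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(execution["QueryExecutionId"])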

MatMoore commented 1 year ago

For linking visualisations together, look into data links: https://grafana.com/docs/grafana/latest/panels-visualizations/configure-data-links/#data-links

MatMoore commented 1 year ago

We don't have detailed CloudWatch metrics enabled for API Gateway, but this would probably be very useful for understanding API usage.
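For reference, detailed metrics are a per-stage method setting. In our case the change belongs in Terraform, but as an illustration, the equivalent call via boto3 (the REST API ID and stage name are placeholders) would be:

import boto3

apigateway = boto3.client("apigateway", region_name="eu-west-2")

# Enable detailed CloudWatch metrics for every method on the stage.
# "abc123" and "prod" are placeholder values.
apigateway.update_stage(
    restApiId="abc123",
    stageName="prod",
    patchOperations=[
        {"op": "replace", "path": "/*/*/metrics/enabled", "value": "true"},
    ],
)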

Some more resources:

MatMoore commented 1 year ago

Summary of findings

Reference information

Best practice information and case studies

Videos for learning Grafana

Example dashboards

Data platform ingestion

The top row shows the flow of data through the buckets based on PutObject CloudTrail events, so you can see how much is getting through.

Some caveats: it seems we are missing events from the raw zone for some reason, and the fail zone has no data because nothing is being written there at the moment. Potentially the whole flow could be visualised as a Sankey chart, but I went with simple numbers to start with.

This won't show how long data takes to flow through the system. To understand that, I think we would need a data source that can correlate requests between the different lambdas, like X-Ray perhaps.

Below that I experimented with API Gateway metrics, but it seems we are not collecting this data, so we would have to enable that in Terraform. The idea is that we can surface any errors and latency issues with the API. Similarly, we could add panels to monitor for lambda executions that exceeded their maximum number of retries (meaning we will have lost data).

Data product view

This demonstrates how we might drill down into specific data products, but the dashboard itself is just showing the PutObject events again. I was thinking we could show some kind of timeline of uploads here, but I haven't come up with a nice way to visualise it yet. I think it would be possible to graph it by grouping logs into 5-minute bins and showing the count for each one; see the sketch below.
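A minimal sketch of that query, assuming the structured event name and log group below (both placeholders), run through boto3 for illustration; in Grafana the query string would go directly into the CloudWatch data source:

import time
import boto3

logs = boto3.client("logs", region_name="eu-west-2")

# Count upload events in 5-minute buckets over the last hour.
QUERY = """
filter event = 'object_received'
| stats count() as uploads by bin(5m)
"""

query = logs.start_query(
    logGroupName="/aws/lambda/example-ingestion-lambda",  # placeholder name
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString=QUERY,
)

# Insights queries run asynchronously, so poll until the query finishes.
results = logs.get_query_results(queryId=query["queryId"])
while results["status"] in ("Scheduled", "Running"):
    time.sleep(1)
    results = logs.get_query_results(queryId=query["queryId"])
print(results["results"])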

Latest logs

This brings in the logs from the lambdas and displays them as a list. The data product name has to be entered manually, as we don't have a good data source for fetching all the names of data products at the moment. The filtering relies on the lambdas being updated to use structlog, so the presigned URL lambda works at the moment but others may not.

Things to present in show & tell

General observations

MatMoore commented 12 months ago

Why are we missing logs for PutObject events into raw/?

This doesn't seem to be a problem with the query, as the events are missing from the CloudTrail logs themselves:

jq '.Records[] | .requestParameters.key' < 013433889002_CloudTrail_eu-west-2_20231122T0005Z_OAPLwzzErFVj94wT.json
"landing/hmpps_use_of_force/v1/report_log/load_timestamp=20231122T000033Z/25f9e9cd-82a8-4272-a244-ef20af550cc5.csv"
"landing/hmpps_use_of_force/v1/knex_migrations_lock/load_timestamp=20231122T000035Z/f8c15a56-c72b-4dbb-80dc-14ef5e5a512a.csv"
"landing/hmpps_use_of_force/v1/knex_migrations/load_timestamp=20231122T000036Z/a0130c4a-e715-4354-9927-31506ba0188d.csv"
"landing/hmpps_use_of_force/v1/report/load_timestamp=20231122T000038Z/994f3ab3-9924-4d23-8a16-978710cbd3da.csv"
"landing/hmpps_use_of_force/v1/statement_amendments/load_timestamp=20231122T000041Z/eaa1f31b-d985-45ab-ad3b-a7d7c8645209.csv"
"landing/hmpps_use_of_force/v1/statement/load_timestamp=20231122T000039Z/61fe9380-311f-4640-8642-859a8f231ec3.csv"
"curated/hmpps_use_of_force/v1/report_log/extraction_timestamp=20231122T000033Z/20231122_000046_00005_67pwh_442732aa-0e1a-47b4-b9b4-bf7ce3fd9346"
"curated/hmpps_use_of_force/v1/knex_migrations_lock/extraction_timestamp=20231122T000035Z/20231122_000047_00000_55mve_020b4a8c-96b6-48db-bfa4-c248f51a1cbb"
"curated/hmpps_use_of_force/v1/knex_migrations/extraction_timestamp=20231122T000036Z/20231122_000052_00000_nij5v_b0ef53c0-6ea0-42f7-8328-1489c686b92a"
"curated/hmpps_use_of_force/v1/report/extraction_timestamp=20231122T000038Z/20231122_000049_00000_evzqa_a660eb43-72a8-443f-af3a-cea0a967c70e"

Possibly this is due to use of the copy API instead of PutObject.
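One quick way to test that theory is to count the event names that actually appear in the delivered CloudTrail file (reusing the file from the jq example above):

import json
from collections import Counter

# Tally the S3 event names in the delivered CloudTrail file; if the raw/ writes
# show up as CopyObject rather than PutObject, the dashboard query needs widening.
with open("013433889002_CloudTrail_eu-west-2_20231122T0005Z_OAPLwzzErFVj94wT.json") as f:
    records = json.load(f)["Records"]

print(Counter(record["eventName"] for record in records))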

MatMoore commented 12 months ago

Another slight problem with relying on CloudTrail logs is that it can take up to 5 minutes for events to be delivered.

So the overview numbers are going to be up to 5 minutes stale. I think this is probably good enough, but it might be a source of confusion.

MatMoore commented 12 months ago

When we call client.copy in boto3, objects above 8 MB use a multipart upload. So the events we can monitor will be either CopyObject or the multipart upload ones.

See https://github.com/boto/s3transfer/blob/cc9345a3a0c64b193c25f5e014ba4faeefe7bbc7/s3transfer/copies.py#L133C7-L133C7

and https://docs.aws.amazon.com/AmazonS3/latest/userguide/cloudtrail-logging-s3-info.html#cloudtrail-object-level-tracking
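To make the threshold concrete, here is a minimal sketch of the managed copy (bucket and key names are invented); boto3 switches to a multipart copy once the object size crosses multipart_threshold, which defaults to 8 MB:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# multipart_threshold defaults to 8 MB; objects at or above it are copied with
# the multipart operations rather than a single CopyObject call, which changes
# the event names CloudTrail records.
config = TransferConfig(multipart_threshold=8 * 1024 * 1024)

s3.copy(
    CopySource={"Bucket": "example-landing", "Key": "landing/example/v1/file.csv"},
    Bucket="example-raw",
    Key="raw/example/v1/file.csv",
    Config=config,
)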