ministryofjustice / analytical-platform

Analytical Platform • This repository is defined and managed in Terraform
https://docs.analytical-platform.service.justice.gov.uk
MIT License

SPIKE - Investigation on Mapping all logging to Grafana Dashboard #2034

Closed PriyaBasker23 closed 10 months ago

PriyaBasker23 commented 1 year ago

User Story

As a data producer / controller, I want to observe the entire data flow, from registration through to the point where the data becomes accessible to users.

Value / Purpose

Every component in the data-as-a-product pipeline writes logs to CloudWatch, and these logs are connected to Grafana. Using these logs, we can configure a dashboard that gives us a visual representation of the data flow.

Hypothesis

For logging and monitoring (MVP), we will use CloudWatch Logs and Grafana.

Checklist

Definition of Done

MatMoore commented 1 year ago

Reference for grafana visualisations: https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/

Reference for cloudwatch data source: https://grafana.com/docs/grafana/latest/datasources/aws-cloudwatch/

Cloudwatch query syntax reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html

I'm imagining something like this:

[Image: mockup of the proposed dashboard]

We need to discuss how the CloudWatch Logs query would work before we can move it into Grafana. As far as I can see, we don't have the data product as a field we can filter on at the moment, since we're logging unstructured data.

This is the list of fields CloudWatch Logs Insights discovers at the moment: [Image: discovered fields]

This guide recommends using structlog to output structured data instead: https://docs.aws.amazon.com/lambda/latest/operatorguide/parse-logs.html
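As a rough sketch of what that could look like (the handler, event name, and field names below are made up for illustration, not what our lambdas currently emit), structlog can render each log line as JSON so that Logs Insights discovers the fields automatically:

import structlog

structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),  # one JSON object per log line
    ]
)

logger = structlog.get_logger()

def handler(event, context):
    # "data_product" is a hypothetical field name; once it is logged as JSON,
    # Insights would let us filter on it, e.g.
    #   filter data_product = "hmpps_use_of_force"
    logger.info("object_received", data_product="hmpps_use_of_force", zone="landing")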

I'll do some more experimenting in AWS console tomorrow.

MatMoore commented 1 year ago

Corresponding apps-and-tools ticket: https://github.com/ministryofjustice/data-platform/issues/1888

MatMoore commented 1 year ago

Parked this ticket until the second half of the sprint, by which point we hope to have a Grafana instance to play with.

Some other notes for later:

MatMoore commented 1 year ago

We want to be able to see:

Could we show multiple datasets as series on the same timeline?

Is it possible to click a point in the graph and drill down to logs for that dataset/lambda?

Events we could potentially visualise per ingestion:

We can start with the S3 events, perhaps.

MatMoore commented 1 year ago

https://grafana.com/grafana/plugins/grafana-athena-datasource/#:~:text=and%20variables.-,Annotations,-Annotations%20allow%20you

We might be able to show PutObject events as annotations on an empty time series.

MatMoore commented 1 year ago

Still blocked on this until https://github.com/ministryofjustice/data-platform/issues/2145 is done, but in the meantime I'm having a look at the Grafana docs and Athena queries.

I think the state timeline may work for us. The state would be something like {landing, raw, curated}, and there would be a series for each data product version uploaded in the time range.

I need to try this with actual data though.
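As a starting point, something like the following Athena query might produce those series. The cloudtrail_logs table, column names, and output location are all assumptions about how the trail is mapped into Athena, and boto3 is only used here to make the sketch self-contained (in Grafana, the SQL would go straight into the Athena data source panel):

import boto3

# Derive the zone (landing/raw/curated) and the data product name from the
# S3 key recorded in each PutObject event.
QUERY = """
SELECT eventtime,
       split_part(json_extract_scalar(requestparameters, '$.key'), '/', 1) AS zone,
       split_part(json_extract_scalar(requestparameters, '$.key'), '/', 2) AS data_product
FROM cloudtrail_logs
WHERE eventname = 'PutObject'
ORDER BY eventtime
"""

athena = boto3.client("athena", region_name="eu-west-2")
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(execution["QueryExecutionId"])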

MatMoore commented 1 year ago

For linking visualisations together, look into data links: https://grafana.com/docs/grafana/latest/panels-visualizations/configure-data-links/#data-links

MatMoore commented 1 year ago

We don't have detailed CloudWatch metrics enabled for API Gateway, but this would probably be very useful for understanding API usage.
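For reference, detailed metrics are a per-stage method setting. In our case the change belongs in Terraform, but as an illustration, the equivalent call via boto3 (the REST API ID and stage name are placeholders) would be:

import boto3

apigateway = boto3.client("apigateway", region_name="eu-west-2")

# Enable detailed CloudWatch metrics for every method on the stage.
# "abc123" and "prod" are placeholder values.
apigateway.update_stage(
    restApiId="abc123",
    stageName="prod",
    patchOperations=[
        {"op": "replace", "path": "/*/*/metrics/enabled", "value": "true"},
    ],
)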

Some more resources:

MatMoore commented 1 year ago

Summary of findings

Reference information

Best practice information and case studies

Videos for learning Grafana

Example dashboards

Data platform ingestion

The top row shows the flow of data through the buckets based on PutObject CloudTrail events, so you can see how much is getting through.

Some caveats: it seems we are missing events from the raw zone for some reason, and the fail zone has no data because nothing is being written there at the moment. Potentially the whole flow could be visualised as a Sankey chart, but I went with simple numbers to start with.

This won't show how long data takes to flow through the system. To understand that, I think we would need a data source that can correlate requests between the different lambdas, like X-Ray perhaps.

Below that I experimented with API Gateway metrics, but it seems we are not collecting this data, so we would have to enable that in Terraform. The idea is that we can surface any errors and latency issues with the API. Similarly, we could add panels to monitor for lambda executions that exceeded their maximum number of retries (meaning we will have lost data).

Data product view

This demonstrates how we might drill down into specific data products, but the dashboard itself is just showing the PutObject events again. I was thinking we could show some kind of timeline of uploads here, but I haven't come up with a nice way to visualise it yet. I think it would be possible to graph it by grouping logs into 5-minute bins and showing the count for each one; see the sketch below.
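A minimal sketch of that query, assuming the structured event name and log group below (both placeholders), run through boto3 for illustration; in Grafana the query string would go directly into the CloudWatch data source:

import time
import boto3

logs = boto3.client("logs", region_name="eu-west-2")

# Count upload events in 5-minute buckets over the last hour.
QUERY = """
filter event = 'object_received'
| stats count() as uploads by bin(5m)
"""

query = logs.start_query(
    logGroupName="/aws/lambda/example-ingestion-lambda",  # placeholder name
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString=QUERY,
)

# Insights queries run asynchronously, so poll until the query finishes.
results = logs.get_query_results(queryId=query["queryId"])
while results["status"] in ("Scheduled", "Running"):
    time.sleep(1)
    results = logs.get_query_results(queryId=query["queryId"])
print(results["results"])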

Latest logs

This brings in the logs from the lambdas and displays them as a list. The data product name has to be entered manually, as we don't have a good data source for fetching all the names of data products at the moment. The filtering relies on the lambdas being updated to use structlog, so the presigned URL lambda works at the moment but others may not.

Things to present in show & tell

General observations

MatMoore commented 12 months ago

Why are we missing logs for PutObject events into raw/?

This doesn't seem to be a problem with the query, as the events are missing from the CloudTrail logs themselves:

jq '.Records[] | .requestParameters.key' < 013433889002_CloudTrail_eu-west-2_20231122T0005Z_OAPLwzzErFVj94wT.json
"landing/hmpps_use_of_force/v1/report_log/load_timestamp=20231122T000033Z/25f9e9cd-82a8-4272-a244-ef20af550cc5.csv"
"landing/hmpps_use_of_force/v1/knex_migrations_lock/load_timestamp=20231122T000035Z/f8c15a56-c72b-4dbb-80dc-14ef5e5a512a.csv"
"landing/hmpps_use_of_force/v1/knex_migrations/load_timestamp=20231122T000036Z/a0130c4a-e715-4354-9927-31506ba0188d.csv"
"landing/hmpps_use_of_force/v1/report/load_timestamp=20231122T000038Z/994f3ab3-9924-4d23-8a16-978710cbd3da.csv"
"landing/hmpps_use_of_force/v1/statement_amendments/load_timestamp=20231122T000041Z/eaa1f31b-d985-45ab-ad3b-a7d7c8645209.csv"
"landing/hmpps_use_of_force/v1/statement/load_timestamp=20231122T000039Z/61fe9380-311f-4640-8642-859a8f231ec3.csv"
"curated/hmpps_use_of_force/v1/report_log/extraction_timestamp=20231122T000033Z/20231122_000046_00005_67pwh_442732aa-0e1a-47b4-b9b4-bf7ce3fd9346"
"curated/hmpps_use_of_force/v1/knex_migrations_lock/extraction_timestamp=20231122T000035Z/20231122_000047_00000_55mve_020b4a8c-96b6-48db-bfa4-c248f51a1cbb"
"curated/hmpps_use_of_force/v1/knex_migrations/extraction_timestamp=20231122T000036Z/20231122_000052_00000_nij5v_b0ef53c0-6ea0-42f7-8328-1489c686b92a"
"curated/hmpps_use_of_force/v1/report/extraction_timestamp=20231122T000038Z/20231122_000049_00000_evzqa_a660eb43-72a8-443f-af3a-cea0a967c70e"

Possibly this is due to use of the copy API instead of PutObject.
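One quick way to test that theory is to count the event names that actually appear in the delivered CloudTrail file (reusing the file from the jq example above):

import json
from collections import Counter

# Tally the S3 event names in the delivered CloudTrail file; if the raw/ writes
# show up as CopyObject rather than PutObject, the dashboard query needs widening.
with open("013433889002_CloudTrail_eu-west-2_20231122T0005Z_OAPLwzzErFVj94wT.json") as f:
    records = json.load(f)["Records"]

print(Counter(record["eventName"] for record in records))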

MatMoore commented 12 months ago

Another slight problem with relying on CloudTrail logs is that it can take up to 5 minutes for events to be delivered.

So the overview numbers are going to be up to 5 minutes stale. I think this is probably good enough, but it might be a source of confusion.

MatMoore commented 12 months ago

When we call client.copy in boto3, objects above 8 MB use a multipart upload. So the events we can monitor will be either CopyObject or the multipart upload ones.

See https://github.com/boto/s3transfer/blob/cc9345a3a0c64b193c25f5e014ba4faeefe7bbc7/s3transfer/copies.py#L133C7-L133C7

and https://docs.aws.amazon.com/AmazonS3/latest/userguide/cloudtrail-logging-s3-info.html#cloudtrail-object-level-tracking
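To make the threshold concrete, here is a minimal sketch of the managed copy (bucket and key names are invented); boto3 switches to a multipart copy once the object size crosses multipart_threshold, which defaults to 8 MB:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# multipart_threshold defaults to 8 MB; objects at or above it are copied with
# the multipart operations rather than a single CopyObject call, which changes
# the event names CloudTrail records.
config = TransferConfig(multipart_threshold=8 * 1024 * 1024)

s3.copy(
    CopySource={"Bucket": "example-landing", "Key": "landing/example/v1/file.csv"},
    Bucket="example-raw",
    Key="raw/example/v1/file.csv",
    Config=config,
)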