Reference for Grafana visualisations: https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/
Reference for the CloudWatch data source: https://grafana.com/docs/grafana/latest/datasources/aws-cloudwatch/
CloudWatch query syntax reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html
I'm imagining something like this:
We need to discuss how the CloudWatch Logs query would work before we can move it into Grafana. As far as I can see, we don't have the data product name as a field we can filter on at the moment, since we're logging unstructured data.
This is the list of fields CloudWatch Logs Insights currently discovers:
This guide recommends using structlog to output structured data instead: https://docs.aws.amazon.com/lambda/latest/operatorguide/parse-logs.html
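As a sketch of what that could look like in one of our lambdas (the event text and field names here are illustrative, taken from the query idea below, not from our actual code):

import structlog

# Emit JSON so CloudWatch Logs Insights discovers fields automatically.
structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

logger = structlog.get_logger()

# Each key/value pair becomes a filterable field in Logs Insights, e.g.
#   filter data_product_name = "example_prison_data_product"
logger.info(
    "presigned url generated",
    lambda_name="data_product_presigned_url_development",
    data_product_name="example_prison_data_product",
)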
I'll do some more experimenting in the AWS console tomorrow.
Corresponding apps-and-tools ticket: https://github.com/ministryofjustice/data-platform/issues/1888
Parked this ticket until the second half of the sprint, by which point we hope to have a Grafana instance to play with.
Some other notes for later:
fields @timestamp, @message
| filter lambda_name = "data_product_presigned_url_development"
| filter data_product_name = "example_prison_data_product"
| limit 100
We want to be able to see:
Could we show multiple datasets as series on the same timeline?
Is it possible to click a point in the graph and drill down to logs for that dataset/lambda?
Events we could potentially visualise per ingestion:
We could start with the S3 events, perhaps.
We might be able to show PutObject events as annotations on an empty time series.
Still blocked on this until https://github.com/ministryofjustice/data-platform/issues/2145 is done, but in the meantime I'm having a look at the Grafana docs and Athena queries.
I think the state timeline may work for us. The state would be something like {landing, raw, curated}, and there would be a series for each data product version uploaded in the time range.
I need to try this with actual data though.
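In the meantime, a rough sketch of how that state could be derived from the CloudTrail PutObject events, run through boto3 for experimenting. The log group name, the message format assumed by the parse step, and the capture group names are all assumptions, not confirmed configuration:

import time
import boto3

logs = boto3.client("logs")

# Pull the zone, data product and version out of the object key in each
# PutObject event, then take the latest zone per product/version as the state.
QUERY = r"""
filter eventName = "PutObject"
| parse @message /"key":"(?<zone>[^\/]+)\/(?<data_product>[^\/]+)\/(?<version>[^\/]+)\//
| filter zone in ["landing", "raw", "curated"]
| stats latest(zone) as state by data_product, version, bin(5m)
"""

query_id = logs.start_query(
    logGroupName="/cloudtrail/data-platform",  # placeholder log group name
    startTime=int(time.time()) - 6 * 3600,
    endTime=int(time.time()),
    queryString=QUERY,
)["queryId"]

time.sleep(5)  # Logs Insights queries run asynchronously
print(logs.get_query_results(queryId=query_id)["results"])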
For linking visualisations together, look into data links: https://grafana.com/docs/grafana/latest/panels-visualizations/configure-data-links/#data-links
We don't have detailed CloudWatch metrics enabled for API Gateway, but they would probably be very useful for understanding API usage.
Some more resources:
The top row shows a flow of data through the buckets based on PutObject CloudTrail events, so you can see how much is getting through.
Some caveats: it seems like we are missing events from the raw zone for some reason, and the fail zone has no data because nothing is being written there at the moment. Potentially this could all be visualised as a Sankey chart showing the whole flow, but I went with simple numbers to start with.
This won't show how long data takes to flow through the system. To understand that, I think we would need a data source that can correlate requests between the different lambdas, like X-Ray perhaps.
Below that, I experimented with API Gateway metrics, but it seems we are not collecting this data, so we would have to enable it in Terraform. The idea is that we can surface any errors and latency issues with the API. Similarly, we could add panels to monitor for lambda executions that exceeded their maximum number of retries (meaning we will have lost data).
This demonstrates how we might drill down into specific data products, but the dashboard itself is just showing the PutObject events again. I was thinking we could show some kind of timeline of uploads here, but I haven't come up with a nice way to visualise it yet. I think it would be possible to graph it like this, where we group logs into 5-minute bins and show the count for each one:
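A rough sketch of that binning query, again run through boto3 for experimenting (the log group name and the substring filter for one data product are placeholders):

import time
import boto3

logs = boto3.client("logs")

# Count uploads for one data product in 5 minute bins.
query_id = logs.start_query(
    logGroupName="/cloudtrail/data-platform",  # placeholder log group name
    startTime=int(time.time()) - 24 * 3600,
    endTime=int(time.time()),
    queryString=(
        'filter eventName = "PutObject"'
        ' | filter @message like "example_prison_data_product"'
        ' | stats count(*) as uploads by bin(5m)'
    ),
)["queryId"]

time.sleep(5)  # Logs Insights queries run asynchronously
print(logs.get_query_results(queryId=query_id)["results"])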
This brings in the logs from the lambdas and displays them as a list. The data product name has to be entered manually, as we don't have a good data source for fetching all the data product names at the moment. The filtering relies on the lambdas being updated to use structlog, so the presigned URL lambda works at the moment, but others may not.
Why are we missing logs for PutObject events into raw/?
This doesn't seem to be a problem with the query, as the events are missing from the CloudTrail logs themselves:
jq '.Records[] | .requestParameters.key' < 013433889002_CloudTrail_eu-west-2_20231122T0005Z_OAPLwzzErFVj94wT.json
"landing/hmpps_use_of_force/v1/report_log/load_timestamp=20231122T000033Z/25f9e9cd-82a8-4272-a244-ef20af550cc5.csv"
"landing/hmpps_use_of_force/v1/knex_migrations_lock/load_timestamp=20231122T000035Z/f8c15a56-c72b-4dbb-80dc-14ef5e5a512a.csv"
"landing/hmpps_use_of_force/v1/knex_migrations/load_timestamp=20231122T000036Z/a0130c4a-e715-4354-9927-31506ba0188d.csv"
"landing/hmpps_use_of_force/v1/report/load_timestamp=20231122T000038Z/994f3ab3-9924-4d23-8a16-978710cbd3da.csv"
"landing/hmpps_use_of_force/v1/statement_amendments/load_timestamp=20231122T000041Z/eaa1f31b-d985-45ab-ad3b-a7d7c8645209.csv"
"landing/hmpps_use_of_force/v1/statement/load_timestamp=20231122T000039Z/61fe9380-311f-4640-8642-859a8f231ec3.csv"
"curated/hmpps_use_of_force/v1/report_log/extraction_timestamp=20231122T000033Z/20231122_000046_00005_67pwh_442732aa-0e1a-47b4-b9b4-bf7ce3fd9346"
"curated/hmpps_use_of_force/v1/knex_migrations_lock/extraction_timestamp=20231122T000035Z/20231122_000047_00000_55mve_020b4a8c-96b6-48db-bfa4-c248f51a1cbb"
"curated/hmpps_use_of_force/v1/knex_migrations/extraction_timestamp=20231122T000036Z/20231122_000052_00000_nij5v_b0ef53c0-6ea0-42f7-8328-1489c686b92a"
"curated/hmpps_use_of_force/v1/report/extraction_timestamp=20231122T000038Z/20231122_000049_00000_evzqa_a660eb43-72a8-443f-af3a-cea0a967c70e"
Possibly this is due to use of the copy API instead of PutObject.
Another slight problem with relying on CloudTrail logs is that it can take up to 5 minutes for events to be delivered.
This means the overview numbers are going to be up to 5 minutes stale. I think this is probably good enough, but it might be a source of confusion.
When we call client.copy in boto3, objects above 8 MB use a multipart upload, so the events we can monitor will be either CopyObject or the multipart upload ones.
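A minimal sketch of that call with the threshold made explicit (bucket and key names are placeholders; 8 MB is boto3's default multipart_threshold):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# boto3's managed copy switches to a multipart copy above multipart_threshold,
# so for large objects CloudTrail records the multipart events (e.g.
# CompleteMultipartUpload) rather than a single CopyObject event.
config = TransferConfig(multipart_threshold=8 * 1024 * 1024)

s3.copy(
    CopySource={"Bucket": "landing-bucket", "Key": "landing/example/object.csv"},
    Bucket="raw-bucket",
    Key="raw/example/object.csv",
    Config=config,
)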
User Story
As a data producer / controller, I want to observe the entire data flow, from registration through to the point where the data becomes accessible to users
Value / Purpose
Every component in the data-as-a-product pipeline generates logs in CloudWatch, and these logs are connected to Grafana. Using these logs, we can configure a dashboard that gives us a visual representation of the data flow.
Hypothesis
For logging and monitoring (MVP), we will use CloudWatch Logs and Grafana.
Checklist
[x] Investigate and gain a basic understanding of dashboarding functionality.
[x] Develop a straightforward example illustrating the data flow journey within the dashboard.
[x] Document additional features that could be integrated for improvement.
[x] Identify challenging aspects that could be enhanced in the future.
Definition of Done