DPL-943 As a TL (Tom) I would like a data flow diagram of the Sample Tracking POC produced so that we can plan the implementation.

TWJW-SANGER commented 1 year ago

Description As a TL (Tom) I would like a data flow diagram of the Sample Tracking POC produced so that we can plan the implementation.

Who the primary contacts are for this work TW KT AY

Knowledge or Stake holders LH

Additional context or information

- Identify studies received in last 6 months - (limit to just SGE/Mave for production, but nice to be able to expand this easily to test for other studies) - Find plates submitted in above studies

Extract data from mlwh_event associated with each column or set of columns
Extract data from flowcell tables associated with NPG sequencing run - Combine the different data artefacts into the report.

In doing so: - Identify any common patterns or queries - Propose any interim data sets that are themselves generally useful.

The output should be:

a data flow diagram for the workflow to construct the Mocked Up report in #342
a set of user stories / tasks to implement the above data flow. e.g. one arge SQL program using temp or intermediate tables or use external application (Janitor) as orchestrator / cron job.... etc...

Implementation questions to answer:

What does the application consist of?
Where does the code live?
Where is it documented?
How updates are tested to ensure correctness?
[Nice to have] Do we need smoke tests? And if so where could they live and how would they be monitored?

javidahmed64592 commented 1 year ago

I have created the following diagrams for outlining the data flow for the two queries used to populate the mocked up report:

1) https://lucid.app/lucidchart/b7cc6467-1d04-4a79-81bf-5b732460cb23/edit?viewport_loc=416%2C699%2C2559%2C1236%2C0_0&invitationId=inv_045a4cd6-acda-4b6f-8f15-10503641e878

2) https://lucid.app/lucidchart/994f9e42-18f6-4066-9ccb-9f79c2b9ac12/edit?viewport_loc=590%2C272%2C1937%2C936%2C0_0&invitationId=inv_4d077428-0601-4475-a2c2-d2759184d381

javidahmed64592 commented 1 year ago

I have updated the Sample Tracking Data Flow (events) diagram to show the table joins more explicitly, show how the fields can be used to filter for samples, studies etc.

I have also added a proposal for two temporary tables to be created which can simplify the query used to create the mock up report. I've added example SQL queries which can be used by an application such as janitor to populate these tables, and an example query to show these temporary tables being used to generate the report columns.

I still need to finish updating the equivalent diagram for the flowcell runs, and that will help refine these queries.

yoldas commented 1 year ago

Just by looking at the Sample Data Tracking Flow (events), top half of the document, I could write a query and produce Data. It definitely explains the flow for the data extraction.

with 
study_events as (
  select events.id, event_types.key, subjects.friendly_name
  from events 
    join event_types on events.event_type_id = event_types.id
    join roles on events.id = roles.event_id
    join role_types on roles.role_type_id = role_types.id
    join subjects on roles.subject_id = subjects.id
  where role_types.key = 'study'
    and subjects.friendly_name = 'MAVE_SGE v0.2.1'
),
sample_events as (
  select events.id, events.occured_at, sample.name
  from events
    join roles on events.id = roles.event_id
    join role_types on roles.role_type_id = role_types.id
    join subjects on roles.subject_id = subjects.id
    -- using mlwarehouse
    join mlwarehouse.sample on subjects.uuid = unhex(replace(mlwarehouse.sample.uuid_sample_lims, '-', ''))
  where role_types.key = 'sample'
),
labware_plate as (
  select roles.event_id, subjects.friendly_name
  from roles
    join role_types on roles.role_type_id = role_types.id
    join subjects on roles.subject_id = subjects.id
  where role_types.key = 'labware'
    -- plate barcode
    and subjects.friendly_name = 'NT1785582I'
)
select study_events.id, study_events.key, study_events.friendly_name study_friendly_name,
  sample_events.occured_at, sample_events.name,
  labware_plate.friendly_name labware_friendly_name
from study_events
  join sample_events on study_events.id = sample_events.id
  join labware_plate on sample_events.id = labware_plate.event_id;

updated: fixed a typo

yoldas commented 1 year ago

Sample Data Tracking Flow (events), bottom half. Q1. Is it possible to create the proposed role_events table as view instead? How long would it take to query it using a date condition? Q2. Are the temporary tables for the session or are they to be real tables with some periodic refresh by app?

yoldas commented 1 year ago

flowcell diagram,

The query (bottom left) could be written with CTE blocks to increase readability and to match the flow diagram.
Sample flowcell run box (top middle), I want to point out that number of flowcells increases the number of rows in the result. Applying grouping early in a CTE will make the intention clear.

javidahmed64592 commented 1 year ago

Regarding my proposed role_events table:

Q1. Creating a view should be possible and allow for the database to be queried in the same way as using a WITH statement with the benefit of not having to repeat the block of code across multiple SQL queries. We also would not need to update any table schemas if we wanted to add/remove fields from this view depending on what is useful. This leads to the question of, where should the SQL query for creating this view live? I assume it only needs to be run when we want to create or update the view, and not need to run it when new rows are added to the other tables.

I checked how long the query in the (currently) WITH statement takes using a date condition (date >= 2023-09-01) and it varies between 20-30 seconds. Presumably querying the view will be quicker than that.

Q2. If the tables were to be real tables, we could periodically update them using janitor. We would need to create the table schema and add a new task to janitor which checks for updates and writes them to the new table, similar to the labware_location task. That being said, it may be more efficient to create it as a view instead, as having it as a real table may not be any more useful than just having it as a view since we would only currently use it for the session.

I personally like the idea of creating it as a view, and we could potentially do the same for my proposed role_subjects table.

@TWJW-SANGER @KatyTaylor @stevieing Any thoughts?

yoldas commented 1 year ago

"This leads to the question of, where should the SQL query for creating this view live?"

I think in our new reporting database.

javidahmed64592 commented 1 year ago

I have added links to the diagrams on the following Confluence page: https://ssg-confluence.internal.sanger.ac.uk/display/PSDPUB/Sample+Tracking+Proof+Of+Concept

sanger / General-Backlog-Items

DPL-943 As a TL (Tom) I would like a data flow diagram of the Sample Tracking POC produced so that we can plan the implementation. #345