sanger / General-Backlog-Items

Broad bucket to collate backlog items that have no obvious repository
0 stars 0 forks source link

DPL-943 As a TL (Tom) I would like a data flow diagram of the Sample Tracking POC produced so that we can plan the implementation. #345

Closed TWJW-SANGER closed 2 months ago

TWJW-SANGER commented 1 year ago

Description As a TL (Tom) I would like a data flow diagram of the Sample Tracking POC produced so that we can plan the implementation.

Who the primary contacts are for this work TW KT AY

Knowledge or Stake holders LH

Additional context or information

  - Identify studies received in last 6 months -  (limit to just SGE/Mave for production, but nice to be able to expand this easily to test for other studies)   - Find plates submitted in above studies

In doing so:    - Identify any common patterns or queries   - Propose any interim data sets that are themselves generally useful.

The output should be:

Implementation questions to answer:

javidahmed64592 commented 1 year ago

I have created the following diagrams for outlining the data flow for the two queries used to populate the mocked up report:

1) https://lucid.app/lucidchart/b7cc6467-1d04-4a79-81bf-5b732460cb23/edit?viewport_loc=416%2C699%2C2559%2C1236%2C0_0&invitationId=inv_045a4cd6-acda-4b6f-8f15-10503641e878

2) https://lucid.app/lucidchart/994f9e42-18f6-4066-9ccb-9f79c2b9ac12/edit?viewport_loc=590%2C272%2C1937%2C936%2C0_0&invitationId=inv_4d077428-0601-4475-a2c2-d2759184d381

javidahmed64592 commented 1 year ago

I have updated the Sample Tracking Data Flow (events) diagram to show the table joins more explicitly, show how the fields can be used to filter for samples, studies etc.

I have also added a proposal for two temporary tables to be created which can simplify the query used to create the mock up report. I've added example SQL queries which can be used by an application such as janitor to populate these tables, and an example query to show these temporary tables being used to generate the report columns.

I still need to finish updating the equivalent diagram for the flowcell runs, and that will help refine these queries.

yoldas commented 1 year ago

Just by looking at the Sample Data Tracking Flow (events), top half of the document, I could write a query and produce Data. It definitely explains the flow for the data extraction.

with 
study_events as (
  select events.id, event_types.key, subjects.friendly_name
  from events 
    join event_types on events.event_type_id = event_types.id
    join roles on events.id = roles.event_id
    join role_types on roles.role_type_id = role_types.id
    join subjects on roles.subject_id = subjects.id
  where role_types.key = 'study'
    and subjects.friendly_name = 'MAVE_SGE v0.2.1'
),
sample_events as (
  select events.id, events.occured_at, sample.name
  from events
    join roles on events.id = roles.event_id
    join role_types on roles.role_type_id = role_types.id
    join subjects on roles.subject_id = subjects.id
    -- using mlwarehouse
    join mlwarehouse.sample on subjects.uuid = unhex(replace(mlwarehouse.sample.uuid_sample_lims, '-', ''))
  where role_types.key = 'sample'
),
labware_plate as (
  select roles.event_id, subjects.friendly_name
  from roles
    join role_types on roles.role_type_id = role_types.id
    join subjects on roles.subject_id = subjects.id
  where role_types.key = 'labware'
    -- plate barcode
    and subjects.friendly_name = 'NT1785582I'
)
select study_events.id, study_events.key, study_events.friendly_name study_friendly_name,
  sample_events.occured_at, sample_events.name,
  labware_plate.friendly_name labware_friendly_name
from study_events
  join sample_events on study_events.id = sample_events.id
  join labware_plate on sample_events.id = labware_plate.event_id;

updated: fixed a typo

yoldas commented 1 year ago

Sample Data Tracking Flow (events), bottom half. Q1. Is it possible to create the proposed role_events table as view instead? How long would it take to query it using a date condition? Q2. Are the temporary tables for the session or are they to be real tables with some periodic refresh by app?

yoldas commented 1 year ago

flowcell diagram,

javidahmed64592 commented 1 year ago

Regarding my proposed role_events table:

Q1. Creating a view should be possible and allow for the database to be queried in the same way as using a WITH statement with the benefit of not having to repeat the block of code across multiple SQL queries. We also would not need to update any table schemas if we wanted to add/remove fields from this view depending on what is useful. This leads to the question of, where should the SQL query for creating this view live? I assume it only needs to be run when we want to create or update the view, and not need to run it when new rows are added to the other tables.

I checked how long the query in the (currently) WITH statement takes using a date condition (date >= 2023-09-01) and it varies between 20-30 seconds. Presumably querying the view will be quicker than that.

Q2. If the tables were to be real tables, we could periodically update them using janitor. We would need to create the table schema and add a new task to janitor which checks for updates and writes them to the new table, similar to the labware_location task. That being said, it may be more efficient to create it as a view instead, as having it as a real table may not be any more useful than just having it as a view since we would only currently use it for the session.

I personally like the idea of creating it as a view, and we could potentially do the same for my proposed role_subjects table.

@TWJW-SANGER @KatyTaylor @stevieing Any thoughts?

yoldas commented 1 year ago

"This leads to the question of, where should the SQL query for creating this view live?"

I think in our new reporting database.

javidahmed64592 commented 1 year ago

I have added links to the diagrams on the following Confluence page: https://ssg-confluence.internal.sanger.ac.uk/display/PSDPUB/Sample+Tracking+Proof+Of+Concept