Closed TWJW-SANGER closed 2 months ago
I have created the following diagrams for outlining the data flow for the two queries used to populate the mocked up report:
I have updated the Sample Tracking Data Flow (events) diagram to show the table joins more explicitly, show how the fields can be used to filter for samples, studies etc.
I have also added a proposal for two temporary tables to be created which can simplify the query used to create the mock up report. I've added example SQL queries which can be used by an application such as janitor
to populate these tables, and an example query to show these temporary tables being used to generate the report columns.
I still need to finish updating the equivalent diagram for the flowcell runs, and that will help refine these queries.
Just by looking at the Sample Data Tracking Flow (events), top half of the document, I could write a query and produce Data. It definitely explains the flow for the data extraction.
with
study_events as (
select events.id, event_types.key, subjects.friendly_name
from events
join event_types on events.event_type_id = event_types.id
join roles on events.id = roles.event_id
join role_types on roles.role_type_id = role_types.id
join subjects on roles.subject_id = subjects.id
where role_types.key = 'study'
and subjects.friendly_name = 'MAVE_SGE v0.2.1'
),
sample_events as (
select events.id, events.occured_at, sample.name
from events
join roles on events.id = roles.event_id
join role_types on roles.role_type_id = role_types.id
join subjects on roles.subject_id = subjects.id
-- using mlwarehouse
join mlwarehouse.sample on subjects.uuid = unhex(replace(mlwarehouse.sample.uuid_sample_lims, '-', ''))
where role_types.key = 'sample'
),
labware_plate as (
select roles.event_id, subjects.friendly_name
from roles
join role_types on roles.role_type_id = role_types.id
join subjects on roles.subject_id = subjects.id
where role_types.key = 'labware'
-- plate barcode
and subjects.friendly_name = 'NT1785582I'
)
select study_events.id, study_events.key, study_events.friendly_name study_friendly_name,
sample_events.occured_at, sample_events.name,
labware_plate.friendly_name labware_friendly_name
from study_events
join sample_events on study_events.id = sample_events.id
join labware_plate on sample_events.id = labware_plate.event_id;
updated: fixed a typo
Sample Data Tracking Flow (events), bottom half. Q1. Is it possible to create the proposed role_events table as view instead? How long would it take to query it using a date condition? Q2. Are the temporary tables for the session or are they to be real tables with some periodic refresh by app?
flowcell diagram,
Regarding my proposed role_events
table:
Q1. Creating a view should be possible and allow for the database to be queried in the same way as using a WITH
statement with the benefit of not having to repeat the block of code across multiple SQL queries. We also would not need to update any table schemas if we wanted to add/remove fields from this view depending on what is useful. This leads to the question of, where should the SQL query for creating this view live? I assume it only needs to be run when we want to create or update the view, and not need to run it when new rows are added to the other tables.
I checked how long the query in the (currently) WITH
statement takes using a date condition (date >= 2023-09-01)
and it varies between 20-30 seconds. Presumably querying the view will be quicker than that.
Q2. If the tables were to be real tables, we could periodically update them using janitor
. We would need to create the table schema and add a new task to janitor
which checks for updates and writes them to the new table, similar to the labware_location
task. That being said, it may be more efficient to create it as a view instead, as having it as a real table may not be any more useful than just having it as a view since we would only currently use it for the session.
I personally like the idea of creating it as a view, and we could potentially do the same for my proposed role_subjects
table.
@TWJW-SANGER @KatyTaylor @stevieing Any thoughts?
"This leads to the question of, where should the SQL query for creating this view live?"
I think in our new reporting database.
I have added links to the diagrams on the following Confluence page: https://ssg-confluence.internal.sanger.ac.uk/display/PSDPUB/Sample+Tracking+Proof+Of+Concept
Description As a TL (Tom) I would like a data flow diagram of the Sample Tracking POC produced so that we can plan the implementation.
Who the primary contacts are for this work TW KT AY
Knowledge or Stake holders LH
Additional context or information
- Identify studies received in last 6 months - (limit to just SGE/Mave for production, but nice to be able to expand this easily to test for other studies) - Find plates submitted in above studies
In doing so: - Identify any common patterns or queries - Propose any interim data sets that are themselves generally useful.
The output should be:
Implementation questions to answer: