Closed harrietc52 closed 3 weeks ago
Query on Confluence here
/***************************************************************************************************
Create Date: 2022-11-11
Author: Harriet Craven
Description: Stored Procedure for Automating the Billing report
Used By: Richard Rance, via Tableau
Parameter(s): @from_date (DATETIME)
@to_date (DATETIME)
Usage: CALL billing_report_stored_proc('2022-06-25 00:00:00', '2022-07-25 23:59:59');
Additional notes: Date parameters are used to get runs for given timeframe (usually a financial month)
Maintain: When updating this query, drop and recreate the stored procedure in MLWH Training and Production
Documentation: DPL-437 https://github.com/sanger/unified_warehouse/issues/386
Confluence https://ssg-confluence.internal.sanger.ac.uk/display/PSDPUB/Automating+Billing+Report
****************************************************************************************************/
-- Change delimiter to //
delimiter //
-- Create Stored Procedure
CREATE PROCEDURE billing_report_stored_proc (IN from_date DATETIME, IN to_date DATETIME)
BEGIN
DECLARE control_phix VARCHAR(30);
DECLARE control_illumina VARCHAR(30);
SET control_phix = 'Heron PhiX';
SET control_illumina = 'Illumina Controls';
-- Outer query
-- This grouping calculates the `total` amount of lanes occupied by samples in a given "group" (see end of query)
SELECT
iseq_run_lane_metrics.instrument_model AS platform
, iseq_flowcell.cost_code AS project_cost_code
, study.name AS study_name
, IF(iseq_run_lane_metrics.qc_seq = 1, 'passed', IF(iseq_run_lane_metrics.qc_seq = '0', 'failed', iseq_run_lane_metrics.qc_seq ))
AS qc_outcome
, IF(iseq_run.rp__sbs_consumable_version = '1', 'v1', IF(iseq_run.rp__sbs_consumable_version = '3', 'v1.5', iseq_run.rp__sbs_consumable_version))
AS 'v1/1.5'
, IF(iseq_run.rp__workflow_type = 'NovaSeqXp', 'XP', IF(iseq_run.rp__workflow_type = 'NovaSeqStandard', 'No XP', iseq_run.rp__workflow_type) )
AS xp
, iseq_run.rp__flow_cell_mode AS sp
, iseq_run.rp__read1_number_of_cycles AS read1
, iseq_run.rp__read2_number_of_cycles AS read2
, SUM(lanes.proportion_of_lane_per_sample) AS total
FROM
iseq_run
INNER JOIN
(
-- Inner query 1
-- There can be multiple QC complete run events,
-- this query finds all "QC complete" runs within a given timeframe.
-- Group by run ID.
-- If there are more than 1 "QC complete" events for a given run ID,
-- select only the first completed run (based on min `date`)
SELECT
id_run
, MIN(date) AS qc_complete_date
FROM
iseq_run_status
INNER JOIN
iseq_run_status_dict
ON iseq_run_status_dict.id_run_status_dict = iseq_run_status.id_run_status_dict
WHERE
iseq_run_status_dict.description = 'qc complete'
AND iseq_run_status.date >= from_date
AND iseq_run_status.date <= to_date
GROUP BY
iseq_run_status.id_run
)
AS qc_complete
ON qc_complete.id_run = iseq_run.id_run
INNER JOIN
iseq_product_metrics
ON iseq_run.id_run = iseq_product_metrics.id_run
INNER JOIN
iseq_flowcell
ON iseq_product_metrics.id_iseq_flowcell_tmp = iseq_flowcell.id_iseq_flowcell_tmp
INNER JOIN
study
ON iseq_flowcell.id_study_tmp = study.id_study_tmp
INNER JOIN
iseq_run_lane_metrics
ON iseq_product_metrics.id_run = iseq_run_lane_metrics.id_run
AND iseq_product_metrics.position = iseq_run_lane_metrics.position
INNER JOIN
(
-- Inner query 2
-- Group samples by lane ID
-- Count the number of samples (exluding controls) in a lane
-- Assuming equal distribution, calculate the proportion of lane occupied per sample (1/ number of samples)
-- Append this information to the sample, joining on lane ID
SELECT
samples.*
, format(1 / COUNT(*), 10) AS proportion_of_lane_per_sample
FROM
(
-- Inner query 3
-- Get the samples for the specific runs
-- Excluding controls
SELECT
iseq_flowcell.entity_id_lims AS lane_id
, iseq_flowcell.cost_code AS project_cost_code
, study.name
FROM
iseq_run
INNER JOIN
(
-- Inner query 4
-- This is the same as Inner query 1 (see above)
SELECT
id_run
, MIN(date) AS qc_complete_date
FROM
iseq_run_status
INNER JOIN
iseq_run_status_dict
ON iseq_run_status_dict.id_run_status_dict = iseq_run_status.id_run_status_dict
WHERE
iseq_run_status_dict.description = 'qc complete'
AND iseq_run_status.date >= from_date
AND iseq_run_status.date <= to_date
GROUP BY
iseq_run_status.id_run
)
AS qc_complete
ON qc_complete.id_run = iseq_run.id_run
INNER JOIN
iseq_product_metrics
ON iseq_run.id_run = iseq_product_metrics.id_run
INNER JOIN
iseq_run_lane_metrics
ON iseq_product_metrics.id_run = iseq_run_lane_metrics.id_run
AND iseq_product_metrics.position = iseq_run_lane_metrics.position
INNER JOIN
iseq_flowcell
ON iseq_product_metrics.id_iseq_flowcell_tmp = iseq_flowcell.id_iseq_flowcell_tmp
INNER JOIN
study
ON iseq_flowcell.id_study_tmp = study.id_study_tmp
WHERE
-- Controls are excluded, see Confluence documentation
study.name NOT IN (control_phix, control_illumina)
)
AS samples
GROUP BY
samples.lane_id
)
AS lanes
ON lanes.lane_id = iseq_flowcell.entity_id_lims
WHERE
-- Controls are excluded, see Confluence documentation
study.name NOT IN (control_phix, control_illumina)
GROUP BY
study.id_study_lims
, project_cost_code
, platform
, qc_outcome
, iseq_run.rp__workflow_type
, iseq_run.rp__flow_cell_mode
;
END
//
-- Change delimiter back to ;
delimiter ;
-- Call Stored Procedure
-- October (118 without controls)
-- date >= '2022-10-01 00:00:00'
-- date <= '2022-10-24 23:59:59'
CALL billing_report_stored_proc('2022-10-01 00:00:00', '2022-10-24 23:59:59');
-- September (195 without controls)
-- date >= '2022-08-25 00:00:00'
-- date <= '2022-09-30 23:59:59'
CALL billing_report_stored_proc('2022-08-25 00:00:00', '2022-09-30 23:59:59');
-- August (155 without controls)
-- from = '2022-07-26 00:00:00'
-- to = '2022-08-24 23:59:59'
CALL billing_report_stored_proc('2022-07-26 00:00:00', '2022-08-24 23:59:59');
-- July (146 without controls)
-- from = '2022-06-25 00:00:00'
-- to = '2022-07-25 23:59:59'
CALL billing_report_stored_proc('2022-06-25 00:00:00', '2022-07-25 23:59:59');
-- Drop Stored Procedure
DROP PROCEDURE billing_report_stored_proc;
@harrietc52 Excellent piece of work. Well structured and the documentation makes sense. It is quite impressive how you have managed to distill the billing report into a single query. This is proper business logic. Some questions:
@stevieing Thanks for the questions. Hopefully some improvements below:
2a The statement to create the stored procedure itself, can be accessed via any Database Management application (MySQL Workbench etc) by clicking on the stored procedure, then copy "Create Statement" - on the MLWH training and production database.
The other bits around the query, such as the comments, drop statement and testing statements are currently stored in the this GitHub story and the Confluence Billing documentation here.
There is a backlog item to create a repo for documenting queries #410
2b. Matt Francis is currently creating a Tableau view which will read from the stored procedure, so thats is how Rich will access it (not by running the stored procedure directly himself). The query isn't stored here, just called.
We've added a comment to the query, and referenced to look at the Confluence page for more info. And moved the controls into a constant
User story As a developer, I would to add comments to the billing query and improve the syntax of the query to make it more readable.
Who are the primary contacts for this story @harrietc52
Who is the nominated tester for UAT e.g. John S (don't include surnames in public repos)
Acceptance criteria To be considered successful the solution must allow:
Dependencies This story is blocked by the following dependencies:
References This story has a non-blocking relationship with:
Additional context Add any other context or screenshots about the feature request here.
Some suggestions to help