DPL-437-3 Billing report: Add comments and improve syntax

harrietc52 commented 2 years ago

User story As a developer, I would to add comments to the billing query and improve the syntax of the query to make it more readable.

Who are the primary contacts for this story @harrietc52

Who is the nominated tester for UAT e.g. John S (don't include surnames in public repos)

Acceptance criteria To be considered successful the solution must allow:

[x] Add thorough comments throughout the query
[x] Each subquery has a comment
[x] Improve the syntax throughout the query
[ ] Update the repo (if created) https://github.com/sanger/unified_warehouse/issues/410 where the query is stored

Dependencies This story is blocked by the following dependencies:

sanger/#

References This story has a non-blocking relationship with:

This is a spin-out story from https://github.com/sanger/unified_warehouse/issues/386

Additional context Add any other context or screenshots about the feature request here.

Some suggestions to help

This tool https://www.red-gate.com/website/sql-formatter might be useful for formatting

harrietc52 commented 1 year ago

Query on Confluence here

/***************************************************************************************************
Create Date:        2022-11-11
Author:             Harriet Craven
Description:        Stored Procedure for Automating the Billing report
Used By:            Richard Rance, via Tableau
Parameter(s):       @from_date (DATETIME)
                    @to_date (DATETIME)
Usage:              CALL billing_report_stored_proc('2022-06-25 00:00:00', '2022-07-25 23:59:59');
Additional notes:   Date parameters are used to get runs for given timeframe (usually a financial month)
Maintain:           When updating this query, drop and recreate the stored procedure in MLWH Training and Production
Documentation:      DPL-437     https://github.com/sanger/unified_warehouse/issues/386
                    Confluence  https://ssg-confluence.internal.sanger.ac.uk/display/PSDPUB/Automating+Billing+Report
****************************************************************************************************/

-- Change delimiter to //
delimiter //

-- Create Stored Procedure
CREATE PROCEDURE billing_report_stored_proc (IN from_date DATETIME, IN to_date DATETIME)

BEGIN
    DECLARE control_phix VARCHAR(30);
    DECLARE control_illumina VARCHAR(30);
    SET control_phix = 'Heron PhiX';
    SET control_illumina = 'Illumina Controls';

  -- Outer query
  -- This grouping calculates the `total` amount of lanes occupied by samples in a given "group" (see end of query)
  SELECT
    iseq_run_lane_metrics.instrument_model     AS platform
    , iseq_flowcell.cost_code                  AS project_cost_code
    , study.name                               AS study_name
    , IF(iseq_run_lane_metrics.qc_seq = 1, 'passed', IF(iseq_run_lane_metrics.qc_seq = '0', 'failed', iseq_run_lane_metrics.qc_seq ))
                                               AS qc_outcome
    , IF(iseq_run.rp__sbs_consumable_version = '1', 'v1', IF(iseq_run.rp__sbs_consumable_version = '3', 'v1.5', iseq_run.rp__sbs_consumable_version))
                                               AS 'v1/1.5'
    , IF(iseq_run.rp__workflow_type = 'NovaSeqXp', 'XP', IF(iseq_run.rp__workflow_type = 'NovaSeqStandard', 'No XP', iseq_run.rp__workflow_type) )
                                               AS xp
    , iseq_run.rp__flow_cell_mode              AS sp
    , iseq_run.rp__read1_number_of_cycles      AS read1
    , iseq_run.rp__read2_number_of_cycles      AS read2
    , SUM(lanes.proportion_of_lane_per_sample) AS total
  FROM
    iseq_run
    INNER JOIN
      (
        -- Inner query 1
        -- There can be multiple QC complete run events,
        -- this query finds all "QC complete" runs within a given timeframe.
        -- Group by run ID.
        -- If there are more than 1 "QC complete" events for a given run ID,
        -- select only the first completed run (based on min `date`)
        SELECT
          id_run
          , MIN(date) AS qc_complete_date
        FROM
          iseq_run_status
          INNER JOIN
            iseq_run_status_dict
            ON iseq_run_status_dict.id_run_status_dict = iseq_run_status.id_run_status_dict
        WHERE
          iseq_run_status_dict.description = 'qc complete'
          AND iseq_run_status.date >= from_date
          AND iseq_run_status.date <= to_date
        GROUP BY
          iseq_run_status.id_run
      )
      AS qc_complete
      ON qc_complete.id_run = iseq_run.id_run
    INNER JOIN
      iseq_product_metrics
      ON iseq_run.id_run = iseq_product_metrics.id_run
    INNER JOIN
      iseq_flowcell
      ON iseq_product_metrics.id_iseq_flowcell_tmp = iseq_flowcell.id_iseq_flowcell_tmp
    INNER JOIN
      study
      ON iseq_flowcell.id_study_tmp = study.id_study_tmp
    INNER JOIN
      iseq_run_lane_metrics
      ON iseq_product_metrics.id_run = iseq_run_lane_metrics.id_run
      AND iseq_product_metrics.position = iseq_run_lane_metrics.position
    INNER JOIN
      (
        -- Inner query 2
        -- Group samples by lane ID
        -- Count the number of samples (exluding controls) in a lane
        -- Assuming equal distribution, calculate the proportion of lane occupied per sample (1/ number of samples)
        -- Append this information to the sample, joining on lane ID
        SELECT
          samples.*
          , format(1 / COUNT(*), 10) AS proportion_of_lane_per_sample
        FROM
          (
            -- Inner query 3
            -- Get the samples for the specific runs
            -- Excluding controls
            SELECT
              iseq_flowcell.entity_id_lims AS lane_id
              , iseq_flowcell.cost_code AS project_cost_code
              , study.name
            FROM
              iseq_run
              INNER JOIN
                (
                  -- Inner query 4
                  -- This is the same as Inner query 1 (see above)
                  SELECT
                    id_run
                    , MIN(date) AS qc_complete_date
                  FROM
                    iseq_run_status
                    INNER JOIN
                      iseq_run_status_dict
                      ON iseq_run_status_dict.id_run_status_dict = iseq_run_status.id_run_status_dict
                  WHERE
                    iseq_run_status_dict.description = 'qc complete'
                    AND iseq_run_status.date >= from_date
                    AND iseq_run_status.date <= to_date
                  GROUP BY
                    iseq_run_status.id_run
                )
                AS qc_complete
                ON qc_complete.id_run = iseq_run.id_run
              INNER JOIN
                iseq_product_metrics
                ON iseq_run.id_run = iseq_product_metrics.id_run
              INNER JOIN
                iseq_run_lane_metrics
                ON iseq_product_metrics.id_run = iseq_run_lane_metrics.id_run
                AND iseq_product_metrics.position = iseq_run_lane_metrics.position
              INNER JOIN
                iseq_flowcell
                ON iseq_product_metrics.id_iseq_flowcell_tmp = iseq_flowcell.id_iseq_flowcell_tmp
              INNER JOIN
                study
                ON iseq_flowcell.id_study_tmp = study.id_study_tmp
            WHERE
                -- Controls are excluded, see Confluence documentation
              study.name NOT IN (control_phix, control_illumina)
          )
          AS samples
        GROUP BY
          samples.lane_id
      )
      AS lanes
      ON lanes.lane_id = iseq_flowcell.entity_id_lims
  WHERE
    -- Controls are excluded, see Confluence documentation
    study.name NOT IN (control_phix, control_illumina)
  GROUP BY
    study.id_study_lims
    , project_cost_code
    , platform
    , qc_outcome
    , iseq_run.rp__workflow_type
    , iseq_run.rp__flow_cell_mode
  ;
END
//

-- Change delimiter back to ;
delimiter ;

-- Call Stored Procedure

-- October (118 without controls)
-- date >= '2022-10-01 00:00:00'
-- date <= '2022-10-24 23:59:59'
CALL billing_report_stored_proc('2022-10-01 00:00:00', '2022-10-24 23:59:59');

-- September (195 without controls)
-- date >= '2022-08-25 00:00:00'
-- date <= '2022-09-30 23:59:59'
CALL billing_report_stored_proc('2022-08-25 00:00:00', '2022-09-30 23:59:59');

-- August (155 without controls)
-- from =  '2022-07-26 00:00:00'
-- to = '2022-08-24 23:59:59'
CALL billing_report_stored_proc('2022-07-26 00:00:00', '2022-08-24 23:59:59');

-- July (146 without controls)
-- from = '2022-06-25 00:00:00'
-- to =  '2022-07-25 23:59:59'
CALL billing_report_stored_proc('2022-06-25 00:00:00', '2022-07-25 23:59:59');

-- Drop Stored Procedure
DROP PROCEDURE billing_report_stored_proc;

stevieing commented 1 year ago

@harrietc52 Excellent piece of work. Well structured and the documentation makes sense. It is quite impressive how you have managed to distill the billing report into a single query. This is proper business logic. Some questions:

Inner query 4 says duplication of query 1. I am not entirely sure what that means.
Where is this query stored in real life. I am guessing if Rich is able to run this maybe it will be in Tableau or elswhere?
You have certain studies not included. I know that there is a reason for this but does that reason need to be in the documentation and also wondering if it was possible to maybe have those studies as a constant at the top of the query so you could add or remove as necessary?

harrietc52 commented 1 year ago

@stevieing Thanks for the questions. Hopefully some improvements below:

We've updated the comment here. Hopefully it is clearer now

2a The statement to create the stored procedure itself, can be accessed via any Database Management application (MySQL Workbench etc) by clicking on the stored procedure, then copy "Create Statement" - on the MLWH training and production database.

The other bits around the query, such as the comments, drop statement and testing statements are currently stored in the this GitHub story and the Confluence Billing documentation here.

There is a backlog item to create a repo for documenting queries #410

2b. Matt Francis is currently creating a Tableau view which will read from the stored procedure, so thats is how Rich will access it (not by running the stored procedure directly himself). The query isn't stored here, just called.

We've added a comment to the query, and referenced to look at the Confluence page for more info. And moved the controls into a constant

sanger / unified_warehouse

DPL-437-3 Billing report: Add comments and improve syntax #411