sanger / General-Backlog-Items

Broad bucket to collate backlog items that have no obvious repository
0 stars 0 forks source link

ST0004 As TL I would like to understand why the cost code and platform columns are showing null for most plates. #392

Open TWJW-SANGER opened 4 months ago

TWJW-SANGER commented 4 months ago

As TL I would like to understand why the cost code and platform columns are showing null for most plates.

Acceptance Criteria

Outputs Either identify an improvement to the query to resolve the issue (possibly by creating a new user story) OR Raise missing data with NPG as a ticket with examples OR Identify another means of resolving this

khelwood commented 3 months ago

@TWJW-SANGER Is there any example of a specific row that should have a cost code in the view and doesn't? or are they all supposed to have cost codes?

TWJW-SANGER commented 3 months ago

@khelwood I think the "DTOL_Darwin Tree of Life" study is interesting as some of the rows have cost codes (NT1646137P) and some don't (NT1647404T)....

khelwood commented 3 months ago

@TWJW-SANGER The cost code comes from the iseq_flowcell table, which is LEFT JOINed onto in the view. Unless that should be changed to a strict join, there will be nulls wherever there is no matching row in the flowcell table.

TWJW-SANGER commented 3 months ago

Ah, ok that makes sense.

So, if the plate/sample hasn't been sequenced then the cost code and platform will be NULL.

That might need some explanatory text on the report I guess.

I can find one example at least where a plate/sample has been sequenced and the cost code is still NULL.

select * from sample_tracking_view where cost_code IS NULL AND sequencing_start_count != 0;

Gives a row for SQPP-42562-L.

I guess this could be a data problem?

khelwood commented 3 months ago

@TWJW-SANGER The sample flowcell view has multiple rows for SQPP-42562-L, some with cost codes and some without. Is that to be expected? Either that is the source of the error; or it's a problem with the way the sample tracking view is aggregating the information from the flowcell view.

TWJW-SANGER commented 3 months ago

@khelwood I think I've tracked down the source of the issue for SQPP-42562-L.

The sample_tracking_view uses GROUP BY manifest_plate_barcode but some of the select statements are not using AGGREGATE functions, so unless the pre-grouped values for a column are constant - MySQL will semi-randomly pick a resulting value.

Looking at the ungrouped values for cost_code and platform for samples on SQPP-42562-L shows a number of different possible values, where presumably where some samples haven't been sequenced, some have and some sequenced multiple times under different cost codes.

I suggest that we alter the sample_tracking_view so that the SELECT has aggregating functions for everything other than study name and labware_human_barcode e.g.

SELECT sample_flowcell_view.study_name, MIN(IF(sample_events.event_type = 'sample_manifest.updated', sample_events.occured_at, NULL)) manifest_uploaded, sample_flowcell_view.labware_human_barcode manifest_plate_barcode, GROUP_CONCAT(DISTINCT sample_flowcell_view.pipeline_id_lims SEPARATOR ";") library_type, GROUP_CONCAT(DISTINCT sample_flowcell_view.cost_code SEPARATOR ";") cost_code, GROUP_CONCAT(DISTINCT sample_flowcell_view.instrument_model SEPARATOR ";") platform, MIN(IF(sample_events.event_type = 'labware.received', sample_events.occured_at, NULL)) labware_received, COUNT(DISTINCT(IF(sample_flowcell_view.qc_early IS NOT NULL, sample_flowcell_view.sample_uuid, NULL))) work_started_count, -- Count number of unique samples for this plate that have non-null QC timestamps for dilution MIN(sample_flowcell_view.qc_early) work_started_first, MAX(sample_flowcell_view.qc_late) work_started_last, COUNT(DISTINCT(IF(sample_events.event_type = 'library_start', sample_events.subject_uuid_bin, NULL))) library_start_count, MIN(IF(sample_events.event_type = 'library_start', sample_events.occured_at, NULL)) library_start_first, MAX(IF(sample_events.event_type = 'library_start', sample_events.occured_at, NULL)) library_start_last, COUNT(DISTINCT(IF(sample_events.event_type = 'library_complete', sample_events.subject_uuid_bin, NULL))) library_complete_count, MIN(IF(sample_events.event_type = 'library_complete', sample_events.occured_at, NULL)) library_complete_first, MAX(IF(sample_events.event_type = 'library_complete', sample_events.occured_at, NULL)) library_complete_last, COUNT(DISTINCT(IF(sample_events.event_type = 'sequencing_start', sample_events.subject_uuid_bin, NULL))) sequencing_run_start_count, MIN(IF(sample_events.event_type = 'sequencing_start', sample_events.occured_at, NULL)) sequencing_run_start_first, MAX(IF(sample_events.event_type = 'sequencing_start', sample_events.occured_at, NULL)) sequencing_run_start_last, COUNT(DISTINCT(IF(sample_events.event_type = 'sequencing_complete', sample_events.subject_uuid_bin, NULL))) sequencing_qc_complete_count, MIN(IF(sample_events.event_type = 'sequencing_complete', sample_events.occured_at, NULL)) sequencing_qc_complete_first, MAX(IF(sample_events.event_type = 'sequencing_complete', sample_events.occured_at, NULL)) sequencing_qc_complete_last

What do you think?

khelwood commented 3 months ago

@TWJW-SANGER Yes that sounds reasonable. That is more or less what I was suggesting in my previous reply

khelwood commented 3 months ago

@TWJW-SANGER With that change, SQPP-42562-L will be listed with the cost-code S2658;S4236