DPL-642-1 Add index to long_read_qc_result to be able to join with samples

emrojo commented 1 year ago

User story As a DB user I would like to be able to join long_read_qc_result with the samples using an indexed field so the queries perform better

Who are the primary contacts for this story Eduardo Martin

Who is the nominated tester for UAT Eduardo Martin

Acceptance criteria To be considered successful the solution must allow:

[x] Have an index we can use to join tables quicker
[x] When running SQL EXPLAIN on a query it should not display type=ALL when joining using the indexed field (we make sure that the index is in use) (SEE OUTPUT IN THE EXAMPLE):

Example EXPLAIN:

explain select *
from sample 
left outer join long_read_qc_result as lr_shearing_qc_comments on lr_shearing_qc_comments.sample_id=sample.supplier_name and 
    lr_shearing_qc_comments.assay_type_key='shearing_qc_comments';

CURRENT OUTPUT:

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   SIMPLE  sample  NULL    ALL NULL    NULL    NULL    NULL    1512863 100.00  NULL
1   SIMPLE  lr_shearing_qc_comments NULL    ALL NULL    NULL    NULL    NULL    69022   100.00  Using where; Using join buffer (Block Nested Loop)```

Dependencies This story is blocked by the following dependencies:

sanger/#

References This story has a non-blocking relationship with:

sanger/#

Additional context Add any other context or screenshots about the feature request here.

emrojo commented 1 year ago

Indexes tested for this query (will be in use for SSRs UAT):

-- qc comments
explain select distinct 
-- stock resource values
sr.labware_type,
sr.labware_human_barcode,
sr.labware_coordinate,
-- sample values
sample.name,  
sample.id_lims as id_lims,
sample.accession_number as accession_number,
sample.common_name as common_name,
sample.taxon_id as taxon_id,
sample.country_of_origin as country_of_origin,
sample.sanger_sample_id as sanger_sample_id,
sample.supplier_name as supplier_name,
sample.donor_id as donor_id,
sample.date_of_sample_collection as date_of_sample_collection,
sample.priority_level as priority_level,
sample.donor_id as donor_id,
-- qc values
lr_shearing_qc_comments.value as qc_shearing_comments, 
lr_sheared_femto_fragment_size.value as qc_sheared_femto_fragment_size,
lr_post_spri_concentration.value as qc_post_spri_concentration,
lr_post_spri_volume.value as qc_post_spri_volume,
lr_final_nano_drop_280.value as qc_final_nano_drop_280,
lr_final_nano_drop_230.value as qc_final_nano_drop_230,
lr_final_nano_drop.value as qc_final_nano_drop
from sample
inner join stock_resource as sr on sr.id_sample_tmp=sample.id_sample_tmp
inner join
(
-- shearing_qc_comments
select distinct sample.supplier_name, lr.value
from long_read_qc_result as lr_base 
inner join sample on sample.supplier_name=lr_base.sample_id
inner join stock_resource as sr on sr.id_sample_tmp=sample.id_sample_tmp
inner join long_read_qc_result as lr on lr.id_long_read_qc_result_tmp=lr_base.id_long_read_qc_result_tmp
where lr.assay_type_key='shearing_qc_comments'
) as lr_shearing_qc_comments on lr_shearing_qc_comments.supplier_name=sample.supplier_name
inner join
(
-- sheared_femto_fragment_size
select distinct sample.supplier_name, lr.value
from long_read_qc_result as lr_base 
inner join sample on sample.supplier_name=lr_base.sample_id
inner join stock_resource as sr on sr.id_sample_tmp=sample.id_sample_tmp
inner join long_read_qc_result as lr on lr.id_long_read_qc_result_tmp=lr_base.id_long_read_qc_result_tmp
where lr.assay_type_key='sheared_femto_fragment_size'
) as lr_sheared_femto_fragment_size on lr_sheared_femto_fragment_size.supplier_name=sample.supplier_name
inner join
(
-- post_spri_concentration
select distinct sample.supplier_name, lr.value
from long_read_qc_result as lr_base 
inner join sample on sample.supplier_name=lr_base.sample_id
inner join stock_resource as sr on sr.id_sample_tmp=sample.id_sample_tmp
inner join long_read_qc_result as lr on lr.id_long_read_qc_result_tmp=lr_base.id_long_read_qc_result_tmp
where lr.assay_type_key='post_spri_concentration'
) as lr_post_spri_concentration on lr_post_spri_concentration.supplier_name=sample.supplier_name
inner join
(
-- post_spri_volume
select distinct sample.supplier_name, lr.value
from long_read_qc_result as lr_base 
inner join sample on sample.supplier_name=lr_base.sample_id
inner join stock_resource as sr on sr.id_sample_tmp=sample.id_sample_tmp
inner join long_read_qc_result as lr on lr.id_long_read_qc_result_tmp=lr_base.id_long_read_qc_result_tmp
where lr.assay_type_key='post_spri_volume'
) as lr_post_spri_volume on lr_post_spri_volume.supplier_name=sample.supplier_name
inner join
(
-- final_nano_drop_280
select distinct sample.supplier_name, lr.value
from long_read_qc_result as lr_base 
inner join sample on sample.supplier_name=lr_base.sample_id
inner join stock_resource as sr on sr.id_sample_tmp=sample.id_sample_tmp
inner join long_read_qc_result as lr on lr.id_long_read_qc_result_tmp=lr_base.id_long_read_qc_result_tmp
where lr.assay_type_key='final_nano_drop_280'
) as lr_final_nano_drop_280 on lr_final_nano_drop_280.supplier_name=sample.supplier_name
inner join
(
-- final_nano_drop_230
select distinct sample.supplier_name, lr.value
from long_read_qc_result as lr_base 
inner join sample on sample.supplier_name=lr_base.sample_id
inner join stock_resource as sr on sr.id_sample_tmp=sample.id_sample_tmp
inner join long_read_qc_result as lr on lr.id_long_read_qc_result_tmp=lr_base.id_long_read_qc_result_tmp
where lr.assay_type_key='final_nano_drop_230'
) as lr_final_nano_drop_230 on lr_final_nano_drop_230.supplier_name=sample.supplier_name
inner join
(
-- final_nano_drop
select distinct sample.supplier_name, lr.value
from long_read_qc_result as lr_base 
inner join sample on sample.supplier_name=lr_base.sample_id
inner join stock_resource as sr on sr.id_sample_tmp=sample.id_sample_tmp
inner join long_read_qc_result as lr on lr.id_long_read_qc_result_tmp=lr_base.id_long_read_qc_result_tmp
where lr.assay_type_key='final_nano_drop'
) as lr_final_nano_drop on lr_final_nano_drop.supplier_name=sample.supplier_name
;

emrojo commented 1 year ago

Tested in UAT with proddata, migration takes less than 30 seconds to run.

emrojo commented 1 year ago

Sent email to DBA awaiting for suggestions/comments if there are any, before scheduling to prod

emrojo commented 1 year ago

From DBA (David H.):

Creating the two non-unique indexes on a live database is not a problem, because MySQL can build secondary indexes without having to restrict access to the table. However, the ALTER TABLE to change the character set of the table may cause problems if there are active transactions using that table at the time. This is because the database server thread which is handling this query must acquire an exclusive table metadata lock before it can alter the table. It will be unable to do this whilst there are any open transactions that have queried or modified the data in that table. Moreover, it will block any other clients which begin new transactions which query the table. This can lead to a situation where many clients become stalled as they are forced to wait for the ALTER TABLE client to complete its work. For this reason, we strongly discourage changes to the schema whilst there are active client connections to the database. We recommend that you take steps to ensure that no other clients are using the unified warehouse database before you run the ALTER TABLE command. Metadata locking, and how to avoid it, is described in detail in the manual: https://dev.mysql.com/doc/refman/5.7/en/metadata-locking.html <https://dev.mysql.com/doc/refman/5.7/en/metadata-locking.html>

emrojo commented 1 year ago

The ALTER TABLE will be run by DBA team on 30th Aug separately from these migration. This migration is scheduled to be run on 29th August along with https://github.com/sanger/unified_warehouse/pull/532

emrojo commented 1 year ago

The migration is not scheduled anymore as it has been suggested we wait until the Mysql 8 upgrade is complete. For the moment this story will be on hold until we can re-schedule after the upgrade.

emrojo commented 1 year ago

It will be deployed on 12 September at 14:00.

Procedure is: deploy unified warehouse the branch v1.17.3.

This is how it has been done in uat:

 ansible-playbook -i environments/uat/hosts deploy_unified_warehouse.yml -e "github_branch=v1.17.3"

sanger / unified_warehouse

DPL-642-1 Add index to long_read_qc_result to be able to join with samples #540