sanger / unified_warehouse

MIT License
2 stars 9 forks source link

DPL-1096 As TL (Tom) I would like to have a list of samples which have differing sample metaData in Warehouse2 compared to the MultiLIMS Warehouse database so that we can ensure data consistency when turning off Warehouse 2. #594

Closed TWJW-SANGER closed 1 month ago

TWJW-SANGER commented 6 months ago

Description As TL (Tom) I would like to have a list of samples which have differing sample metaData in Warehouse2 compared to the MultiLIMS Warehouse database so that we can ensure data consistency when turning off Warehouse 2.

Who the primary contacts are for this work Tom W Neil S

Knowledge or Stake holders Tom W Neil S David H (DBA team)

Additional context or information I would like a list of samples that exist in both Warehouse2 and MLWH samples table with differing sample metadata, together with the differences.

Metadata is stored in the sample table of MLWH in the set of columns listed below. I am unsure where the metadata is in Warehouse 2.

Rough set of tasks for doing this:

List of metadata columns in sample table in MLWH for comparison:

reference_genome organism accession_number common_name description taxon_id father mother replicate ethnicity gender cohort country_of_origin geographical_region sanger_sample_id control supplier_name public_name sample_visibility strain consent_withdrawn donor_id phenotype developmental_stage control_type sibling is_resubmitted date_of_sample_collection date_of_sample_extraction extraction_method purified purification_method customer_measured_concentration concentration_determined_by sample_type storage_conditions genotype age cell_type disease_state compound dose immunoprecipitate growth_condition organism_part time_point disease subject treatment date_of_consent_withdrawn marked_as_consent_withdrawn_by customer_measured_volume gc_content dna_source priority_level

neilsycamore commented 6 months ago

In warehouse2 (WH2) the sample and sample metadata are stored together in the current_samples table. More for context: In WH2 there are current tables (single row per id) and tables with multiple date stamped entries in this instance the samples table. Data stored in the current_samples table:

uuid binary(16) NOT NULL, internal_id int(11) NOT NULL, name varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, reference_genome varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, organism varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, accession_number varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL, common_name varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, description text COLLATE utf8_unicode_ci, taxon_id varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, father varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, mother varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, replicate varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, ethnicity varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, gender varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL, cohort varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, country_of_origin varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, geographical_region varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, is_current tinyint(1) NOT NULL, checked_at datetime NOT NULL, last_updated datetime DEFAULT NULL, created datetime DEFAULT NULL, sanger_sample_id varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, control tinyint(1) DEFAULT NULL, empty_supplier_sample_name tinyint(1) DEFAULT NULL, supplier_name varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, public_name varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, sample_visibility varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, strain varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, updated_by_manifest tinyint(1) DEFAULT NULL, inserted_at datetime DEFAULT NULL, deleted_at datetime DEFAULT NULL, current_from datetime NOT NULL, current_to datetime DEFAULT NULL, consent_withdrawn tinyint(1) NOT NULL DEFAULT '0', donor_id varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, UNIQUE KEY internal_id_idx (internal_id), UNIQUE KEY uuid_idx (uuid), KEY accession_number_idx (accession_number), KEY name_idx (name), KEY sanger_sample_id_idx (sanger_sample_id)

TWJW-SANGER commented 1 month ago

Thanks. I have gone back through my emails.

ToL reported the attached samples as having conflicting sample metadata between MLWH and Warehouse2 - I believe in the sense that Warehouse2 had a metadata field populated and MLWH did not.

Could you check if the samples they flag are consistent with your results?

Thanks.

query_examples.txt samples_with_missing_MLWH_data_tol.txt

sabrine33 commented 1 month ago

Thanks Thom,

Question : What is the main key between both tables? Is it the UUID or the name? I see different results when I join the tables by name compared to when I join them by UUID. In the previous work, I used the UUID as the joining key between tables.

Kind regards, Sabrine

From: Thomas Whiteley @.> Reply to: sanger/unified_warehouse @.> Date: Thursday 4 July 2024 at 14:57 To: sanger/unified_warehouse @.> Cc: Sabrine Akaichi Ep Alibi @.>, Assign @.***> Subject: Re: [sanger/unified_warehouse] DPL-1096 As TL (Tom) I would like to have a list of samples which have differing sample metaData in Warehouse2 compared to the MultiLIMS Warehouse database so that we can ensure data consistency when turning off Warehouse...

Thanks. I have gone back through my emails.

ToL reported the attached samples as having conflicting sample metadata between MLWH and Warehouse2 - I believe in the sense that Warehouse2 had a metadata field populated and MLWH did not.

Could you check if the samples they flag are consistent with your results?

Thanks.

query_examples.txt [github.com]https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_user-2Dattachments_files_16099761_query-5Fexamples.txt&d=DwMFaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=bWLZOqXwy59TuxH3g1YO38og5Y2aOL1Alb4f1wk6CTU&m=CQXWLhxLKyr8KvbKnMfd-gg8XJfAvrpSqnsxomQv1N8EQ5b-b9EzEBGMNwIMmsoH&s=kpifan3BQl1rMHde6Sx3QMdF1hmsBkyb8QjiLjH6xE0&e= samples_with_missing_MLWH_data_tol.txt [github.com]https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_user-2Dattachments_files_16099762_samples-5Fwith-5Fmissing-5FMLWH-5Fdata-5Ftol.txt&d=DwMFaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=bWLZOqXwy59TuxH3g1YO38og5Y2aOL1Alb4f1wk6CTU&m=CQXWLhxLKyr8KvbKnMfd-gg8XJfAvrpSqnsxomQv1N8EQ5b-b9EzEBGMNwIMmsoH&s=SoTR6UqGpRG1OtYDgXUPr3-KZbdYmz1XKxprTACxDmw&e=

— Reply to this email directly, view it on GitHub [github.com]https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_sanger_unified-5Fwarehouse_issues_594-23issuecomment-2D2209062431&d=DwMFaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=bWLZOqXwy59TuxH3g1YO38og5Y2aOL1Alb4f1wk6CTU&m=CQXWLhxLKyr8KvbKnMfd-gg8XJfAvrpSqnsxomQv1N8EQ5b-b9EzEBGMNwIMmsoH&s=Mai5sXZVl7CeHoyhYw0iv4wT89PnVyD00wGeA8GKyyw&e=, or unsubscribe [github.com]https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_BBFOZWWERGIFUWZXII2HYNLZKVIDVAVCNFSM6AAAAABDGI2HVWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMBZGA3DENBTGE&d=DwMFaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=bWLZOqXwy59TuxH3g1YO38og5Y2aOL1Alb4f1wk6CTU&m=CQXWLhxLKyr8KvbKnMfd-gg8XJfAvrpSqnsxomQv1N8EQ5b-b9EzEBGMNwIMmsoH&s=BGcxoUJPd16Ag42Gmdn71kEV314HjThEDY3U2xZbJyU&e=. You are receiving this because you were assigned.Message ID: @.***>


The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA.

sabrine33 commented 1 month ago

Some metadata fields are missing from MWH2:

  • phenotype
  • developmental_stage
  • control_type
  • sibling
  • is_resubmitted
  • date_of_sample_collection
  • date_of_sample_extraction
  • extraction_method
  • purified
  • purification_method
  • customer_measured_concentration
  • concentration_determined_by
  • sample_type
  • storage_conditions
  • genotype
  • age
  • cell_type
  • disease_state
  • compound
  • dose
  • immunoprecipitate
  • growth_condition
  • organism_part
  • time_point
  • disease
  • subject
  • treatment
  • date_of_consent_withdrawn
  • marked_as_consent_withdrawn_by
  • customer_measured_volume
  • gc_content
  • dna_source
  • priority_level

Common metadata:

reference_genome organism accession_number common_name description taxon_id father mother replicate ethnicity gender cohort country_of_origin geographical_region sanger_sample_id control supplier_name public_name sample_visibility strain consent_withdrawn donor_id