Y24-247 - As a GSU PM (YL) I would like to verify that the manually compiled QC data of plates from a supplier is consistent with the data already processed in Sanger's systems so that we can select samples with good data for sequencing.

TWJW-SANGER commented 1 month ago

As a GSU PM (YL) I would like to verify that the manually compiled QC data of plates from a supplier is consistent with the data already processed in Sanger's systems so that we can select samples with good data for sequencing.

Background

GSU have received additional QC data on samples that have already been received and processed through the sample ingestion process for Heron. This QC data has been manually compiled by the supplier and is known to have errors.

When we received the actual samples, information about the deep well plates should have been inserted into the MLWH lighthouse_sample table. These deep well plates may have been stamped into shallow well plates as part of the freezer space reduction cost-saving activity.

To avoid sequencing samples without correct data, RVI would like to check the consistency of the data received in the QC files against that held in SequenceScape and MLWH.

Broadly the desired process is that:

GSU supply a file containing for each sample: Root Sample ID, Assumed Deep Well Barcode, Assumed Deep Well Location, Assumed Shallow Well Barcode, Assumed Shallow Well Location
We return a file containing for each sample: Root Sample ID, Assumed Deep Well Barcode, Assumed Deep Well Location, Assumed Shallow Well Barcode, Assumed Shallow Well Location, Actual Deep Well Barcode, Actual Deep Well Location, Actual Shallow Well Barcode, Actual Shallow Well Location, Actual Match/s [True/False], Sanger Sample ID

The data on deep well plates should be in the MLWH lighthouse_sample table. The data on shallow well plates, if they exist, should be in the SequenceScape database.

Actual Match/s [True/False] is computed by comparing the Assumed data fields with the corresponding Actual data fields for both Deep and Shallow Well fields.
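As a rough illustration of the comparison described above, here is a minimal Python sketch. The dictionary keys, the `normalise` helper and the choice to report one flag per plate type are assumptions made for the example, not an agreed format; a single overall flag would simply be the `and` of the two. It also assumes barcodes and well locations are comparable once whitespace and letter case are ignored.

```python
from typing import Optional


def normalise(value: Optional[str]) -> Optional[str]:
    """Trim whitespace and upper-case; treat empty strings as missing."""
    if value is None:
        return None
    cleaned = value.strip().upper()
    return cleaned or None


def fields_match(assumed: Optional[str], actual: Optional[str]) -> bool:
    """True only when both values are present and equal after normalising."""
    a, b = normalise(assumed), normalise(actual)
    return a is not None and a == b


def actual_match(row: dict) -> dict:
    """Compare Assumed and Actual fields for deep and shallow well plates.

    The key names below are hypothetical stand-ins for the real columns.
    """
    deep = fields_match(
        row.get("assumed_dw_barcode"), row.get("actual_dw_barcode")
    ) and fields_match(
        row.get("assumed_dw_location"), row.get("actual_dw_location")
    )
    shallow = fields_match(
        row.get("assumed_sw_barcode"), row.get("actual_sw_barcode")
    ) and fields_match(
        row.get("assumed_sw_location"), row.get("actual_sw_location")
    )
    return {"deep_well_match": deep, "shallow_well_match": shallow}
```

A missing Actual value never counts as a match, so a sample whose plate cannot be found is reported as not matching.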

Sanger Sample ID is provided as a convenience for RVI.

It is expected that GSU will provide us with an input file every 2 weeks for 2 to 3 months.

Acceptance Criteria

[ ] The input file format is agreed with RVI.
[ ] The process of taking an input file and producing the output file is automated as much as possible.
[ ] Add script to psd-support-scripts ensuring documentation is provided so that any member of the team could run the process.
[ ] If Root Sample ID is not found all Actual Columns and Sanger Sample ID are NULL / blank & Actual Match is false.
[ ] If Deep or Shallow well plates are not found the corresponding Actual fields are NULL / blank.
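
The last two criteria could be sketched roughly as follows. This is only an illustration: `deep_wells`, `shallow_wells` and `sanger_ids` are hypothetical in-memory lookups standing in for the MLWH and SequenceScape queries, and the single combined "Actual Match" column is an assumption about the output format.

```python
ACTUAL_COLUMNS = [
    "Actual Deep Well Barcode",
    "Actual Deep Well Location",
    "Actual Shallow Well Barcode",
    "Actual Shallow Well Location",
]


def build_output_row(input_row, deep_wells, shallow_wells, sanger_ids):
    """Return the input row extended with Actual columns and a match flag.

    deep_wells / shallow_wells map Root Sample ID -> (barcode, location);
    sanger_ids maps Root Sample ID -> Sanger Sample ID. All three are
    hypothetical lookups used only for this sketch.
    """
    root_id = input_row["Root Sample ID"]
    out = dict(input_row)

    # Start with every Actual column blank; fill in only what is found.
    out.update({column: "" for column in ACTUAL_COLUMNS})
    out["Sanger Sample ID"] = sanger_ids.get(root_id, "")

    deep = deep_wells.get(root_id)
    shallow = shallow_wells.get(root_id)
    if deep is not None:
        out["Actual Deep Well Barcode"], out["Actual Deep Well Location"] = deep
    if shallow is not None:
        out["Actual Shallow Well Barcode"], out["Actual Shallow Well Location"] = shallow

    # A missing plate (None) never equals the assumed tuple, so the flag
    # is False whenever the Root Sample ID or either plate is not found.
    assumed_deep = (
        input_row["Assumed Deep Well Barcode"],
        input_row["Assumed Deep Well Location"],
    )
    assumed_shallow = (
        input_row["Assumed Shallow Well Barcode"],
        input_row["Assumed Shallow Well Location"],
    )
    out["Actual Match"] = deep == assumed_deep and shallow == assumed_shallow
    return out
```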

A number of ways of implementing this are available; we want the simplest in terms of effort.

Stakeholders

Ya L
Anna G

TWJW-SANGER commented 3 weeks ago

Required for 600 samples coming in October

yh4-GSU commented 3 weeks ago

Hi Neil, may I ask a question regarding this ticket? For the "Actual Shallow Well Barcode, Actual Shallow Well Location" that PSD aims to return, what is the detailed logic behind it? Is it done by inputting the Root Sample ID we provide and seeing in LIMS which shallow well plate and well that sample is sitting in? If the sample has gone through some journey/process, will the output data from PSD be the "latest" plate/well this sample is in? Best wishes, Ya-Lin

neilsycamore commented 2 weeks ago

Hi Ya-Lin, searching by the root_sample_id is a very lengthy process because the sample_description attribute where it is stored is not indexed. There are 9.6m records and each one has to be searched one by one, well by well. The 'Actual' would be either a confirmation of the root sample present in the 'assumed' shallow plate::well OR the next aliquot (child well) of the deep plate::well.
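
One generic way to soften the cost of an unindexed search is to scan the records once for the whole batch of root sample IDs rather than once per sample. The sketch below only illustrates that idea; it is not how the actual script works. `records` is a hypothetical iterable of dictionaries, and it assumes `sample_description` holds exactly the root sample ID.

```python
def index_by_root_sample_id(records, wanted_ids):
    """Single pass over records, collecting those whose sample_description
    is one of the wanted root sample IDs.

    Returns a dict mapping each wanted ID to a (possibly empty) list of
    matching records, so IDs that were not found are still reported.
    """
    wanted = set(wanted_ids)
    found = {root_id: [] for root_id in wanted}
    for record in records:
        root_id = record.get("sample_description")
        if root_id in wanted:
            found[root_id].append(record)
    return found
```

With this shape the full scan happens once per input file, and each membership test against the set of wanted IDs is a constant-time lookup.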

neilsycamore commented 2 weeks ago

Ya-Lin, I have a question. Do you need to provide us with dw AND sw data? Would supplying root_sample_id, dw_barcode and dw well location not be enough for us to report back: confirmation of dw data, sw plate:well data and sample RVI name?

yh4-GSU commented 2 weeks ago

Hi Neil,

The Root Sample ID vs dwpID/Well check is to see whether the raw data we obtained from the lighthouse lab matches the data coming through the Heron pipeline and stored in LIMS. It is required because both the raw data and the Heron platemap have been manually generated by the lighthouse lab and we have observed several mistakes.

The Root Sample ID vs swpID/Well check is to see whether the sample is sitting in the plate/well that we believe it is in, as the samples, once received, may have gone through several procedures (e.g. stamping, cherrypicking etc). This is why we'd like to provide you with the swpID/Well that we believe the sample is in and get an answer from PSD on whether this matches. If it doesn't match, we'd like to know where (swpID/Well) this sample is sitting according to LIMS. In other words, we are not interested in the child swp plate of the dwp plate but rather the "latest" location of the sample after its journey.
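
To make the "latest" location idea concrete, here is a small Python sketch that follows a sample from its deep well through each transfer until no further transfer exists. The `transfers` structure is hypothetical (it is not the LIMS data model), and picking the most recent transfer at each step is an assumption about what "latest" should mean when a well has been transferred more than once.

```python
def latest_location(start, transfers):
    """Follow transfers from a (barcode, well) pair to its latest descendant.

    transfers maps a source (barcode, well) to a list of
    (order, (barcode, well)) tuples, where a larger order means a more
    recent transfer. This layout is an assumption for the sketch.
    """
    current = start
    seen = {start}
    while True:
        children = transfers.get(current, [])
        if not children:
            # No onward transfer recorded: this is the latest location.
            return current
        # Assumption: when a well was transferred several times,
        # follow the most recent transfer.
        _, next_location = max(children, key=lambda child: child[0])
        if next_location in seen:
            raise ValueError(f"Transfer loop detected at {next_location}")
        seen.add(next_location)
        current = next_location
```

For a sample stamped from a deep well plate into a shallow well plate and then cherrypicked twice, this walk ends on the second cherrypick plate rather than on the first child of the deep well plate.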

Hope I explain everything properly 😅.

We understand searching by the root sample ID would be a lengthy process (which we would never achieve by ourselves). We are hoping that once a script is written to do this job automatically, it will save us all time and effort 🤞. Thanks a million.

Best wishes, Ya-Lin

neilsycamore commented 2 weeks ago

Hi Ya-Lin Thank you for the above. Attached is a first attempt generated from the prod_7 data from RT807257 which was in the format root_sample_id, dw barcode, sw barcode, dw position. This data has the added benefit that the samples have been cherrypicked several times so we can see where the 'latest' location of the sample is.

Y24-247_RVI_sample_data_prod_7.xlsx

Let me know your thoughts please

yh4-GSU commented 1 week ago

Hi Neil,

Thanks a lot for your hard work. The outcome file is looking good. As you mentioned, this batch of samples is useful for your test as they've been through 2 rounds of cherrypicking. I can see your outcome for the last plate/well/platetype catches the final CP plate! 👍

There's just one thing: samples shown as "Well empty" in SW sample name (column I) were not included in the further SW checks (columns L-R). Is it possible to mark these "Well empty" samples as "SW match NO" and include them in the downstream checks?

The format/info for the outcome spreadsheet as it is now is already great. If you wish to reduce the output info/columns, we can discuss further as well.

Best wishes, Ya-Lin

sanger / sequencescape

Y24-247 - As a GSU PM (YL) I would like to verify that the manually compiled QC data of plates from a supplier is consistent with the data already processed in Sanger's systems so that we can select samples with good data for sequencing. #4272