There are some bugs in the SomalierPedigree QC stage and the way that relatedness QC flags are reported in the slack notifications.
How it works
Somalier Extract
Somalier extract downloads each sequencing group's CRAM (and index) and then runs the somalier extract command, specifying the sites file from references - /references/somalier/sites.hg38.vcf.gz (originally from the somalier repo). This creates the Somalier "fingerprint" file - e.g. CPG12345.cram.somalier.
Somalier Relate
The way the stage works is by fetching all somalier fingerprint files for the input sequencing groups and then running Somalier Relate over all pairwise combinations of input sequencing groups within the same dataset.
Makes use of a pedigree file, taken from Metamist via the write_ped_file method from the Dataset pipelines target. The pedigree file only contains the input sequencing groups.
Uses the verify-bamid files created by the CramQC stage. Each sequencing group's verify-bamid file is read and the FREEMIX percentage is extracted from the bottom of the file (Last line, 7th column via FREEMIX=cat verify-bamid.file | tail -n1 | cut -f7)
The FREEMIX value is compared with the hardcoded MAX_FREEMIX (= 0.04). If the sequencing group's freemix percentage is less than the maximum, the sequencing group's somalier fingerprint file is added to the list of inputs to Somalier Relate, and the sequencing group ID is added to the sequencing group IDs file.
Once all inputs sequencing groups have been decided (via the FREEMIX check above), the somalier relate command is invoked. This generates several outputs:
{output_prefix}.samples.tsv - a .ped like file with extra QC columns
Saved to gs://cpg-dataset-main/somalier/cram/<SG_HASH>/dataset.samples.tsv
{output_prefix}.pairs.tsv - IBS and other relatedness stats for all possible sample pairs
Saved to gs://cpg-dataset-main/somalier/cram/<SG_HASH>/dataset.pairs.tsv
{output_prefix}.html - interactive html
Saved to gs://cpg-dataset-main-web/cram-somalier-pedigree.html
Check Pedigree script
These results are used by the check_pedigree.py script from /cpg_workflows/python_scripts/. This script looks through the relatedness information produced by Somalier Relate, and formulates a Slack message to post to a private channel.
Read the somalier samples.tsv and pairs.tsv into pandas DataFrames.
Read the pedigree from Metamist and the somalier pedigree samples.tsv into expected_ped and inferred_ped respectively using peddy.Ped (from the creator of Somalier)
Check each expected vs inferred sex value and report on any mismatches or missing values
Iterate through the pairs of samples in each pedigree to find mismatches
Get the expected relationship between a pair of samples using peddy.Ped.relation(expected_ped_sample1, expected_ped_sample2)
Do the Same for the inferred relationship between the same pair of samples using the inferred pedigree
Ped.relation will return a string value for the type of relationship between the individuals. e.g. "mom-dad" if two samples have the same children. e.g. "unrelated" if the family ID for each sample is different.
If samples are "unrelated" but the relatedness score is > 0.1 (from pairs.tsv), then flag this as an inferred relationship where none is expected.
Formulate a report with all the flagged issues and send it to Slack.
There are some bugs in the SomalierPedigree QC stage and the way that relatedness QC flags are reported in the slack notifications.
How it works
Somalier Extract
Somalier extract downloads each sequencing group's CRAM (and index) and then runs the
somalier extract
command, specifying the sites file from references -/references/somalier/sites.hg38.vcf.gz
(originally from thesomalier
repo). This creates the Somalier "fingerprint" file - e.g.CPG12345.cram.somalier
.Somalier Relate
The way the stage works is by fetching all somalier fingerprint files for the input sequencing groups and then running Somalier Relate over all pairwise combinations of input sequencing groups within the same dataset.
write_ped_file
method from theDataset
pipelines target. The pedigree file only contains the input sequencing groups.verify-bamid
files created by theCramQC
stage. Each sequencing group'sverify-bamid
file is read and theFREEMIX
percentage is extracted from the bottom of the file (Last line, 7th column viaFREEMIX=cat verify-bamid.file | tail -n1 | cut -f7
)FREEMIX
value is compared with the hardcodedMAX_FREEMIX
(= 0.04
). If the sequencing group's freemix percentage is less than the maximum, the sequencing group's somalier fingerprint file is added to the list of inputs to Somalier Relate, and the sequencing group ID is added to the sequencing group IDs file.Once all inputs sequencing groups have been decided (via the FREEMIX check above), the
somalier relate
command is invoked. This generates several outputs:{output_prefix}.samples.tsv
- a .ped like file with extra QC columnsgs://cpg-dataset-main/somalier/cram/<SG_HASH>/dataset.samples.tsv
{output_prefix}.pairs.tsv
- IBS and other relatedness stats for all possible sample pairsgs://cpg-dataset-main/somalier/cram/<SG_HASH>/dataset.pairs.tsv
{output_prefix}.html
- interactive htmlgs://cpg-dataset-main-web/cram-somalier-pedigree.html
Check Pedigree script
These results are used by the
check_pedigree.py
script from/cpg_workflows/python_scripts/
. This script looks through the relatedness information produced by Somalier Relate, and formulates a Slack message to post to a private channel.samples.tsv
andpairs.tsv
into pandas DataFrames.samples.tsv
intoexpected_ped
andinferred_ped
respectively usingpeddy.Ped
(from the creator of Somalier)peddy.Ped.relation(expected_ped_sample1, expected_ped_sample2)
Ped.relation
will return a string value for the type of relationship between the individuals. e.g."mom-dad"
if two samples have the same children. e.g."unrelated"
if the family ID for each sample is different.pairs.tsv
), then flag this as an inferred relationship where none is expected.What's not working
Unnecessary Slack message warnings of note:
Both of these situations should be valid and not cause errors (I think).
Here is an example Slack post containing these warnings: https://centrepopgen.slack.com/archives/C03BHD3EKH9/p1718751311646699