uclahs-cds / pipeline-call-sSV

Nextflow pipeline to call somatic structural variants using DELLY and Manta
https://uclahs-cds.github.io/pipeline-call-sSV/
GNU General Public License v2.0
1 stars 0 forks source link

Regenotype-sSV is computationally infeasible #7

Closed yupan-ucla closed 3 years ago

yupan-ucla commented 3 years ago

Describe the issue Hi @tyamaguchi-ucla, @jmlivingstone, and @rhughwhite,

Somatic SV calling has 4 steps. Step1: At least one tumor sample and a matched control sample are required for SV discovery delly call -x hg19.excl -o t1.bcf -g hg19.fa tumor1.bam control1.bam Step2: Somatic pre-filtering requires a tab-delimited sample description file where the first column is the sample id (as in the VCF/BCF file) and the second column is either tumor or control. delly filter -f somatic -o t1.pre.bcf -s samples.tsv t1.bcf Step3: Genotype pre-filtered somatic sites across a larger panel of control samples to efficiently filter false postives and germline SVs. For performance reasons, this can be run in parallel for each sample of the control panel and you may want to combine multiple pre-filtered somatic site lists from multiple tumor samples. delly call -g hg19.fa -v t1.pre.bcf -o geno.bcf -x hg19.excl tumor1.bam control1.bam ... controlN.bam Step4: Post-filter for somatic SVs using all control samples. delly filter -f somatic -o t1.somatic.bcf -s samples.tsv geno.bcf

For Rupert’s test case (1 N/T pair and 14 control samples), the above 4 steps took 3d 13h 22m 56s. task_id hash native_id name status exit submit duration realtime %cpu peak_rss peak_vmem rchar wchar 1 fd/161142 24957 validate_file (2) COMPLETED 0 2021-07-30 17:33:19.065 15.3s 91ms 69.7% 9 MB 16 MB 1016.1 KB 228 B 2 58/a8ec5a 24959 validate_file (1) COMPLETED 0 2021-07-30 17:33:19.082 15.3s 51ms 112.2% 9.4 MB 16.3 MB 1016.1 KB 236 B 3 13/d2aa59 72282 call_sSV_Delly (1) COMPLETED 0 2021-08-10 22:58:45.140 22h 38m 32s 22h 38m 23s 99.0% 4.4 GB 4.6 GB 551.5 GB 6 MB 4 e9/333299 39600 query_sample_name_Bcftools (1) COMPLETED 0 2021-08-11 21:37:17.070 2.5s 934ms 94.0% 62.6 MB 4.4 GB 139.1 KB 339 B 5 66/b0d0b0 39933 filter_sSV_Delly_initialCall (1) COMPLETED 0 2021-08-11 21:37:19.629 2.5s 243ms 97.5% 9.3 MB 18.6 MB 6.2 MB 50 KB 6 3b/168fe8 40178 regenotype_sSV_Delly (1) COMPLETED 0 2021-08-11 21:37:22.395 2d 14h 44m 1s 2d 14h 43m 51s 89.8% 10.6 GB 11 GB 2.5 TB 92.4 KB 7 91/e03931 36972 filter_sSV_Delly_regenotyped (1) COMPLETED 0 2021-08-14 12:21:23.309 2s 74ms 78.5% 9.5 MB 19.6 MB 316.7 KB 91.4 KB 8 48/209997 37207 generate_sha512 (1) COMPLETED 0 2021-08-14 12:21:25.378 13.3s 61ms 96.0% 8.7 MB 15.7 MB 1.1 MB 452 B

To speed up the 3rd step, I thought this step probably could be decomposed to

            delly call -g hg19.fa -v t1.pre.bcf -o geno.bcf -x hg19.excl tumor1.bam control1.bam
            delly call -g hg19.fa -v t1.pre.bcf -o geno.bcf -x hg19.excl tumor1.bam control2.bam
            …
            delly call -g hg19.fa -v t1.pre.bcf -o geno.bcf -x hg19.excl tumor1.bam controlN.bam

Then we can merge the outputs of the above processes. However, suppose we need to process n N/T pairs, then the 3rd step will incur n**2 processes, which would be computationally infeasible.

As Taka suggested, maybe we can use Julie’s recSV to replace the 3rd step. @jmlivingstone, could you provide some details on how you implemented recSV in perl?

Also, I compared the outputs of the 2nd and 4th steps of this run, they look same. Below are the details.

ls -l /hot/pipelines/development/slurm/call-sSV/output_rupert2/delly-0.8.7
-rw-r--r--. 1 yupan domain users   48777 Aug 11 21:37 filtered_somatic_SingleCtrlSample.bcf
-rw-r--r--. 1 yupan domain users   90284 Aug 14 12:21 filtered_somatic_AllCtrlSamples.bcf
docker run -it -v /hot/pipelines/development/slurm/call-sSV/output_rupert2/delly-0.8.7:/tmp continuumio/miniconda /bin/bash

# run the following commands inside Docker
conda install -y -c bioconda vcftools
conda install -y -c bioconda bcftools
conda install -y -c bioconda bedtools
conda install -y -c bioconda perl-vcftools-vcf
conda install -y -c bioconda tabix
conda install -y -c bioconda snpsift

# total number of SNPs
bcftools view /tmp/filtered_somatic_SingleCtrlSample.bcf | grep -v "^#" | wc -l
157
bcftools view /tmp/filtered_somatic_AllCtrlSamples.bcf | grep -v "^#" | wc -l
157

# total number of unique positions, indicating that several sites have two or more alternate alleles
bcftools view /tmp/filtered_somatic_SingleCtrlSample.bcf | grep -v "^#" | cut -f2 | sort -u | wc -l
157
bcftools view /tmp/filtered_somatic_AllCtrlSamples.bcf | grep -v "^#" | cut -f2 | sort -u | wc -l
157

# distribution of ref vs. alt alleles
# notice the single dinucleotide change TG -> CG and an unnormalised variant AT -> AC
bcftools view /tmp/filtered_somatic_SingleCtrlSample.bcf | grep -v "^#" | cut -f4,5 | sort | uniq -c | sort -k1rn

bcftools view /tmp/filtered_somatic_AllCtrlSamples.bcf | grep -v "^#" | cut -f4,5 | sort | uniq -c | sort -k1rn

bcftools view /tmp/filtered_somatic_SingleCtrlSample.bcf > /tmp/filtered_somatic_SingleCtrlSample.vcf
bcftools view /tmp/filtered_somatic_AllCtrlSamples.bcf > /tmp/filtered_somatic_AllCtrlSamples.vcf

bgzip /tmp/filtered_somatic_SingleCtrlSample.vcf > /tmp/filtered_somatic_SingleCtrlSample.vcf.gz
bgzip /tmp/filtered_somatic_AllCtrlSamples.vcf > /tmp/filtered_somatic_AllCtrlSamples.vcf.gz

tabix /tmp/filtered_somatic_SingleCtrlSample.vcf.gz
tabix /tmp/filtered_somatic_AllCtrlSamples.vcf.gz

vcf-compare /tmp/filtered_somatic_SingleCtrlSample.vcf.gz /tmp/filtered_somatic_AllCtrlSamples.vcf.gz
# This file was generated by vcf-compare.
# The command line was: vcf-compare(v0.1.14-12-gcdb80b8) /tmp/filtered_somatic_SingleCtrlSample.vcf.gz /tmp/filtered_somatic_AllCtrlSamples.vcf.gz
#
#VN 'Venn-Diagram Numbers'. Use `grep ^VN | cut -f 2-` to extract this part.
#VN The columns are:
#VN        1  .. number of sites unique to this particular combination of files
#VN        2- .. combination of files and space-separated number, a fraction of sites in the file
VN          157         /tmp/filtered_somatic_AllCtrlSamples.vcf.gz (100.0%)                /tmp/filtered_somatic_SingleCtrlSample.vcf.gz (100.0%)
#SN Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN           Number of REF matches:               157
SN           Number of ALT matches:               157
SN           Number of REF mismatches:        0
SN           Number of ALT mismatches:        0
SN           Number of samples in GT comparison:    0

Pls let me know if I missed something here. Thanks.

To Reproduce Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior A clear and concise description of what you expected to happen.

Screenshots If applicable, add screenshots to help explain your problem.

Additional context Add any other context about the problem here.

tyamaguchi-ucla commented 3 years ago

(base) [tyamaguchi@ip-0A125212 delly-0.8.7]$ bcftools view filtered_somatic_SingleCtrlSample.bcf | grep -P "\tPASS\t" | wc -l 87 (base) [tyamaguchi@ip-0A125212 delly-0.8.7]$ bcftools view filtered_somatic_AllCtrlSamples.bcf | grep -P "\tPASS\t" | wc -l 87

Both files have 87 PASS-ed variants so it looks like Step3 didn't find any FPs here in Step2 although it added some additional info. We probably want to drop Step3 (and 4) anyway because it will be computationally quite heavy as # sample grows. RecSV can be used instead. Thoughts? @jmlivingstone @rhughwhite

jmlivingstone commented 3 years ago

Here is the code for RecSV https://github.com/uclahs-cds/BoutrosLab/tree/master/Resources/code/perl/NGS-Tools-RecSV Specfically the filtering module is here https://github.com/uclahs-cds/BoutrosLab/blob/master/Resources/code/perl/NGS-Tools-RecSV/trunk/lib/NGS/Tools/RecSV/Filter.pm

In the previous pipeline, we would call SVs in the control samples (seperate from the tumour samples) and then combine these SVs into a control panel. We would then filter the tumour samples against this panel (NGS::Tools::RecSV::Filter->filter_against_consolidated_normals). RecSV also used a module in RecSNV to filter against 'red and green' lists.

yupan-ucla commented 3 years ago

@jmlivingstone, Thank you for the info.

If I understand you correctly, we can re-implement the call-sSV as below:

Step1: for each control sample, call SVs.

delly call -x hg19.excl -o c1.bcf -g hg19.fa control1.bam
delly call -x hg19.excl -o c2.bcf -g hg19.fa control2.bam
…
delly call -x hg19.excl -o cN.bcf -g hg19.fa controlN.bam

Step2: merge the step1’s outputs into a control panel

delly merge -o sites.bcf c1.bcf c2.bcf ... cN.bcf

Step3: for one tumor sample, use the control panel to call/filter Should we run A or B?

A: delly call -g hg19.fa -v sites.bcf -o t1.geno.bcf -x hg19.excl T1.bam
B: delly call -g hg19.fa -v sites.bcf -o t1.geno.bcf -x hg19.excl T1.bam control1.bam

Maybe A is enough, since sites.bcf already contains the SV sites from control1.bam.

Step4: Post-filter for somatic SVs.

delly filter -f somatic -o t1.somatic.bcf -s samples.tsv t1.geno.bcf

Does the above workflow make sense?

BTW, you mentioned

RecSV also used a module in RecSNV to filter against 'red and green' lists.

What are the red and green lists? Shall I implement this in the new workflow? Thanks.

jmlivingstone commented 3 years ago

Hi Pan,

I think what you are proposing essentially does what the old RecSV did. As for Step 3, it's probably a good idea to see if A vs B changes the results. I don't know enough about how delly works anymore to make a suggestion. It has been updated a lot since we first started running it (ie there wasn't a filter command which is why recSV was created).

Red and green lists (also known as whitelists and blacklists - not sure if the lab has decided on which term to use @pboutros @tyamaguchi-ucla) are known SVs to include or exclude. This includes germline SVs from the 1K genomes project. Here is an example config of what we used to use /hot/svn/BoutrosLab/Resources/code/perl/NGS-Tools-RecSV/trunk/share/blacklists-and-whitelists.yaml

pboutros commented 3 years ago

Red and green lists (also known as whitelists and blacklists - not sure if the lab has decided on which term to use @pboutros @tyamaguchi-ucla) are known SVs to include or exclude. This includes germline SVs from the 1K genomes project.

Preferred terminology is now "allowlist" and "denylist" -- with the idea that assigning "values" to colours is always ambiguous and unnecessary