Closed yupan-ucla closed 3 years ago
(base) [tyamaguchi@ip-0A125212 delly-0.8.7]$ bcftools view filtered_somatic_SingleCtrlSample.bcf | grep -P "\tPASS\t" | wc -l
87
(base) [tyamaguchi@ip-0A125212 delly-0.8.7]$ bcftools view filtered_somatic_AllCtrlSamples.bcf | grep -P "\tPASS\t" | wc -l
87
Both files have 87 PASS-ed variants so it looks like Step3 didn't find any FPs here in Step2 although it added some additional info. We probably want to drop Step3 (and 4) anyway because it will be computationally quite heavy as # sample grows. RecSV can be used instead. Thoughts? @jmlivingstone @rhughwhite
Here is the code for RecSV https://github.com/uclahs-cds/BoutrosLab/tree/master/Resources/code/perl/NGS-Tools-RecSV Specfically the filtering module is here https://github.com/uclahs-cds/BoutrosLab/blob/master/Resources/code/perl/NGS-Tools-RecSV/trunk/lib/NGS/Tools/RecSV/Filter.pm
In the previous pipeline, we would call SVs in the control samples (seperate from the tumour samples) and then combine these SVs into a control panel. We would then filter the tumour samples against this panel (NGS::Tools::RecSV::Filter->filter_against_consolidated_normals). RecSV also used a module in RecSNV to filter against 'red and green' lists.
@jmlivingstone, Thank you for the info.
If I understand you correctly, we can re-implement the call-sSV as below:
Step1: for each control sample, call SVs.
delly call -x hg19.excl -o c1.bcf -g hg19.fa control1.bam
delly call -x hg19.excl -o c2.bcf -g hg19.fa control2.bam
…
delly call -x hg19.excl -o cN.bcf -g hg19.fa controlN.bam
Step2: merge the step1’s outputs into a control panel
delly merge -o sites.bcf c1.bcf c2.bcf ... cN.bcf
Step3: for one tumor sample, use the control panel to call/filter Should we run A or B?
A: delly call -g hg19.fa -v sites.bcf -o t1.geno.bcf -x hg19.excl T1.bam
B: delly call -g hg19.fa -v sites.bcf -o t1.geno.bcf -x hg19.excl T1.bam control1.bam
Maybe A is enough, since sites.bcf already contains the SV sites from control1.bam.
Step4: Post-filter for somatic SVs.
delly filter -f somatic -o t1.somatic.bcf -s samples.tsv t1.geno.bcf
Does the above workflow make sense?
BTW, you mentioned
RecSV also used a module in RecSNV to filter against 'red and green' lists.
What are the red and green lists? Shall I implement this in the new workflow? Thanks.
Hi Pan,
I think what you are proposing essentially does what the old RecSV did. As for Step 3, it's probably a good idea to see if A vs B changes the results. I don't know enough about how delly works anymore to make a suggestion. It has been updated a lot since we first started running it (ie there wasn't a filter command which is why recSV was created).
Red and green lists (also known as whitelists and blacklists - not sure if the lab has decided on which term to use @pboutros @tyamaguchi-ucla) are known SVs to include or exclude. This includes germline SVs from the 1K genomes project. Here is an example config of what we used to use /hot/svn/BoutrosLab/Resources/code/perl/NGS-Tools-RecSV/trunk/share/blacklists-and-whitelists.yaml
Red and green lists (also known as whitelists and blacklists - not sure if the lab has decided on which term to use @pboutros @tyamaguchi-ucla) are known SVs to include or exclude. This includes germline SVs from the 1K genomes project.
Preferred terminology is now "allowlist" and "denylist" -- with the idea that assigning "values" to colours is always ambiguous and unnecessary
Describe the issue Hi @tyamaguchi-ucla, @jmlivingstone, and @rhughwhite,
Somatic SV calling has 4 steps. Step1: At least one tumor sample and a matched control sample are required for SV discovery delly call -x hg19.excl -o t1.bcf -g hg19.fa tumor1.bam control1.bam Step2: Somatic pre-filtering requires a tab-delimited sample description file where the first column is the sample id (as in the VCF/BCF file) and the second column is either tumor or control. delly filter -f somatic -o t1.pre.bcf -s samples.tsv t1.bcf Step3: Genotype pre-filtered somatic sites across a larger panel of control samples to efficiently filter false postives and germline SVs. For performance reasons, this can be run in parallel for each sample of the control panel and you may want to combine multiple pre-filtered somatic site lists from multiple tumor samples. delly call -g hg19.fa -v t1.pre.bcf -o geno.bcf -x hg19.excl tumor1.bam control1.bam ... controlN.bam Step4: Post-filter for somatic SVs using all control samples. delly filter -f somatic -o t1.somatic.bcf -s samples.tsv geno.bcf
For Rupert’s test case (1 N/T pair and 14 control samples), the above 4 steps took 3d 13h 22m 56s. task_id hash native_id name status exit submit duration realtime %cpu peak_rss peak_vmem rchar wchar 1 fd/161142 24957 validate_file (2) COMPLETED 0 2021-07-30 17:33:19.065 15.3s 91ms 69.7% 9 MB 16 MB 1016.1 KB 228 B 2 58/a8ec5a 24959 validate_file (1) COMPLETED 0 2021-07-30 17:33:19.082 15.3s 51ms 112.2% 9.4 MB 16.3 MB 1016.1 KB 236 B 3 13/d2aa59 72282 call_sSV_Delly (1) COMPLETED 0 2021-08-10 22:58:45.140 22h 38m 32s 22h 38m 23s 99.0% 4.4 GB 4.6 GB 551.5 GB 6 MB 4 e9/333299 39600 query_sample_name_Bcftools (1) COMPLETED 0 2021-08-11 21:37:17.070 2.5s 934ms 94.0% 62.6 MB 4.4 GB 139.1 KB 339 B 5 66/b0d0b0 39933 filter_sSV_Delly_initialCall (1) COMPLETED 0 2021-08-11 21:37:19.629 2.5s 243ms 97.5% 9.3 MB 18.6 MB 6.2 MB 50 KB 6 3b/168fe8 40178 regenotype_sSV_Delly (1) COMPLETED 0 2021-08-11 21:37:22.395 2d 14h 44m 1s 2d 14h 43m 51s 89.8% 10.6 GB 11 GB 2.5 TB 92.4 KB 7 91/e03931 36972 filter_sSV_Delly_regenotyped (1) COMPLETED 0 2021-08-14 12:21:23.309 2s 74ms 78.5% 9.5 MB 19.6 MB 316.7 KB 91.4 KB 8 48/209997 37207 generate_sha512 (1) COMPLETED 0 2021-08-14 12:21:25.378 13.3s 61ms 96.0% 8.7 MB 15.7 MB 1.1 MB 452 B
To speed up the 3rd step, I thought this step probably could be decomposed to
Then we can merge the outputs of the above processes. However, suppose we need to process n N/T pairs, then the 3rd step will incur n**2 processes, which would be computationally infeasible.
As Taka suggested, maybe we can use Julie’s recSV to replace the 3rd step. @jmlivingstone, could you provide some details on how you implemented recSV in perl?
Also, I compared the outputs of the 2nd and 4th steps of this run, they look same. Below are the details.
Pls let me know if I missed something here. Thanks.
To Reproduce Steps to reproduce the behavior:
Expected behavior A clear and concise description of what you expected to happen.
Screenshots If applicable, add screenshots to help explain your problem.
Additional context Add any other context about the problem here.