shishenyxx / DeepMosaic

DeepMosaic is a deep-learning-based mosaic single nucleotide classification tool without the need of matched control information.
https://www.nature.com/articles/s41587-022-01559-w

Recommendations for calling candidate variants #2

Closed wdecoster closed 3 years ago

wdecoster commented 3 years ago

Hi,

This looks like a great tool and I'd love to get started with it on some of our datasets. Do you have best-practice recommendations for calling candidate variants before running your tool?

Thanks, Wouter

shishenyxx commented 3 years ago

Hi Wouter,

Thank you for your interest in DeepMosaic. If you are running it from raw FASTQs, we recommend following this pipeline (adapted from the GATK best practices) for data pre-processing, and this pipeline for MuTect2 single mode. Then follow the user manual to proceed with DeepMosaic. The current parameters are optimized for WGS, and performance is estimated in this preprint.

If you are starting from BAMs, make sure you have done indel realignment and BQSR following the GATK best practices. Alternatively, you can follow the first part of this pipeline, which is also available on GitHub; its performance is also estimated in the preprint. That pipeline is benchmarked on data from both WGS and WES.

Good luck, and tell me if you have any further questions!

Best,

Xiaoxu

wdecoster commented 3 years ago

Hi Xiaoxu,

Thanks for the comprehensive answer. We are working with exome sequencing, so I'm also eager to learn about your findings there, as asked in https://github.com/Virginiaxu/DeepMosaic/issues/1. I'll have a look at the suggested pipelines!

Wouter

shishenyxx commented 2 years ago

Hi Wouter,

Just an update: per our assessment, DeepMosaic (the current genome version) doubled the validation rate compared with our previous exome mosaic calling pipelines. The rate is still only 40%, but we think that already makes it a much more convincing approach.

Best,

Xiaoxu

wdecoster commented 2 years ago

Hi Xiaoxu,

Thanks for following up, that sounds really promising.

Regards, Wouter

astulaaa commented 1 year ago

Dear Wouter,

This thread is very helpful; however, I still have questions. In GATK, it appears that different thresholds apply for variant calling in gVCF and VCF modes, especially for the "-stand_call_conf" parameter: the default is 0 for gVCF and 10 for VCF. Would you recommend setting it to 0 for VCF as well? Is there any recommendation for this or any other variant quality threshold? (I am thinking of skipping the gVCF step because it does not seem relevant for using this tool.) Specifically, I mean this parameter: https://github.com/broadinstitute/gatk-docs/blob/master/blog-2012-to-2019/2016-12-12-Version_highlights_for_GATK_version_3.7.md

Thank you, Asta

shishenyxx commented 1 year ago

Hi Asta,

Thank you for reaching out and thank you for your interest in DeepMosaic! I believe you are talking about the input VCF for DeepMosaic, generated by GATK MuTect2 single mode or GATK HaplotypeCaller with ploidy set to 50 or 100. The files generated should be VCFs. The purpose of generating a gVCF, by contrast, is to genotype the variants that are called across all samples and see the genotype of each variant (most of which are actually the SNPs you want to compare/genotype) across different individuals. Pipelines we have already established for MuTect2 (both leave-one-out panel-of-normal and completely independent panel-of-normal strategies) can be found at https://github.com/shishenyxx/Sperm_control_cohort_mosaicism. Alternatively, you can try the BSMN common pipeline https://github.com/bsmn to get the input.

Best,

Xiaoxu
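
For readers following along, the two input routes named in the reply above (MuTect2 single mode, or HaplotypeCaller with a high ploidy) can be sketched as command lines. This is only a minimal sketch, not the authors' pipeline: it assumes GATK4 flag spellings (`--panel-of-normals`, `--sample-ploidy`), and the function names and file paths are hypothetical placeholders. The linked repositories remain the reference for the exact commands and versions used.

```python
from typing import List, Optional


def mutect2_single_mode_cmd(
    reference: str,
    bam: str,
    out_vcf: str,
    panel_of_normals: Optional[str] = None,
) -> List[str]:
    """Build a Mutect2 command with no matched normal (single-sample mode).

    Assumes GATK4 argument names; file paths are placeholders.
    """
    cmd = ["gatk", "Mutect2", "-R", reference, "-I", bam, "-O", out_vcf]
    if panel_of_normals:
        # Optional panel of normals, as mentioned in the thread.
        cmd += ["--panel-of-normals", panel_of_normals]
    return cmd


def haplotypecaller_high_ploidy_cmd(
    reference: str,
    bam: str,
    out_vcf: str,
    ploidy: int = 50,
) -> List[str]:
    """Build a HaplotypeCaller command that writes a plain VCF.

    No "-ERC GVCF" is passed, so the output is a regular VCF rather
    than a gVCF. The thread suggests a ploidy of 50 or 100.
    """
    if ploidy < 1:
        raise ValueError("ploidy must be a positive integer")
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-O", out_vcf,
        "--sample-ploidy", str(ploidy),
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; paths are hypothetical.
    print(" ".join(mutect2_single_mode_cmd("ref.fa", "sample.bam", "sample.mutect2.vcf.gz")))
    print(" ".join(haplotypecaller_high_ploidy_cmd("ref.fa", "sample.bam", "sample.hc.vcf.gz", 100)))
```

The lists can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files. Quality thresholds such as the one asked about earlier in the thread are deliberately left at the tool defaults here, since the thread does not state a recommended value.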

astulaaa commented 1 year ago

Dear Wouter,

Thank you for your prompt response. Yes, I am talking about the input VCF for DeepMosaic. The BSMN pipeline you shared uses HaplotypeCaller but with gVCF output (I mean the "-ERC GVCF" option used there), which confused me. I guess there is no reason to question the parameters, as you have already confirmed it is compatible with DeepMosaic. Looking forward to doing the analysis!

Thanks again, Asta
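
Since the confusion above is about whether a pipeline's output is a gVCF or a plain VCF, a quick check may help. The sketch below relies on the fact that GATK's `-ERC GVCF` mode writes the symbolic `<NON_REF>` allele in the ALT column. It is a heuristic only, and the function name and the record limit are illustrative choices, not part of DeepMosaic or GATK.

```python
from typing import Iterable


def looks_like_gvcf(lines: Iterable[str], max_records: int = 1000) -> bool:
    """Return True if any of the first `max_records` records lists <NON_REF> in ALT.

    `lines` is any iterable of VCF text lines (for example an open file handle).
    Header lines starting with '#' are skipped.
    """
    seen = 0
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a well-formed record; ignore it
        alt_alleles = fields[4].split(",")
        if "<NON_REF>" in alt_alleles:
            return True
        seen += 1
        if seen >= max_records:
            break
    return False
```

For a file on disk, pass `open(path)` for an uncompressed VCF or `gzip.open(path, "rt")` for a `.vcf.gz`.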

wdecoster commented 1 year ago

Xiaoxu, Not Wouter