google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

question about DeepTrio #699

Closed sophienguyen01 closed 1 year ago

sophienguyen01 commented 1 year ago

Hi,

I want to ask if DeepTrio provides an option to generate de novo variants. I ran the tool using the exact files provided in the tutorial and received 3 VCF and 3 gVCF files, but that's about it. How can I get the de novo variants from the child VCF output?

Thanks

pgrosu commented 1 year ago

Hi Sophie,

You have a few options:

1) The first option is, as Andrew mentioned in a previous post, to run DeepVariant on each sample, use GLnexus to merge the results, and identify the de novo mutations from the joint call file.

2) DeepTrio outputs the child VCF as specified by the flag --output_vcf_child. You would then need to compare those variants across multiple samples (with some truth sets) against the parents, to ensure they are truly de novo mutations (DNMs) and not false positives. That is quite a bit of work to perform properly.

3) You can use external tools, of which there are many :)

Hope it helps, Paul

sophienguyen01 commented 1 year ago

Thank you for your answer

danielecook commented 1 year ago

I will close this issue now. If you have further questions feel free to reopen. Thank you @pgrosu.

sophienguyen01 commented 1 year ago

Hi @pgrosu,

For option 1, after merging into a joint call file, which tool can I use to call de novo variants? According to this tutorial, RTG mendelian can only calculate the non-Mendelian rate, not identify de novo mutations.

For option 3, did you mean using another tool after DeepTrio has produced a VCF for each sample? I have done a bit of research and there are not that many, and most of the tools are old and not well maintained. Can you list some recommendations here?

Thanks, Sophie

pgrosu commented 1 year ago

Hi Sophie,

So as you know, besides the genetic information passed on from the parents, each of us is born with a small number of additional novel genetic changes called de novo mutations (e.g. from environmental effects). These variants are not inherited from the parents, and therefore violate Mendelian inheritance.

So when you use rtg-tools mendelian with the --output flag, it will save an updated VCF file annotated with calls violating Mendelian inheritance, thus highlighting the de novo mutations. The information (header) fields in these updated annotated VCF files will have the following:

##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=MCV,Number=.,Type=String,Description="Variant violates mendelian inheritance constraints">
##INFO=<ID=MCU,Number=.,Type=String,Description="Mendelian consistency status can not be determined">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DN,Number=1,Type=String,Description="De novo allele">
##FORMAT=<ID=MCP,Number=.,Type=String,Description="Describes the expected genotype ploidy in cases where the given genotype does not match the expected ploidy">

Each de novo call that violates Mendelian inheritance will be annotated like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  father  mother  son1    son2    daughter1       daughter2-initial       daughter2
Chr1    4917    .       A       G       .       .       MCV=daughter2:0|0+0|0->0|1      GT:DN   0|0     0|0     0|0     0|0     0|0     0|0     0|1:Y
Chr1    15214   .       G       C       .       .       MCV=daughter2:0|0+0|0->1|0      GT:DN   0|0     0|0     0|0     0|0     0|0     0|0     1|0:Y
Chr2    4883    .       T       G       .       .       MCV=daughter2:0|0+0|0->0|1      GT:DN   0|0     0|0     0|0     0|0     0|0     0|0     0|1:Y
Chr2    11369   .       G       A       .       .       MCV=daughter2:0|0+0|0->0|1      GT:DN   0|0     0|0     0|0     0|0     0|0     0|0     0|1:Y
Chr3    11754   .       A       G       .       .       MCV=daughter2:0|0+0|0->0|1      GT:DN   0|0     0|0     0|0     0|0     0|0     0|0     0|1:Y
Chr4    37470   .       C       T       .       .       MCV=daughter2:0|0+0|0->1|0      GT:DN   0|0     0|0     0|0     0|0     0|0     0|0     1|0:Y
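As a minimal sketch of how the annotated output above could be filtered, the Python snippet below collects every sample call whose DN subfield is Y. It assumes a tab-separated, uncompressed VCF with the GT and DN FORMAT keys shown above; the function name `de_novo_calls` is hypothetical and is not part of rtg-tools.

```python
def de_novo_calls(vcf_lines):
    """Yield (chrom, pos, sample, genotype) for each sample call flagged DN=Y.

    Assumes tab-separated VCF text lines, as in the annotated output above.
    """
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        # Skip blank lines and "##" meta-information lines.
        if not line.strip() or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample names start at the tenth column of the header line.
            samples = fields[9:]
            continue
        keys = fields[8].split(":")
        if "DN" not in keys:
            continue
        for sample, value in zip(samples, fields[9:]):
            # Trailing subfields may be omitted (e.g. "0|0" without a DN value).
            call = dict(zip(keys, value.split(":")))
            if call.get("DN") == "Y":
                yield fields[0], int(fields[1]), sample, call.get("GT")
```

Passing an open file handle, for example `list(de_novo_calls(open("annotated.vcf")))` with a hypothetical file name, would return one tuple per flagged call, such as the daughter2 calls in the rows above.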

Below are a few tools that can also perform trio analysis (generating their own VCF), or can perform VCF refinement based on pedigree information:

The key point to take away from this is not that there are options, but how these options internally work to infer the genotype and its probability given the data. Some work better with longer reads, and some with shorter reads. You want to play with them to get a feel of what is happening given different data. If you are curious, you can read the papers and mathematics behind each approach, and you'll be surprised by their similarity in approaches of inferring the call and its probability (quality). I have included a list of papers with links in the reference section below.

Now if the above is too easy, and you want to make de novo variant calling more exciting, you can run GLnexus with --config DeepVariant_unfiltered, a preset that corresponds to a YAML config file telling GLnexus which parameters to operate under. When you perform GLnexus joint variant calling, you will get the three sample columns (father/mother/child) in your joint VCF. To find a candidate de novo call, you just look for genotypes that would not follow Mendelian inheritance, such as 0/0 0/0 0/1, for example:

chr7    54624683        chr7_54624683_A_AATC    A       AATC    27      .       AF=0.166667;AQ=27       GT:DP:AD:GQ:PL:RNC      0/0:39:22,16:28:27,0,48:..      0/0:40:40,0:50:0,120,1199:..    0/1:28:28,0:50:0,90,899:..

Though keep in mind DeepTrio/GLnexus might produce false positives, caused by low read quality (low MAPQ) or by other factors such as an over-representation of reads that align to multiple sites. For example, a call might be labeled 0/1 0/0 0/0 while IGV better supports 0/1 0/1 0/0. Otherwise, if the read quality is good and the alignments are unique with proper coverage, then it might actually be de novo, though the proband (child) calls are the more interesting ones. For this you would need more samples to ensure the calls are not false positives, with further IGV inspection and assay validation. If this might be a bit too fun, feel free to skip it, but it's here if you are curious to dive deeper into the possible de novo calls from DeepTrio/GLnexus.
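The genotype check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: diploid autosomal sites, genotypes written as 0/1 or 0|1, and hypothetical function names (`split_alleles`, `is_mendelian_violation`) that are not part of DeepTrio, GLnexus, or rtg-tools. It flags candidates only and does not address the false-positive concerns mentioned above.

```python
import re


def split_alleles(gt):
    """Split a diploid genotype such as '0/1' or '0|1' into its two alleles.

    Returns None when the genotype is not diploid or has a missing allele.
    """
    parts = re.split(r"[/|]", gt)
    if len(parts) != 2 or "." in parts:
        return None
    return parts


def is_mendelian_violation(father_gt, mother_gt, child_gt):
    """Return True if the child genotype cannot be built from one allele per parent.

    Assumes an autosomal diploid site. Undetermined genotypes return False.
    """
    father = split_alleles(father_gt)
    mother = split_alleles(mother_gt)
    child = split_alleles(child_gt)
    if father is None or mother is None or child is None:
        return False
    a, b = child
    # One child allele must come from the father and the other from the mother.
    consistent = (a in father and b in mother) or (b in father and a in mother)
    return not consistent
```

For the 0/0 0/0 0/1 pattern, `is_mendelian_violation("0/0", "0/0", "0/1")` evaluates to True, while a child 0/1 with a 0/1 father and a 0/0 mother is consistent and evaluates to False. In a real joint VCF the GT value would first have to be taken from the start of each sample field, with the father/mother/child column order confirmed from the header.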

Basically the big idea is to take it slow and have fun to get the most out of it, as with many moving parts (programs + parameters) and varied data you want to be confident in the calls - which can take a lot of finesse. With super-clean data, that's not such a big deal - but that's not why we use these tools :)

Hope it helps, Paul

References

[1] RTG Tools Manual
[2] dv-trio: a family-based variant calling pipeline using DeepVariant
[3] FamSeq: A Variant Calling Program for Family-Based Sequencing Data Using Graphics Processing Units
[4] DeepTrio: Variant Calling in Families Using Deep Learning
[5] A unified haplotype-based method for accurate and comprehensive variant calling (This is the Octopus paper.)
[6] DeNovoGear: de novo indel and point mutation discovery and phasing

sophienguyen01 commented 1 year ago

Thank you @pgrosu, I completely overlooked the annotated VCF output from rtg mendelian. This is exactly what I am looking for.

pgrosu commented 1 year ago

Hi Sophie,

You are very welcome, and that is absolutely understandable. Feel free to reach out again if you have more questions.

Paul