hail-is / hail

Cloud-native genomic dataframes and batch computing
https://hail.is
MIT License

create tool for finding de novo variants in vcf containing trios #34

Closed by cseed 7 years ago

cseed commented 8 years ago

From @jbloom22 on September 1, 2015 18:20

Kaitlin has written a Python tool that can serve as a model. From Kaitlin:

The most recent version is attached (3.93). The biggest issue that I've yet to resolve is how to handle multi-allelic lines above tri-allelic. Gets into nightmare territory quickly.

To run this, you'll need three things:

1) The VCF of interest
2) A PED file for the families in the VCF
3) The ESP variant counts file that I made (.gz for the moment since it is so large). You can find this file here: /humgen/atgu1/fs03/wip/kaitlin/all_ESP_counts_5.28.13.txt
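For anyone porting this, the PED file is what identifies the trios. Below is a minimal sketch of pulling complete trios out of a PED file, assuming the standard six-column layout (family ID, individual ID, father ID, mother ID, sex, phenotype) with "0" marking a missing parent. The names `Trio` and `complete_trios` are hypothetical and are not taken from Kaitlin's script.

```python
from typing import Iterable, List, NamedTuple


class Trio(NamedTuple):
    family: str
    child: str
    father: str
    mother: str


def complete_trios(ped_lines: Iterable[str]) -> List[Trio]:
    """Return children whose father and mother both have rows in the PED file."""
    rows = []
    for line in ped_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        if len(fields) < 6:
            continue  # assumption: silently skip malformed rows
        rows.append(tuple(fields[:4]))

    # Key on (family, individual): individual IDs need only be unique per family.
    present = {(fam, ind) for fam, ind, _, _ in rows}

    return [
        Trio(fam, ind, father, mother)
        for fam, ind, father, mother in rows
        if father != "0"
        and mother != "0"
        and (fam, father) in present
        and (fam, mother) in present
    ]
```

With a real file this would be called as `complete_trios(open(path))`. Children with only one listed parent are dropped, since the filters described in this issue need both parents' genotypes.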

The command line should look like this: python de_novo_finder_3.py all_ESP_counts_5.28.13.txt

I suggest specifying an output file. There are a few optional flags that you can use to adjust things in the script.

-v, --annotatevariants_VEP: If you have VEP annotations in the ANNOTATION column of the VCF, this will pull out and print the gene name and mutation type

-t, --thresh: The PL threshold set for the next most likely genotype in the child. This gets at how confident you want the het call to be in the child. Default is a PL threshold of 20, but you can adjust that up or down if you'd like.

-c, --minchildAB: I require that the heterozygous child has at least 20% alternative reads. You can adjust that with this.

-d, --depthratio: I require that the depth of coverage in the child is at least 1/10th that of the combined parental depth. You can adjust that with this (integer input for the 1/x).

-m, --prob(dn)metric: The minimum p(DN) that you will accept. I have it set at 0.05 and you could adjust it up. Due to the validation likelihoods that were added, you won't get anything below 0.05.

-p, --maxparentAB: I require that the parents, who should both be homozygous reference, have no more than 5% alternative reads. You can adjust that with this flag.

-a, --annotatevariants: If you have SnpEff annotations in the ANNOTATION column of the VCF, this will pull out and print the gene name and mutation type
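The numeric flags above amount to a per-site filter on each trio. Here is a rough sketch of that filter with the defaults listed above. It is not the logic of Kaitlin's script: the class and function names are hypothetical, depth is taken as ref + alt reads, sites are assumed biallelic with PLs ordered (hom-ref, het, hom-alt), comparisons at the boundary are assumed inclusive, and p(DN) is treated as an input because its calculation is not described here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Thresholds:
    pl_thresh: int = 20          # -t: min PL of the child's next most likely genotype
    min_child_ab: float = 0.20   # -c: min alt-read fraction in the het child
    depth_ratio: int = 10        # -d: child depth >= combined parental depth / x
    min_p_dn: float = 0.05       # -m: min accepted p(DN)
    max_parent_ab: float = 0.05  # -p: max alt-read fraction in each parent


@dataclass(frozen=True)
class Call:
    ref_reads: int
    alt_reads: int
    pl: Tuple[int, int, int]  # assumed order: (hom-ref, het, hom-alt)

    @property
    def depth(self) -> int:
        return self.ref_reads + self.alt_reads

    @property
    def alt_fraction(self) -> float:
        return self.alt_reads / self.depth if self.depth else 0.0


def passes_filters(child: Call, father: Call, mother: Call,
                   p_dn: float, t: Thresholds = Thresholds()) -> bool:
    """Apply the thresholds to one candidate site in one trio."""
    # Assumption: a sample with no reads cannot support a de novo call.
    if min(child.depth, father.depth, mother.depth) == 0:
        return False
    # Confidence of the het call: the best non-het PL must reach the threshold.
    if min(child.pl[0], child.pl[2]) < t.pl_thresh:
        return False
    if child.alt_fraction < t.min_child_ab:
        return False
    # Integer form of: child depth >= (father depth + mother depth) / depth_ratio.
    if child.depth * t.depth_ratio < father.depth + mother.depth:
        return False
    if max(father.alt_fraction, mother.alt_fraction) > t.max_parent_ab:
        return False
    return p_dn >= t.min_p_dn
```

Keeping the thresholds in one frozen dataclass mirrors the command-line flags one to one, so a stricter run is just `passes_filters(child, father, mother, p_dn, Thresholds(pl_thresh=30))`.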

Copied from original issue: cseed/hail#44

cseed commented 7 years ago

Pending review: https://github.com/hail-is/hail/pull/1870