Open mkohailan opened 4 years ago
That is confusing. I don't see any mistakes in your command and I don't think you'd see that particular message if bad inputs had caused an issue (although I could be wrong and will keep thinking about it). The most likely explanation is that the particular informative site you mentioned didn't pass quality control filters built into unfazed. The main QC requirements (for a heterozygous variant) are:
For homozygous calls the same filters apply except the read allele balance must be between 0.0-0.2 or 0.8-1.0 (depending on whether it's a hom-alt or a hom-ref.
Please let me know if this is helpful. I'd be happy to dig more into it with you, especially if the issue is a bug in my code I can fix.
What variant caller was used to create the VCF? Does it have AB, DP and GQ attributes in the genotype columns?
Thanks for your reply
I am using FreeBayes (default run) to call the variants.
After your comments, I ran FreeBayes with --genotype-qualities
option to write the GQ in the genotype field of each sample. So, now I have DP and GQ written in the genotype column for each sample, and the AB in the INFO column.
I ran a test and it worked for some of my variants (including the one in the above example). So thanks for your help
For one of the variants (below), I can see that both the DNM site and the informative site can pass these filters, but still not phased. Are there any other criteria used?
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- 9 | 88075348 | . | A | G | 174.68 | . | AB=0.40625;………. | GT:AD:AO:DP:GQ:PL:QA:QR:RO | 0/0:37,0:0:38:99:0,111,1311:0:1465:37 | 0/0:41,0:0:41:99:0,123,1303:0:1486:41 | 0/1:19,13:13:32:99:248,0,554:448:724:19 9 | 88075353 | . | A | G | 1525 | . | AB=0.462069;………. | GT:AD:AO:DP:GQ:PL:QA:QR:RO | 0/0:37,0:0:37:99:0,111,1252:0:1393:37 | 0/1:23,18:18:41:99:455,0,648:691:857:23 | 0/1:19,14:14:33:99:259,0,565:496:739:19
By the way, is there an option to play with the quality control filters set by the tool?
A possible reason a variant like this one might not be phased could be if there are multiple informative sites in the region that give contradictory results, possibly because a recombination event or sequencing error occurred. If you use the --include-ambiguous
option on the command line it will report these variants as well, so that might help.
Quality control filters are not currently exposed to the user, although this may change shortly. I'm also considering ways to improve the information given to the user about why certain variants aren't phased, although I want to avoid overwhelmingly large outputs as well.
Great! looking for the new versions then
Thank you
Hi,
Thanks for making this tool. I tried it out yesterday on some data and the output for all sites is:
No usable informative sites for variant “variant name.”
We know the parent-of-origin for many of the variants based on assessing the read data, so would expect this program to find at least some of them. Based on your preprint, it looks like the number of variants where parent-of-origin can be determined is higher than in previous papers/methods so that is why we hope to use it.
The command is as follows:
/path/to/unfazed -d childname_denovo.bed -s childname.trio.vcf.gz -p childname.ped --bam-pairs childname:/path/to/childname.cram dadname:/path/to/dadname.cram momname:/path/to/momname.cram -r /path/to/GRCh38_full_analysis_set_plus_decoy_hla.fa
The input of the bed file is as follows (only the top line shown for space reasons):
chr1 7326507 7326507 childname SNV
The input of the childname.trio.vcf is a full genome-wide vcf from DeepVariant (the FORMAT field is as follows: GT:DP:AD:GQ:PL:RNC). I have also tried a VCF from GATK and also tried adding the AB in to one of these VCFs. None of these changes worked. I have also tried only giving it the child cram and not the parents … still the same result.
The ped format is standard plink format
Would like to run this tool and any insight is appreciated.
Thanks,
Tychele
I should note that I have also tried the bed file as follows:
chr1 7326506 7326507 childname SNV
Hmm, that's disappointing. It's possible there's an instability I'm not aware of. Could you try running with -t 1
? This option requires it to run with just a single thread but allows improved exception handling.
Okay, added -t 1
to end and now it says
UNFAZED v1.0.0 Traceback (most recent call last): File "/opt/conda/bin/unfazed", line 10, in <module> sys.exit(main()) File "/opt/conda/lib/python3.6/site-packages/unfazed/__main__.py", line 237, in main unfazed(args) File "/opt/conda/lib/python3.6/site-packages/unfazed/unfazed.py", line 648, in unfazed args.split_error_margin, File "/opt/conda/lib/python3.6/site-packages/unfazed/snv_phaser.py", line 419, in phase_snvs split_error_margin, File "/opt/conda/lib/python3.6/site-packages/unfazed/snv_phaser.py", line 262, in run_read_phasing whole_region=False, File "/opt/conda/lib/python3.6/site-packages/unfazed/informative_site_finder.py", line 240, in find dnms, pedigrees, vcf_name, search_dist, threads, build, whole_region File "/opt/conda/lib/python3.6/site-packages/unfazed/informative_site_finder.py", line 668, in find_many whole_region, TypeError: multithread_find_many() missing 1 required positional argument: 'build'
Also, just added --build 38
as well to see if that solved the build question but it gives the same error. To clarify, the error is the same as the previous one when adding the -t 1
.
Yes, there is a bug in the code at 668
. I have corrected it in the repo and will push a new release with the fix as soon as the tests finish. I will also work on the exception handling and improved tests.
I think the bug I fixed is probably not the same one you ran into before -t 1
, unfortunately. We'll see, but I expect it's going to be a different issue.
Okay, thanks
I have created a new release v1.0.1 which should fix the known bug
Okay, I tried installing with conda but it looks like the new version is not there yet?
Also, I took some time to look through the test files in the GitHub and decided to change my bed from saying SNV or INDEL to saying POINT. Then I ran it on 10 sites and it ran on those sites (with parent-of-origin) for four of them! I will do some more testing but it looks like POINT is the key and not to use SNV or INDEL for the bed. Also, this worked on the VCF did not contain AB if that is helpful.
Thanks,
Tychele
Ah, a great catch. I'll work on improving the generality of that term as well, and clarifying the necessary VCF fields (allele balance is calculated from AD in Unfazed). I really appreciate your patience and persistence in working through these new-software issues.
Hi,
The program is generally working well now (currently using v1.0.1). However, for a subset of the trios in my dataset (n = 48), I keep getting this error:
UNFAZED v1.0.1
Traceback (most recent call last):
File "/opt/conda/bin/unfazed", line 10, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.7/site-packages/unfazed/__main__.py", line 237, in main
unfazed(args)
File "/opt/conda/lib/python3.7/site-packages/unfazed/unfazed.py", line 648, in unfazed
args.split_error_margin,
File "/opt/conda/lib/python3.7/site-packages/unfazed/snv_phaser.py", line 419, in phase_snvs
split_error_margin,
File "/opt/conda/lib/python3.7/site-packages/unfazed/snv_phaser.py", line 262, in run_read_phasing
whole_region=False,
File "/opt/conda/lib/python3.7/site-packages/unfazed/informative_site_finder.py", line 240, in find
dnms, pedigrees, vcf_name, search_dist, threads, build, whole_region
File "/opt/conda/lib/python3.7/site-packages/unfazed/informative_site_finder.py", line 650, in find_many
chrom_ranges[chrom],
KeyError: 'chrX'
When I remove the X chromosome for those trios it runs fine. The children are all males in these particular trios so am assuming the de novo would be primarily on the X chromosome from the mother (a possible exception being the pseudoautosomal regions). However, the whole program crashes because of the KeyError and I don't get the results for the rest of the genome without removing the X chromosome and rerunning. Also, I looked over the bed inputs and they look fine.
Tychele
For male samples Unfazed is intended to only phase the PAR for chrX and automatically assign the rest of the chrX variants to maternal origin. Clearly that's not working and I'll figure it out. Thanks for pointing this out!
Thank you! The tool is working quite well. I compared it with some code we have for phasing using the nearest variant (Whatshap phased vcf plus some custom scripts). Unfazed found more variants and when both approaches looked at the same site there was 100% concordance on the parent-of-origin (awesome!).
Also, wanted to mention one other error that I have seen so far and that was this one:
UNFAZED v1.0.1
Traceback (most recent call last):
File "/opt/conda/bin/unfazed", line 10, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.7/site-packages/unfazed/__main__.py", line 237, in main
unfazed(args)
File "/opt/conda/lib/python3.7/site-packages/unfazed/unfazed.py", line 648, in unfazed
args.split_error_margin,
File "/opt/conda/lib/python3.7/site-packages/unfazed/snv_phaser.py", line 419, in phase_snvs
split_error_margin,
File "/opt/conda/lib/python3.7/site-packages/unfazed/snv_phaser.py", line 262, in run_read_phasing
whole_region=False,
File "/opt/conda/lib/python3.7/site-packages/unfazed/informative_site_finder.py", line 303, in find
dad_idx, ref_depths, alt_depths, genotypes, gt_quals
File "/opt/conda/lib/python3.7/site-packages/unfazed/informative_site_finder.py", line 102, in is_high_quality_site
if min_ab <= allele_bal <= max_ab:
UnboundLocalError: local variable 'min_ab' referenced before assignment
When I switched from the DeepVariant to GATK vcf for this trio everything went great so am not sure if it is a DeepVariant VCF issue (the VCF is valid) or something else.
Very helpful error messages. I'll push a new version with fixes asap
Hi,
I am new to programming and I am trying to run Unfazed to phase my DNMs, but could't make it
Here is my code:
unfazed -d My_DNM_File.vcf -s Main_VCF_File_With_All_Variants.vcf.gz -p pedigree.ped -b bamFiles/ -t 2 -o vcf --outfile test.vcf -r ref/hs37d5.fa
My_DNM_File.vcf
contains the vcf file header and only the DNM variant I am testing (4 44986884)Main_VCF_File_With_All_Variants.vcf.gz
is a multi-sample sorted/bgzipped/indexed vcf file with all variants of the family (i.e. Not only DNMs)pedigree.ped
is a tab delimited file that shows the relationship of the trio I am testing (Format: Family_ID Sample_ID Father Mother Sex Phenotype)bamFiles/
the directory that contains the three bam files of the trio (named as recommended by the tool: sample_ID.bam)Output: No usable informative sites for variant 4:44986883-44986884
However, I am using this DNM for the test because I am sure it can be phased as there is an informative site in the same read near to it: CHROM POS Child Father Mother 4 44986884 0/1 0/0 0/0 <-- My DNM site 4 44986893 0/1 0/0 1/1 <-- Informative site
Appreciate your help