jbelyeu / unfazed

Unfazed by genomic variant phasing
MIT License
26 stars 0 forks source link

Trying to run Unfazed #1

Open mkohailan opened 4 years ago

mkohailan commented 4 years ago

Hi,

I am new to programming and I am trying to run Unfazed to phase my DNMs, but could't make it

Here is my code: unfazed -d My_DNM_File.vcf -s Main_VCF_File_With_All_Variants.vcf.gz -p pedigree.ped -b bamFiles/ -t 2 -o vcf --outfile test.vcf -r ref/hs37d5.fa

My_DNM_File.vcf contains the vcf file header and only the DNM variant I am testing (4 44986884) Main_VCF_File_With_All_Variants.vcf.gz is a multi-sample sorted/bgzipped/indexed vcf file with all variants of the family (i.e. Not only DNMs) pedigree.ped is a tab delimited file that shows the relationship of the trio I am testing (Format: Family_ID Sample_ID Father Mother Sex Phenotype) bamFiles/ the directory that contains the three bam files of the trio (named as recommended by the tool: sample_ID.bam)

Output: No usable informative sites for variant 4:44986883-44986884

However, I am using this DNM for the test because I am sure it can be phased as there is an informative site in the same read near to it: CHROM POS Child Father Mother 4 44986884 0/1 0/0 0/0 <-- My DNM site 4 44986893 0/1 0/0 1/1 <-- Informative site

Appreciate your help

jbelyeu commented 4 years ago

That is confusing. I don't see any mistakes in your command and I don't think you'd see that particular message if bad inputs had caused an issue (although I could be wrong and will keep thinking about it). The most likely explanation is that the particular informative site you mentioned didn't pass quality control filters built into unfazed. The main QC requirements (for a heterozygous variant) are:

For homozygous calls the same filters apply except the read allele balance must be between 0.0-0.2 or 0.8-1.0 (depending on whether it's a hom-alt or a hom-ref.

Please let me know if this is helpful. I'd be happy to dig more into it with you, especially if the issue is a bug in my code I can fix.

arq5x commented 4 years ago

What variant caller was used to create the VCF? Does it have AB, DP and GQ attributes in the genotype columns?

mkohailan commented 4 years ago

Thanks for your reply

I am using FreeBayes (default run) to call the variants. After your comments, I ran FreeBayes with --genotype-qualities option to write the GQ in the genotype field of each sample. So, now I have DP and GQ written in the genotype column for each sample, and the AB in the INFO column.

I ran a test and it worked for some of my variants (including the one in the above example). So thanks for your help

For one of the variants (below), I can see that both the DNM site and the informative site can pass these filters, but still not phased. Are there any other criteria used?

CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | MOTHER | FATHER | CHILD

-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- 9 | 88075348 | . | A | G | 174.68 | . | AB=0.40625;………. | GT:AD:AO:DP:GQ:PL:QA:QR:RO | 0/0:37,0:0:38:99:0,111,1311:0:1465:37 | 0/0:41,0:0:41:99:0,123,1303:0:1486:41 | 0/1:19,13:13:32:99:248,0,554:448:724:19 9 | 88075353 | . | A | G | 1525 | . | AB=0.462069;………. | GT:AD:AO:DP:GQ:PL:QA:QR:RO | 0/0:37,0:0:37:99:0,111,1252:0:1393:37 | 0/1:23,18:18:41:99:455,0,648:691:857:23 | 0/1:19,14:14:33:99:259,0,565:496:739:19

By the way, is there an option to play with the quality control filters set by the tool?

jbelyeu commented 4 years ago

A possible reason a variant like this one might not be phased could be if there are multiple informative sites in the region that give contradictory results, possibly because a recombination event or sequencing error occurred. If you use the --include-ambiguous option on the command line it will report these variants as well, so that might help.

Quality control filters are not currently exposed to the user, although this may change shortly. I'm also considering ways to improve the information given to the user about why certain variants aren't phased, although I want to avoid overwhelmingly large outputs as well.

mkohailan commented 4 years ago

Great! looking for the new versions then

Thank you

tycheleturner commented 3 years ago

Hi,

Thanks for making this tool. I tried it out yesterday on some data and the output for all sites is:

No usable informative sites for variant “variant name.”

We know the parent-of-origin for many of the variants based on assessing the read data, so would expect this program to find at least some of them. Based on your preprint, it looks like the number of variants where parent-of-origin can be determined is higher than in previous papers/methods so that is why we hope to use it.

The command is as follows: /path/to/unfazed -d childname_denovo.bed -s childname.trio.vcf.gz -p childname.ped --bam-pairs childname:/path/to/childname.cram dadname:/path/to/dadname.cram momname:/path/to/momname.cram -r /path/to/GRCh38_full_analysis_set_plus_decoy_hla.fa

The input of the bed file is as follows (only the top line shown for space reasons): chr1 7326507 7326507 childname SNV

The input of the childname.trio.vcf is a full genome-wide vcf from DeepVariant (the FORMAT field is as follows: GT:DP:AD:GQ:PL:RNC). I have also tried a VCF from GATK and also tried adding the AB in to one of these VCFs. None of these changes worked. I have also tried only giving it the child cram and not the parents … still the same result.

The ped format is standard plink format

Would like to run this tool and any insight is appreciated.

Thanks,

Tychele

tycheleturner commented 3 years ago

I should note that I have also tried the bed file as follows:

chr1 7326506 7326507 childname SNV

jbelyeu commented 3 years ago

Hmm, that's disappointing. It's possible there's an instability I'm not aware of. Could you try running with -t 1? This option requires it to run with just a single thread but allows improved exception handling.

tycheleturner commented 3 years ago

Okay, added -t 1 to end and now it says

UNFAZED v1.0.0 Traceback (most recent call last): File "/opt/conda/bin/unfazed", line 10, in <module> sys.exit(main()) File "/opt/conda/lib/python3.6/site-packages/unfazed/__main__.py", line 237, in main unfazed(args) File "/opt/conda/lib/python3.6/site-packages/unfazed/unfazed.py", line 648, in unfazed args.split_error_margin, File "/opt/conda/lib/python3.6/site-packages/unfazed/snv_phaser.py", line 419, in phase_snvs split_error_margin, File "/opt/conda/lib/python3.6/site-packages/unfazed/snv_phaser.py", line 262, in run_read_phasing whole_region=False, File "/opt/conda/lib/python3.6/site-packages/unfazed/informative_site_finder.py", line 240, in find dnms, pedigrees, vcf_name, search_dist, threads, build, whole_region File "/opt/conda/lib/python3.6/site-packages/unfazed/informative_site_finder.py", line 668, in find_many whole_region, TypeError: multithread_find_many() missing 1 required positional argument: 'build'

tycheleturner commented 3 years ago

Also, just added --build 38 as well to see if that solved the build question but it gives the same error. To clarify, the error is the same as the previous one when adding the -t 1.

jbelyeu commented 3 years ago

Yes, there is a bug in the code at 668. I have corrected it in the repo and will push a new release with the fix as soon as the tests finish. I will also work on the exception handling and improved tests.

I think the bug I fixed is probably not the same one you ran into before -t 1, unfortunately. We'll see, but I expect it's going to be a different issue.

tycheleturner commented 3 years ago

Okay, thanks

jbelyeu commented 3 years ago

I have created a new release v1.0.1 which should fix the known bug

tycheleturner commented 3 years ago

Okay, I tried installing with conda but it looks like the new version is not there yet?

Also, I took some time to look through the test files in the GitHub and decided to change my bed from saying SNV or INDEL to saying POINT. Then I ran it on 10 sites and it ran on those sites (with parent-of-origin) for four of them! I will do some more testing but it looks like POINT is the key and not to use SNV or INDEL for the bed. Also, this worked on the VCF did not contain AB if that is helpful.

Thanks,

Tychele

jbelyeu commented 3 years ago

Ah, a great catch. I'll work on improving the generality of that term as well, and clarifying the necessary VCF fields (allele balance is calculated from AD in Unfazed). I really appreciate your patience and persistence in working through these new-software issues.

tycheleturner commented 3 years ago

Hi,

The program is generally working well now (currently using v1.0.1). However, for a subset of the trios in my dataset (n = 48), I keep getting this error:

UNFAZED v1.0.1
Traceback (most recent call last):
  File "/opt/conda/bin/unfazed", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.7/site-packages/unfazed/__main__.py", line 237, in main
    unfazed(args)
  File "/opt/conda/lib/python3.7/site-packages/unfazed/unfazed.py", line 648, in unfazed
    args.split_error_margin,
  File "/opt/conda/lib/python3.7/site-packages/unfazed/snv_phaser.py", line 419, in phase_snvs
    split_error_margin,
  File "/opt/conda/lib/python3.7/site-packages/unfazed/snv_phaser.py", line 262, in run_read_phasing
    whole_region=False,
  File "/opt/conda/lib/python3.7/site-packages/unfazed/informative_site_finder.py", line 240, in find
    dnms, pedigrees, vcf_name, search_dist, threads, build, whole_region
  File "/opt/conda/lib/python3.7/site-packages/unfazed/informative_site_finder.py", line 650, in find_many
    chrom_ranges[chrom],
KeyError: 'chrX'

When I remove the X chromosome for those trios it runs fine. The children are all males in these particular trios so am assuming the de novo would be primarily on the X chromosome from the mother (a possible exception being the pseudoautosomal regions). However, the whole program crashes because of the KeyError and I don't get the results for the rest of the genome without removing the X chromosome and rerunning. Also, I looked over the bed inputs and they look fine.

Tychele

jbelyeu commented 3 years ago

For male samples Unfazed is intended to only phase the PAR for chrX and automatically assign the rest of the chrX variants to maternal origin. Clearly that's not working and I'll figure it out. Thanks for pointing this out!

tycheleturner commented 3 years ago

Thank you! The tool is working quite well. I compared it with some code we have for phasing using the nearest variant (Whatshap phased vcf plus some custom scripts). Unfazed found more variants and when both approaches looked at the same site there was 100% concordance on the parent-of-origin (awesome!).

Also, wanted to mention one other error that I have seen so far and that was this one:

UNFAZED v1.0.1
Traceback (most recent call last):
  File "/opt/conda/bin/unfazed", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.7/site-packages/unfazed/__main__.py", line 237, in main
    unfazed(args)
  File "/opt/conda/lib/python3.7/site-packages/unfazed/unfazed.py", line 648, in unfazed
    args.split_error_margin,
  File "/opt/conda/lib/python3.7/site-packages/unfazed/snv_phaser.py", line 419, in phase_snvs
    split_error_margin,
  File "/opt/conda/lib/python3.7/site-packages/unfazed/snv_phaser.py", line 262, in run_read_phasing
    whole_region=False,
  File "/opt/conda/lib/python3.7/site-packages/unfazed/informative_site_finder.py", line 303, in find
    dad_idx, ref_depths, alt_depths, genotypes, gt_quals
  File "/opt/conda/lib/python3.7/site-packages/unfazed/informative_site_finder.py", line 102, in is_high_quality_site
    if min_ab <= allele_bal <= max_ab:
UnboundLocalError: local variable 'min_ab' referenced before assignment

When I switched from the DeepVariant to GATK vcf for this trio everything went great so am not sure if it is a DeepVariant VCF issue (the VCF is valid) or something else.

jbelyeu commented 3 years ago

Very helpful error messages. I'll push a new version with fixes asap