secastel / phaser

phasing and Allele Specific Expression from RNA-seq
GNU General Public License v3.0
111 stars 36 forks source link

Issues with VCF merge after running phaser #41

Open everestial opened 6 years ago

everestial commented 6 years ago

Hi @secastel

I may have raised this issue before, but the problem with VCF merge still exists. I am not sure if VCF from other callers are running into this issue, but with GATK (HaplotypeCaller) generated VCF's, I can run run phaser. But, the single sample VCF's that are not generated cannot be merged back.

This is the error message if it helps:

$ java -jar -Xmx16g /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T CombineVariants -R lyrata_genome.fa -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -V ms04h_phased.vcf -o F1.phased_variants.Final.vcf

INFO  19:25:42,458 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  19:25:42,460 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18 
INFO  19:25:42,461 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
INFO  19:25:42,461 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk 
INFO  19:25:42,461 HelpFormatter - [Thu Jan 04 19:25:42 EST 2018] Executing on Linux 4.10.0-42-generic amd64 
INFO  19:25:42,461 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 
INFO  19:25:42,465 HelpFormatter - Program Args: -T CombineVariants -R lyrata_genome.fa -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -V ms04h_phased.vcf -o F1.phased_variants.Final.vcf 
INFO  19:25:42,469 HelpFormatter - Executing as everestial007@everestial007-Inspiron-3647 on Linux 4.10.0-42-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12. 
INFO  19:25:42,469 HelpFormatter - Date/Time: 2018/01/04 19:25:42 
INFO  19:25:42,469 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  19:25:42,469 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  19:25:42,492 GenomeAnalysisEngine - Strictness is SILENT 
INFO  19:25:42,741 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  19:25:43,091 GenomeAnalysisEngine - Preparing for traversal 
INFO  19:25:43,094 GenomeAnalysisEngine - Done preparing for traversal 
INFO  19:25:43,094 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  19:25:43,095 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  19:25:43,095 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
##### ERROR --
##### ERROR stack trace 
java.lang.NumberFormatException: For input string: ""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:592)
    at java.lang.Integer.valueOf(Integer.java:766)
    at htsjdk.variant.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:717)
    at htsjdk.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:129)
    at htsjdk.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:158)
    at htsjdk.variant.variantcontext.LazyGenotypesContext.getGenotypes(LazyGenotypesContext.java:148)
    at htsjdk.variant.variantcontext.GenotypesContext.iterator(GenotypesContext.java:465)
    at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.mergeGenotypes(GATKVariantContextUtils.java:1556)
    at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.simpleMerge(GATKVariantContextUtils.java:1224)
    at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:361)
    at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:143)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: For input string: ""
##### ERROR ------------------------------------------------------------------------------------------

The issue at hand is that even if I can merge the data using other tools, I will still have problems because I further have to use GATK to do my downstream analyses. Can you please check the issue if time permits?

Here are my files.

ms01e_phased.vcf.gz ms02g_phased.vcf.gz ms03g_phased.vcf.gz ms04h_phased.vcf.gz

secastel commented 6 years ago

I'll look into this. I know that bcftools merge works should work fine with phASER outputted VCFs, so until I can identify the problem with the GATK you could try that.

secastel commented 6 years ago

I've had the time to do some testing. I've found a few things.

1) GATK has no problem merging VCFs outputted by phASER. I confirmed this using your files, and my own files.

2) In your case, the GATK is giving you an error because there is a problem with ms04h_phased.vcf. If you trying running the merge without that one VCF you will see it works fine. Digging deeper, I found I was still unable to merge ms04h_phased.vcf even after I stripped all of the phASER associated tags from it. Thus, I suspect that there is some problem with this VCF that is totally independent of phASER. Given that the GATK outputs a totally uninformative error message, I'm not sure what that problem is. You will have to do some more digging, but as I said, this has nothing to do with phASER.

I confirmed that all of your VCFs (including ms04h_phased.vcf) can be merge without issue using bcftools merge, so you may just want to do this.

everestial commented 6 years ago

@secastel Thank for the update. I will try to see what went wrong in that sample.

everestial commented 6 years ago

@secastel : I again had this issue. I think I may have found the reason why vcf's arenot merging using GATK CombineVariants.

Here is what I did for troubleshooting:

Unable to merge I ran the GATK validate variants, which gives me the error message:

everestial007@.../outputs_RBphased_VCF$ java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T ValidateVariants -R ${ref_genome} -V ms01e_phased.vcf
INFO  09:27:37,686 HelpFormatter - ---------------------------------------------------------------------------------- 
INFO  09:27:37,690 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50 
INFO  09:27:37,690 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
INFO  09:27:37,691 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk 
INFO  09:27:37,691 HelpFormatter - [Fri Mar 02 09:27:37 EST 2018] Executing on Linux 4.13.0-32-generic amd64 
INFO  09:27:37,692 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 
INFO  09:27:37,698 HelpFormatter - Program Args: -T ValidateVariants -R /media/everestial007/SeagateBackup4.0TB2/New_Alignment_Set/RefNindex_lyrata/lyrata_genome.fa -V ms01e_phased.vcf 
INFO  09:27:37,701 HelpFormatter - Executing as everestial007@everestial007-Inspiron-3647 on Linux 4.13.0-32-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12. 
INFO  09:27:37,702 HelpFormatter - Date/Time: 2018/03/02 09:27:37 
INFO  09:27:37,702 HelpFormatter - ---------------------------------------------------------------------------------- 
INFO  09:27:37,702 HelpFormatter - ---------------------------------------------------------------------------------- 
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO  09:27:37,986 GenomeAnalysisEngine - Deflater: IntelDeflater 
INFO  09:27:37,987 GenomeAnalysisEngine - Inflater: IntelInflater 
INFO  09:27:37,988 GenomeAnalysisEngine - Strictness is SILENT 
INFO  09:27:38,603 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  09:27:39,014 GenomeAnalysisEngine - Preparing for traversal 
INFO  09:27:39,017 GenomeAnalysisEngine - Done preparing for traversal 
INFO  09:27:39,018 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  09:27:39,018 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  09:27:39,018 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: File /media/everestial007/SeagateBackup4.0TB2/RNAseq_Data_Analyses/phaser_to_ASE_on_DiploidGenome/outputs_RBphased_VCF/ms01e_phased.vcf fails strict validation: the Allele Number (AN) tag is incorrect for the record at position 2:181028, 6 vs. 2
##### ERROR ------------------------------------------------------------------------------------------

You can see that there is problem with AN tag and the reason I think this is happening is because phaser reads in multi sample VCF but outputs only one sample VCF, where the "AN" values are not recalculated. So, GATK is throwing some error because VCF is having validation problems. Either there needs to be a method to ignore this flag in GATK or there needs to be method to update AN if phaser is outputting single sample VCF's.

For now, I am able to merge the VCFs using "bcftools" and GATK doesnot throw validation error on this merge VCF because the number of samples before running "phaser" and after running "bcftools merge" has matched "AN" values.

My issues is solved for now, but I am leaving this message so potential data structure problem can be sorted out either with GATK or phaser in the future.

secastel commented 6 years ago

Thanks for figuring this out! I was stumped, due to the GATK's uninformative error messages. I don't think I want to go down the road of re-calculating tags that are not generated by phASER, so I will likely leave the code as is. However, it is very useful to know that the AN tag potentially causes problems when using GATK for downstream processing of samples.

I'm a bit confused as to why I was seeing the error when merging only one of the VCFs (ms04h_phased.vcf) but not the others, since all of them have the same AN tag issue. Do you have any insight as to why this is the case?

everestial commented 6 years ago

A line from the error message: fails strict validation: the Allele Number (AN) tag is incorrect for the record at position 2:15881224, 6 vs. 2

and AN is an INFO level tag not FORMAT level. ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">

Here is what I think the reason is:

But, with some further debugging I found another problem - that GATK sometimes seems to raise error for the same problem in some files but not in other files. I am not sure what is the case. But, see these error messages below:

With sample ms04h.

$java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T ValidateVariants -R ${ref_genome} -V ms04h_phased.vcf
##### ERROR MESSAGE: File /media/everestial007/SeagateBackup4.0TB2/RNAseq_Data_Analyses/phaser_to_ASE_on_DiploidGenome/testing VCF merge/ms04h_phased.vcf fails strict validation: the Allele Number (AN) tag is incorrect for the record at position 2:15881224, 6 vs. 2

The vcf line at this position is: 2 15881224 . T . 143.24 PASS AN=6;BaseQRankSum=1.75;ClippingRankSum=0.00;DP=9;ExcessHet=3.0103;FS=3.522;InbreedingCoeff=-0.1072;MQ=41.48;MQRankSum=0.366;QD=15.92;ReadPosRankSum=0.345;SOR=2.712 GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:PM 0/0:4:4:3:0:0/0:.:.:0/0:.:.

But, with other samples (ms02g in this case).

$java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T ValidateVariants -R ${ref_genome} -V ms02g_phased.vcf
##### ERROR MESSAGE: File /media/everestial007/SeagateBackup4.0TB2/RNAseq_Data_Analyses/phaser_to_ASE_on_DiploidGenome/testing VCF merge/ms03g_phased.vcf fails strict validation: one or more of the ALT allele(s) for the record at position **2:15881018** are not observed at all in the sample genotypes

The vcf line at position also has AN=6 but GATK doesn't raise validation error at this position, but at 2 15881224 . T . 143.24 PASS AN=6;BaseQRankSum=1.75;ClippingRankSum=0.00;DP=9;ExcessHet=3.0103;FS=3.522;InbreedingCoeff=-0.1072;MQ=41.48;MQRankSum=0.366;QD=15.92;ReadPosRankSum=0.345;SOR=2.712 GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:PM 0/0:1:1:3:0:0/0:.:.:0/0:.:.

So, the confusing part is, Why is GATK raising Validation error differently for same problem in different files/samples?

Conclusion: On the problem regarding GATK merge feature, my thought is that there are multiple validation failures when the INFO tags aren't updated in single sample vcf, so GATK merge cannot proceed. But, this seems to go away as soon as you merge the single sample VCFs back to same number of multisample VCF file as original input to the phaser. See, below:

When merging only 3 samples (ms01e, ms02g, ms03g):

$ java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T CombineVariants -R ${ref_genome} -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -o F1_RBphased_variants.merged.vcf -genotypeMergeOptions UNSORTED
----------

Done. There were no warn message

But, now there is problem with validation:

$ java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T ValidateVariants -R ${ref_genome} -V F1_RBphased_variants.merged.vcf
##### ERROR
##### ERROR MESSAGE: File /.../F1_RBphased_variants.merged.vcf fails strict validation: one or more of the ALT allele(s) for the record at position 2:15881018 are not observed at all in the sample genotypes
##### ERROR ------------------------------------------------------------------------------------------

When merging all 4 samples:

$ java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T CombineVariants -R ${ref_genome} -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -V ms04h_phased.vcf -o F1_RBphased_variants.merged.vcf -genotypeMergeOptions UNSORTED .....

ERROR
##### ERROR MESSAGE: For input string: ""
##### ERROR ------------------------------------------------------------------------------------------

The above command does produce a partial VCF, but no validation error is raised.

 $ java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T CombineVariants -R ${ref_genome} -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -V ms04h_phased.vcf -o F1_RBphased_variants.merged.vcf -genotypeMergeOptions UNSORTED
.....
----------
Done. There were no warn messages.
-----------

This is lots of information, hope I didn't overwhelm you. I am going to raise an issue on GATK forum on this case.

One last question, is "phaser" someway associated with "GATK" project. I have a tool coming in about a week that totally complements with your "phaser" and I want to see if it can be incorporated in "GATK" pipeline. I will provide the details in another issue.

Thanks,

everestial commented 6 years ago

@secastel

I have to add that merging VCFs has been a big problem with both bcftools and GATK CombineVariants.

With bcftools merge the files are merging, but I am getting duplicate records at lots of places; see the issue - https://github.com/samtools/bcftools/issues/754

With GATK CombineVariants several samples fail to merge despite having come from "phaser" with same standards. I have raised an issue here: https://gatkforums.broadinstitute.org/wdl/discussion/11571/combine-variants-throwing-a-bug-error