Several possible improvements recommendation for pHASER.

Hi @secastel :+1: pHASER has been quite helpful for me. However, I have encountered several issues that would need attention to make this tool better.

1) The vcf file output by pHASER isn't quite compatible with the GATK tool. I am not able to figure out what actually is the problem but I suspect there is some problem with the structure of the output *.vcf when written by pHASER. I can give you more details if need be.

2) GW (genome wide) phasing is quite a useful thing, but I think if there was a choice to update the GT field rather with PG. I think in some instances we would want to get the haplotype states of the particular block. Note: I am actually working with F1 hybrid data and in that situation GW phase using pHASER led to switch errors (verified by comparison of the several bam and vcf). The solution to the haplotyping problem (especially in the F1 hybrids) is to get the haplotypes as much big as possible represented by unique PI. So, each PI has two haplotypes; now we can test statistically if Haplotype-A vs. Haplotype-B belong to Population_X vs. Population_Y using OddsRatio or Markov Model. I have actually just finished writing a python script to do this using Odds Ratio. Currently I am writing another part for the use of Markov Model, but being a biologist its taking me some time. In nutshell, my program uses the PB and PI generated by your program to stitch the haplotypes. So, if you could add an option to update the GT by PB values it would be helpful.

3) Another addon: GATK generates the haplotypes in the given format GT:AD:DP:GQ:PGT:PID:PL 0/1:80,25:105:99:0|1:5398_A_G:780,0,3463 GT:AD:DP:GQ:PGT:PID:PL 0/1:14,4:18:83:1|0:47883_G_A:83,0,472

I transfered the phase state from PGT to GT using awk and supplementd that vcf as phased input. The ouput file was able to extend the haplotypes of the block quite significantly for several block. I think this capability would he helpful.

4) Phasing of Indels: Also, the phasing of InDels can be improved by phasing the SNPs first, then transferring the phase state in PB to GT (which I did using awk again). The phased state of the InDels was quite improved. I think this can be incorporated. But, the issue that came up was the following: The FORMAT field would look like following: GT:AD:DP:PL:PG:PB:PI:PW:PC:PG:PB:PI:PW:PC So, when the phase output from pHASER is supplemented as Phased-Input for phasing of InDels the field PG:PB:PI:PW:PC get extended along with the SAMPLE field. This can be probably removed.

5) This is the issue I previously reported: Choice to output the phase of the homozygous alternate allele (1|1, or 2|2, 3|3) and haplotype if they are connected with the heterozygous allele. I actual tried to write some thing with in your pHASER but couldn't track the several def function()

Hope these comments don't bother you too much.

And, finally "Happy New Year 2017 ! " :) :+1:

Hey there. Thanks for the very thorough description above. I'd like to make a general note that phASER has been designed with the most common use case in mind, that is phasing diploid genomes of single individuals that have been previously population phased. Most of the time this ends up being human genomes. In general I am hesitant to add highly specialized options, since these will confuse the majority of users. For things that are specific to your analysis I would encourage you to build your own scripts that process the output of phASER.

That being said, I have some questions for you.

1) Could you let me know which GATK tool you are trying to use and provide a few more details so that I can try to reproduce the issue? I'm guessing this is related to the request in #12?

2) I understand that in some situations you may want to use local phasing within haplotype blocks as opposed to the genome wide phasing, which is why phASER includes the PG and PB fields in the output. Putting PG into the GT field would not conform to VCF standards, since the phasing across different blocks (PB) cannot be compared. I worry that this would result in misinterpretation of the phasing results. Is there a specific reason that you can't just use the PG field for your application? You could write a simple script to replace GT with PG if you really need to.

3) I'm a little confused by what you're doing here. You used which GATK tool to generate haplotypes? You then proceeded to move the the local haplotype phase generated by GATK (PGT) into the GT field and run phASER? While you may have a few larger blocks doing this I would not trust them. Similar to my comment in (2) you cannot simply transfer local phase states to genome wide phase state (which is what the GT field is assumed to be) because local phasing across haplotype blocks is not consistent. I'm not sure if you are using the GATK read backed phasing tool, but if you are I suggest you refer to figure S3 in our manuscript, where we show that the performance of the GATK tool is extremely poor. I would not trust the output from it.

4) I think this suffers from the same issue as (2) and (3). You absolutely cannot transfer phase state information from PB to GT and re-run the tool. This violates assumptions that phASER makes about the GT field.

5) I'm sorry I didn't get around to answering this. Homozygous alleles are not informative for phasing adjacent haplotype blocks since there is no way to distinguish one haplotype from another in the reads. Thus these sites cannot be used during phasing. I'm not sure why you would want to include their phase in the local phase blocks. Additionally, I'm unsure what you mean by connected to a heterozygous allele? Reads that overlap a heterozygous allele and a homozygous allele do not report any informative information. Since the allele is homozygous one already knows that each of the haplotypes connected to the heterozygous allele have the same base at the homozygous allele's position...

Hi @secastel :+1: I totally understand that pHASER is designed for single diploid genomes. I had looked into and used so many phasing tool but pHASER is the the one that worked for my purpose. I am trying to phase several hybrid individuals one by one, so pHASER is the best choice for me for now. In the above comments I was making suggestions that could improve the purpose of the pHASER (based on what I read on your paper).

I look into a problem from biological and evolutionary standpoint and its common that our approach to problem solving are not aligning. Hope I am not offending your. Please see below for more details.

Could you let me know which GATK tool you are trying to use and provide a few more details so that I can try to reproduce the issue? I'm guessing this is related to the request in #12?

I am working on other problem now and I have to dig around to really find what the specific problem was then (when I reported). Reading you python script I could tell that you didn't use pyVCF module to read and write the files, rather read the vcf file line by line as a text. I am not sure but was thinking if that was the problem. After running the pHASER, I used GATK SelectVariants with -sn option to select the sample that was phased https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php where the problem started. I can give your more details later. Also, the problem could be specific to my case but please check.

I understand that in some situations you may want to use local phasing within haplotype blocks as opposed to the genome wide phasing, which is why phASER includes the PG and PB fields in the output. Putting PG into the GT field would not conform to VCF standards, since the phasing across different blocks (PB) cannot be compared. I worry that this would result in misinterpretation of the phasing results. Is there a specific reason that you can't just use the PG field for your application? You could write a simple script to replace GT with PG if you really need to.

I am not suggesting to replace GT field with PG, but to update the values in GT with values from PG; something like --gw_phase_vcf (0) - Replace GT field of output VCF using phASER genome wide phase but regardless of the Phase Confidence. Reason, being that I can write a python script but other biologists won't have this skill. Also, other VCF manipulating are not written to work with this PG field but only with GT field. So, if someone wants to mine the haplotypes from certain intervals (using bed file) will easily able to do it.
Yes, I understand that this would not generate a good genome wide phase but I checked and verified that most of the site really do have switch errors. I am thinking of taking the GT or PG values along with PI index to further extend the phasing results using a new method I am trying to write.

I'm a little confused by what you're doing here. You used which GATK tool to generate haplotypes? You then proceeded to move the the local haplotype phase generated by GATK (PGT) into the GT field and run phASER? While you may have a few larger blocks doing this I would not trust them. Similar to my comment in (2) you cannot simply transfer local phase states to genome wide phase state (which is what the GT field is assumed to be) because local phasing across haplotype blocks is not consistent. I'm not sure if you are using the GATK read backed phasing tool, but if you are I suggest you refer to figure S3 in our manuscript, where we show that the performance of the GATK tool is extremely poor. I would not trust the output from it.

If you look GT:AD:DP:GQ:PGT:PID:PL 0/1:0,15:15:48:1|0:12438_T_C:700,48,0 You can see that PGT is 1|0 which was generate by GATK while runnning HaplotypeCaller -gvcf first and then 'joint genotyping'. http://gatkforums.broadinstitute.org/gatk/discussion/3893/calling-variants-on-cohorts-of-samples-using-the-haplotypecaller-in-gvcf-mode I think GATK people are using RBphasing here but the ouput format while running RBphasing exclusively is different. I will have to email them. I am trying to talk about this option: --gw_phase_method (0) - Method to use for determing genome wide phasing. NOTE requires input VCF to be phased Since, it required phased input, I was suggesting if there was way to indicate which FIELD (tag) to use as phased. The default can be set to GT but if we have vcf where the phase is indicated by PS, or PGT, of HP we would want to use this. Transferring the phase state to GT take another time.

(2) you cannot simply transfer local phase states to genome wide phase state (which is what the GT field is assumed to be) because local phasing across haplotype blocks is not consistent.

I understand this issue, but using the phase state of the input file (GT, PGT, or HP) gets the phase state right for PG in the output (checked visually by comparing bam, vcf on IGV with haplotypes_output.txt). The GT is assumed to be genome wide phase state but it really does have switch errors at lots of places (not matter how much improvements in algorithms are made, because genome is a very big string) so rather improving PG is a good thing to do. Taking phase state of PG (for different PI blocks) along with the frequency of the allele in the population/s can help to create better genome wide phase in F1 hybrids, population samples from sympatric regions, human individuals who have mixed ethnicities, etc.

I think this suffers from the same issue as (2) and (3). You absolutely cannot transfer phase state information from PB to GT and re-run the tool. This violates assumptions that phASER makes about the GT field.

I understand. All, I am trying to do is improve (or get the right) phase state of indels with in PG not within GT.

Homozygous alleles are not informative for phasing adjacent haplotype blocks since there is no way to distinguish one haplotype from another in the reads.

I am more focusing not on the phasing but on the output file (i.e the haplotypes). In the situation when I want to switch the bases in the reference genome with the bases from phased haplotypes (or vcf) for doing further population genetics analyses. I would not just want to replace the site that are heterozygous but all the site that have changed compared to reference genome (which would mean 1/1, 2/2 but not 0/0).

Sorry, all this is just too complicated and I just wanted to bring up some thoughts. I totally agree with the things that need to be done with the purpose of the program.

Hi, just wanted to say that I haven't completely forgotten about this. I'm swamped with grant writing at the moment, but my plan is to reply to your message as soon as I have time. Sorry for the delay.

Alright, back to this, sorry for the delay.

1) I fixed the problem that was preventing you from using the GATK tool with phASER outputted VCFs. The issue was that I wasn't adding the "PS" tag to the header. You can find it in the latest version (0.9.9.2).

2) If you want to output a file where the PG is outputted in the GT field regardless of the phase confidence you can use the --gw_phase_vcf 2 and this will output using the PS VCF standard. In this case when the phase confidence is less than --gw_phase_vcf_min_confidence it still outputs the variant's PG phase, but with the addition of PS which defines the block index. This should serve for your purposes.

3) I don't think I'm really comfortable changing the code to take phasing information from anything other than the GT field. I'm worried this will just overly complicate things. Sorry if this complicates your pipeline, but as I said, it needs to be designed for the most common use case, which is going to be a population phased VCF as input with the GT field for all variants phased.

secastel / phaser

Several possible improvements recommendation for pHASER. #22