paulhager / smart-phase

A comprehensive and intelligent clinical phasing tool
GNU General Public License v3.0
13 stars 2 forks source link

suspicious haplotypes derived from smart-phase #6

Closed jielab closed 3 years ago

jielab commented 3 years ago

Dear Tim:

I now put a very small test data at https://github.com/jielab/ukb/tree/master/data: 1687346.vcf.gz and 1687346.bam.

And I run the following smart-phase command to phase the two APOE SNPs: java -jar /mnt/d/software_lin/smartPhase.jar -a 1687346.vcf.gz -p 1687346 -g apoe.b38.bed -r 1687346.bam -m 60 -x -vcf -c 0.1 -o 1687346.tsv

The phased haplotype in the output file 1687346_sp.vcf.gz (tag "SPGT") is the same as the original haplotype (tag "GT").

However, I actually found that the original GT is wrong. When I load the BAM file into IGV, as you see below, the first SNP is C/T, while the second SNP is C/C (i.e., non-polymorphic). Therefore, the GT for the second SNP rs7412 should be "1|1" instead of "1|0".

123

After I changed the GT for the second SNP rs7412 to "1|1" for the 1687346.vcf.gz file, and re-run the above command. Somehow, this time there is no genotype data in the output file 1687346_sp.vcf.gz. So, it seems that smart-phase consider the fixed VCF file is impossible.

Can you please kindly run this testing dataset on your side and let me know if I did something wrong here?

Thank you & best regards, Jie

timjeske commented 3 years ago

Dear Jie,

at position 19:44,908,822 the reference allele is a C, so there is no variant, the genotype is 0/0. Even if C would be the variant allele, a homozygous genotype cannot be phased, because you cannot differentiate from which parent the allele came, as it is identical. Only heterozygous variants are used to create haplotypes and to phase these variants as discussed previously.

So, in your exemplary VCF both the reference allele and the genotype of the second variant is incorrect. Nevertheless, you're right SmartPhase should report all variants in the input VCF in the result VCF (which would be identical to the input VCF in your case). @paulhager Could you please check, why the VCF is left empty in this case?

Best Tim

jielab commented 3 years ago

Dear Tim:

Thank you very much!

As shown in the IGV, this sample apparently has a C/T genotype for rs429358 and a C/C genotype for rs7412. Therefore, the two haplotypes for this sample must be C-C and T-C (i.e., APOE e3e4).

No matter what technical reason it might be, this sample could not have an APOE e1e3 haplotype, as reported by SmartPhase. So, this is my question.

I would deeply appreciate if you could take a look inside this and see if SmartPhase can phase this sample correctly.

Best regards, Jie

timjeske commented 3 years ago

Dear Jie,

please let me clarify some points:

  1. If you have a genotype C/T and C/C you don't need to do any phasing. So, I'm not really sure why you want to use SmartPhase for this combination. It's obvious that the haplotypes are C-C and C-T.

  2. We have already discussed in detail that SmartPhase is not primarily suitable for creating haplotypes as you want to do it, but rather phases combinations of heterozygous variants. Nevertheless, I have described how SmartPhase can be used for your special case here.

  3. You're files need to be consistent to get any useful results. So, first, make sure the reference alleles in your variant file are identical to the sequence in the reference genome (for position 44,908,822, the reference is a C and not a T as in your example file). Second, make sure that the genotypes in the VCF file fit to the BAM file (usually you would use the BAM file to do variant calling and then you're fine anyway). In your case, the BAM files contains only reads with a C at position 44,908,822, so there is actually no variant at this position, or the genotype would be 0/0 (if any other individual would have a variant at this position). If you use corrupted input files, you cannot really expect that a tool fixes everything for you.

We're happy to help as soon as you have implemented your analysis as proposed. Otherwise, it makes not really sense to discuss how SmartPhase should behave when it has to handle inconsistent input files.

Best regards Tim

jielab commented 3 years ago

Dear Tim:

I understand that Smart-Phase was not designed to call haplotypes from BAM files alone. Instead, it needs a VCF file and the VCF file must be in "correct" format, i.e., the REF allele must be the same as that in the Human reference genome. On the other hand, we should realize that the so-called REF allele is relative. When more and more people are deeply sequenced these days, REF could become ALT, and vice versa. Some other format, such as PLINK, simply use A1 and A2, instead of REF and ALT. So, as long as the genotype coding is correct, no matter which allele is REF or ALT, I feel that it should be fine.

For the VCF file that I provided, it is indeed inconsistent that the second SNP rs7412 has a 0|1 genotype while the BAM file shows non-polymorphic. However, the VCF file was indeed from the UKB genotyped data, not something that I made up. I guess that this is not so surprising, since one is from genotyping (plus imputation) and the other is from sequencing.

The reason why I am asking all these questions is not simply due to curiosity or trying to pick up some issue for Smart-Phase. I am developing a tool that aims to call haplotypes based on BAM files alone. So, I am trying to learn Smart-Phase and other similar tools to make sure that I fully understand what is already out there and what else I could pick up and contribute a little.

Best regards, Jie

jielab commented 3 years ago

I now manually changed the genotype data for the second SNP, so that it is consistent with the BAM file. Please see screenshot below.

00

Now, I re-run Smart-Phase, somehow the output _sp.vcf.gz file does not contain any SNP data. I assume this means that the _sp.vcf.gz file agrees with the input VCF file and therefore feels "no need" to output the phase information again.

Please kindly confirm this is the case.

Best regards, Jie

paulhager commented 3 years ago

I now manually changed the genotype data for the second SNP, so that it is consistent with the BAM file. Please see screenshot below.

00

Now, I re-run Smart-Phase, somehow the output _sp.vcf.gz file does not contain any SNP data. I assume this means that the _sp.vcf.gz file agrees with the input VCF file and therefore feels "no need" to output the phase information again.

Please kindly confirm this is the case.

Best regards, Jie

Hi Jie,

SmartPhase has no output here because it needs at least two heterozygous variants within the specified regions to phase. As your file only has one heteroygous variant (and one homozygous), there is nothing to be phased. So SmartPhase does nothing.

jielab commented 3 years ago

Ok, I see.

paulhager commented 3 years ago

Try it again with at least two heterozygous variants, and if the issues persist, please don't hesitate to open another ticket!

We can look into what the output of SmartPhase is like without valid input, but that's fairly low priority for us right now.