paulhager / smart-phase

A comprehensive and intelligent clinical phasing tool
GNU General Public License v3.0
13 stars 2 forks source link

Exception in thread "main" java.lang.Exception: While parsing filtered variants #4

Closed jielab closed 3 years ago

jielab commented 3 years ago

Hi, there:

I am trying to use smart-Phase to phase a BAM file, without using any family data or existing phased data that exist in VCF. The following command works: java -jar smartPhase.jar \ -g ./BED/allGeneRegionsCanonical.HG19.GRCh37.bed \ -a ./UseCase/CEU_UseCase.vcf.gz -p NA12878 \ -r ./UseCase/CEU_UseCase.bam -m 60 \ -o output.tsv

And I got the following output. ACTRT2-1-2938045-2939467 1-2938924-T-G 1-2938989-G-A 2 0.8081305751896183 CLDN19-1-43198763-43205925 1-43201534-G-A 1-43201614-C-T 1 0.8050514884016794 PCSK4-19-1481426-1490407 19-1487195-G-A 19-1490285-G-A 4 0.0 Denovo count: 0 Cis count: 1 Avg dist between cis: 80.0 Trans count: 1 Avg dist between trans: 65.0 Newblock count: 1 Contradiction count: 0

I saw the CEU_UseCase.vcf.gz file include 6 SNPs for 3 samples. The 6 SNPs are the same as those 6 SNPs listed in the above output file. So, my understanding is that the VCF is used only to list the SNPs to be phased, correct? If I excluded the -a ./UseCase/CEU_UseCase.vcf.gz -p NA12878 option from the above command, it won't run. Instead, it simply gave me a usage: Welcome to SmartPhase! page. Isn't there a simpler way to run smart-Phase, without using a VCF file? Can't I simply specify a list of SNPs in a text file?

I found there are > 30,000 SNPs in the BED file. I guess this file listed all genes in human genome. Again, I have no idea why we need a file like this, if I simply want to phase 6 SNPs as used in the example. If i excluded the -g ./BED/allGeneRegionsCanonical.HG19.GRCh37.bed option from the above command, I got an error of Exception in thread "main" java.lang.Exception: While parsing filtered variants

Your reply would be greatly appreciated!

Best regards, Jie

timjeske commented 3 years ago

Dear Jie,

thank you for your interest in SmartPhase.

The VCF file given via the -a option should list all variants that are used to phase your variants of interest. There are different ways to define your variants of interest:

As soon as we have some time, we will adapt the error messages so that they are more meaningful. I hope my explanations are helpful until then.

Best regards, Tim

paulhager commented 3 years ago

Hi Jie,

Tim explained it quite well, but to just quickly expand on our thoughts behind the process, this tool was primarily developed to be used in clinical pipelines. There, every patient has a canonical VCF file containing all of their variants (the file which smartphase expects under the -a flag). Because usually not all variants are going to be interesting for the biological / medical question being investigated, SmartPhase gives you the option to only phase a subset of these variants, either through the -f flag where you specify pairs that are interesting for you or the -g flag where you specify regions which are interesting for your analysis.

This is also what sets us apart from other phasing tools in that we don't force whole chromosome phasing on our users, but let them decide which parts of their patient genomes (i.e. which genes or candidate variant pairs) are interesting for their research. This gives them the ability to phase the interesting bits in seconds instead of waiting hours for the entire chromosome or having to cut their vcf files every time they want to phase a gene or a subset of variants.

If you wanted to just phase everything in the vcf file, a bed file that follows this format:

chrom chromStart chromEnd name

chr1 1 249213345 every1 ... chr19 1 59095762 every19 ...

Would accomplish that and be more succinct. We could look into having a phasing of the entire -a file being the default behavior for the next release and improving the error messages.

This is also discussed in our paper (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007613) and the documentation so I would recommend you browse through that if your still interested in some of our design choices.

Thanks for your input! If you have any more questions please let us know.

Best, Paul

jielab commented 3 years ago

Dear Tim & Paul:

Thank you very much! As you know, there is only one sample in a BAM file and there is even no sample ID within it. That is why I got confused when I had to provide a vcf.gz file and a sample ID through the -p option.

I understand that Smart-phase could utilize trio phasing information beyond what is simply in the genetic data, therefore, VCF files and sample IDs make sense in that situation.

For now, I only want to use Smart-phase to phase some target SNPs in a single BAM file, for two or more SNPs that I know their CHR:POS. Since I have no idea what the alleles are in the BAM file for my target SNPs, I would not be able to write something such as NA12878 1-2938924-T-G 1-2938989-G-A. Is it OK if I only write something such as NA12878 1-2938924 1-2938989 and call it through the -f option? If this "-f" option could replace the the "-g" option for a VCF file, then the sample ID NA128781 would not mean anything, correct?

It would be great if you could throw me an example code, the simplest one-liner, to run Smart-phase on a haplotype defined by two SNPs such as 1-2938924 1-2938989 . Can I use smart-phase to phase 3 SNPs such as 1-1000 1-1100 1-1200?

Thank you very much!

Best regards, Jie

jielab commented 3 years ago

Dear Tim and Paul:

Please refer to the error message below. Does this mean that a SNP must be polymorphic for smart-phase to run? If so, I feel this is something very inconvenient. It would be good to still proceed, since technically a non-polymorphic SNP would be even easier to "phase" a haplotype.

Best regards, Jie

捕获

timjeske commented 3 years ago

Dear Jie,

the -p option is very helpful when the VCF file contains multiple samples. For our clinical application, we have VCF files with all patients but often we just want to phase some pairs for a specific individual without spitting the VCF before phasing.

So far, I could not really understand why you would like to phase genomic positions even if there might be no variation? A SNP is defined as a polymorphism, so I'm not sure what you understand as a non-polymorphic SNP? The error message you've posted is absolutely expected as a VCF file requires the fields CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO as described in the specifications. To implement what you like to do, you should first perform variant calling with some tool of your choice for the genomic positions you are interested in and then run SmartPhase.

Best regards, Tim

jielab commented 3 years ago

Thanks Tim!

Of course I would not phase a 2-SNP haplotype "on purpose" if I knew that one or both SNPs are non-polymorphic.

If I trust the phasing presented in a VCF file, I would not need to run Smart-phase or any other phasing software. Assuming that I just got a WGS BAM file for one patient. In this case, usually phasing is not provided or not accurate at all if I only have one sample or a few samples. Then let's say that I want to find out the 2-SNP based haplotype (such as APOE) for this sample, and I run smart-Phase on it. Let's say that the haplotype for this sample is T-T and T-C, which is quite common. Instead of getting a result like this, I got an error message from smart-Phase ....

That is my question.

Best regards, Jie

timjeske commented 3 years ago

Dear Jie,

thanks for clarifying your question.

I guess you are searching for some haplotyping solution based on predefined loci. Let's say it is known that position 1, 5 and 10 are polymorphic. Some sample might have the following alleles T/T for position 1, T/C for position 5 and A/G for position 10. Then you like to have a result like this: Haplotype 1 is T-C-A Haplotype 2 is T-T-G

Am I correct?

This can be done by SmartPhase but you need to do variant calling first. SmartPhase is only desigend for phasing variants as there is no need for performing additional variant calling. Haplotyping based on genomic positions and a BAM file would be an interesting extension for SmartPhase but it is not planned to implement this in the near future.

However, I would recommend to do the following steps (should be easy to implement):

  1. Do variant calling on your BAM file (or only the region of the BAM file you are interested in or even only on the SNP positions). Using my example, you will then find that position 5 and position 10 are heterozygous variants (assuming that T and A are the reference alleles).
  2. Run SmartPhase on both heterozygous variants and check the phasing result: a cis call would correspond to the haplotype T-A/C-G, a trans call to T-G/C-A).
  3. Extract the haplotypes as shown above (the homozygous T/T can easily be attached).

I hope this helps!

Best regards Tim

jielab commented 3 years ago

Dear Tim:

Thank you very much for your detailed clarification.

Now this all makes sense to me :-)

Best regards, Jie

timjeske commented 3 years ago

I'm happy that we could help!

I will close this issue for now but don't hesitate to ask us again whenever something is unclear.

Best regards, Tim