How should I convert VCF to Phylip format?

pblischak / HyDe

Hybridization detection using phylogenetic invariants

http://hybridization-detection.readthedocs.io

MIT License

How should I convert VCF to Phylip format? #15

Closed: silvewheat closed this issue 4 years ago

silvewheat commented 4 years ago

Hello,

HyDe uses sequence data in Phylip format as input, but I only have a VCF file. I'm not sure what the best way is to convert a VCF to sequence data. Specifically, I have the following questions:

Should I keep the non-variant sites (those that are not in the VCF file but are in the reference FASTA file) and generate a consensus sequence? Or should I use only the sites in the VCF file?
Should I convert each diploid individual into two haploid sequences? Or can I treat each diploid as one individual and use ambiguity codes at heterozygous sites?

Best, Yudong

pblischak commented 4 years ago

Hi Yudong,

To convert to Phylip format, you should be able to use just the variant sites in your VCF file. I also think that keeping both alleles for the diploids would be better, as long as you aren't trying to run the individual-level tests. Because HyDe assumes independence among sites, it shouldn't matter which allele you assign to each of the two haploid individuals you make from a diploid in the VCF file. If you want to run the tests on individuals, though, then I think using ambiguity codes for heterozygous sites and coding each individual as a single consensus sequence would be best.
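A minimal Python sketch of the two encodings described above may make the difference concrete. It is not part of HyDe: the function names `vcf_to_sequences` and `to_phylip` are hypothetical, and the sketch assumes uncompressed VCF text, diploid samples, SNP records only (indels and symbolic alleles are skipped), and `GT` as the first FORMAT field. Any genotype with a missing allele is written as `N`. Whether HyDe expects the Phylip count header should be checked against its documentation, so the header is optional here.

```python
import random

# IUPAC ambiguity codes for heterozygous genotypes (unordered base pairs).
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def vcf_to_sequences(vcf_lines, split_diploids=True, seed=0):
    """Build {name: sequence} from the SNP records of a VCF.

    split_diploids=True  -> two haploid sequences per sample (name_1, name_2),
                            with the two alleles assigned in random order.
    split_diploids=False -> one consensus sequence per sample, with IUPAC
                            ambiguity codes at heterozygous sites.
    """
    rng = random.Random(seed)
    names, seqs = [], {}
    for line in vcf_lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            names = fields[9:]  # sample columns start after FORMAT
            for name in names:
                keys = (name + "_1", name + "_2") if split_diploids else (name,)
                for key in keys:
                    seqs[key] = []
            continue
        alleles = [fields[3]] + fields[4].split(",")
        if any(len(a) != 1 or a not in "ACGT" for a in alleles):
            continue  # keep SNPs only (assumption of this sketch)
        for name, sample in zip(names, fields[9:]):
            gt = sample.split(":")[0].replace("|", "/").split("/")
            if len(gt) != 2 or "." in gt:
                bases = ["N", "N"]  # missing or non-diploid genotype
            else:
                bases = [alleles[int(i)] for i in gt]
            if split_diploids:
                rng.shuffle(bases)
                seqs[name + "_1"].append(bases[0])
                seqs[name + "_2"].append(bases[1])
            else:
                a, b = bases
                seqs[name].append(a if a == b else IUPAC[frozenset((a, b))])
    return {name: "".join(chars) for name, chars in seqs.items()}


def to_phylip(seqs, header=True):
    """Format sequences as sequential Phylip text, one 'name<TAB>sequence' row each."""
    rows = [f"{name}\t{seq}" for name, seq in seqs.items()]
    if header:
        n_sites = len(next(iter(seqs.values()), ""))
        rows.insert(0, f"{len(seqs)} {n_sites}")
    return "\n".join(rows) + "\n"
```

Calling `vcf_to_sequences(open("calls.vcf"), split_diploids=False)` (with a hypothetical file name) would give the single-consensus encoding suited to individual-level tests, while the default gives two haploid sequences per diploid.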

Hopefully this helps, but let me know if you have any other questions.

Paul

silvewheat commented 4 years ago

Hi Paul,

Thanks for your answers. I'll try what you suggested.

Best, Yudong

zsdxgl commented 1 year ago

VCF files always contain some missing genotypes. How should these sites be handled? Can HyDe handle "N"?

pblischak commented 1 year ago

Hi @zsdxgl, HyDe can handle missing data ('N'). It tries to integrate over the uncertainty that missing data creates by assigning a weight of 0.25 to each of the four possible nucleotides.
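The idea of spreading a missing base evenly over the four nucleotides can be illustrated with a short Python sketch. This is only an illustration of the weighting described above, not HyDe's actual code: the function name `resolve_column` is hypothetical, the column is assumed to hold one base per taxon in a four-taxon site pattern, and ambiguity codes other than `N` are not handled.

```python
from itertools import product

BASES = "ACGT"


def resolve_column(column):
    """Map every fully resolved site pattern of one column to its weight.

    Each 'N' is replaced by A, C, G and T in turn, and each choice
    contributes a factor of 0.25 to the weight of the resulting pattern.
    """
    options = [BASES if base == "N" else base for base in column]
    weight = 1.0
    for choices in options:
        weight /= len(choices)  # 1 for a known base, 4 for 'N'
    return {"".join(pattern): weight for pattern in product(*options)}


print(resolve_column("ANGG"))
# {'AAGG': 0.25, 'ACGG': 0.25, 'AGGG': 0.25, 'ATGG': 0.25}
```

A column with two missing bases would be split into 16 patterns with a weight of 0.0625 each, so the weights for any column always sum to 1.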