simonhmartin / genomics_general

General tools for genomic analyses.
341 stars 93 forks

Error with parseVCF.py #111

Open cycarthur opened 7 months ago

cycarthur commented 7 months ago

Hi Simon,

I used parseVCF.py to convert my VCF to a geno file, but the generated file has more than one nucleotide per allele at some positions, which I suspect caused further errors when I tried to pass the file to phyml_sliding_windows.py. Examples below:

Chr01 97 N/N G/G A|A
Chr01 100 N/N T/T T|T
Chr01 104 N/N TTA/TTA TTA/TTA TTA/TTA
Chr01 116 N/N G/G G/G G/G

Do you know if there is any way to prevent this from happening, or to filter out those rows? Thank you.

Best, Arthur

simonhmartin commented 7 months ago

Hi Arthur,

It looks like there are multi-base genotypes in your VCF file. This could be caused by indels, or by using a genotyper that calls multi-base variants. If it's indels, these can be removed with a tool like bcftools filter. If it's multi-site variants (i.e. where the reference and the alternate allele both consist of multiple bases), you will need to look at what people do to simplify outputs from the genotyper you have used.

Simon
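For anyone who would rather drop the affected rows from the geno file itself, here is a minimal sketch of the row filter asked about in the original question. It is not part of genomics_general. The function names are hypothetical, and it assumes whitespace-separated rows laid out as in the examples above (scaffold, position, then one genotype per sample, with alleles separated by `/` or `|`) and that any header line starts with `#`.

```python
import re


def is_single_base_row(line):
    """Return True if every allele in every genotype column is one character."""
    fields = line.split()
    # Assumption: the first two columns are scaffold and position.
    genotypes = fields[2:]
    for genotype in genotypes:
        # Split on either phase separator, e.g. "A|A" or "TTA/TTA".
        alleles = re.split(r"[/|]", genotype)
        if any(len(allele) != 1 for allele in alleles):
            return False
    return True


def filter_geno_lines(lines):
    """Yield header lines and rows that contain only single-base genotypes."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#") or is_single_base_row(line):
            yield line


rows = [
    "Chr01 97 N/N G/G A|A",
    "Chr01 104 N/N TTA/TTA TTA/TTA TTA/TTA",
    "Chr01 116 N/N G/G G/G G/G",
]
kept = list(filter_geno_lines(rows))
print("\n".join(kept))
```

With the rows from this thread, the position 104 row is dropped and the other two are kept. Removing the variants upstream in the VCF, as suggested above, is probably the cleaner route, since this sketch only hides the affected sites after conversion.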