mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences
https://pypi.python.org/pypi/pyfaidx

FastaVariant for checking the reference genome for the VCF file #209

Closed (shelpuk closed this issue 1 year ago)

shelpuk commented 1 year ago

Wonderful tool! Thanks a lot for building and maintaining it!

I need to verify that the reference genome matches the VCF file (i.e., that the REF value at each CHROM and POS in the VCF file matches the nucleotide at that position in the reference genome's FASTA file). I tried to do this with the FastaVariant class but cannot wrap my head around it. Could you please help me understand whether this is the right tool for my task?

Thank you!

mdshw5 commented 1 year ago

Thanks for raising the question. After considering your task, I don't think this package will help you validate that the FASTA matches your VCF. I'm assuming you've already checked whether your VCF has a "reference" entry in the header? See: https://github.com/samtools/hts-specs/blob/144e32acb582b414a281bf9dc06223b43609a489/VCFv4.1.tex#L32
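As a rough illustration of that header check, here is a minimal sketch in plain Python. The function name `vcf_reference_header` is hypothetical and not part of pyfaidx or any VCF library. It assumes an uncompressed VCF supplied as an iterable of text lines.

```python
def vcf_reference_header(lines):
    """Return the value of the first '##reference=' meta line, or None.

    Hypothetical helper for illustration; `lines` is any iterable of
    text lines, such as an open (uncompressed) VCF file handle.
    """
    for line in lines:
        # Meta-information lines start with '##'; stop at the first
        # line that does not (normally the '#CHROM' column header).
        if not line.startswith("##"):
            break
        if line.startswith("##reference="):
            # Keep everything after the first '=' and drop the newline.
            return line.rstrip("\r\n").split("=", 1)[1]
    return None
```

With a file on disk this could be called as `vcf_reference_header(open("calls.vcf"))`, where `calls.vcf` is a placeholder path. The returned string only records what the VCF producer wrote, so it hints at the assembly used without proving it.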

Even with matching contig names in both the FASTA and VCF files, it’s still possible that a different version or patch of the genome assembly might have been used.
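For readers who still want a direct check of REF alleles against a FASTA, the following is a minimal sketch in plain Python that does not use pyfaidx. The helper names `read_fasta` and `ref_mismatches` are hypothetical. It assumes both files are uncompressed, tab-delimited where required, and small enough to hold in memory. A clean result shows only that the checked positions agree, not that the assemblies are identical.

```python
def read_fasta(lines):
    """Load FASTA records into a dict of {contig name: sequence}.

    Hypothetical helper; the contig name is the first
    whitespace-delimited token after '>'.
    """
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def ref_mismatches(vcf_lines, seqs):
    """Return (chrom, pos, vcf_ref, fasta_bases) for disagreeing records.

    fasta_bases is None when the contig is absent from the FASTA.
    """
    mismatches = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta lines, the column header, and blanks
        chrom, pos, _id, ref = line.rstrip("\r\n").split("\t")[:4]
        start = int(pos) - 1  # VCF POS is 1-based
        seq = seqs.get(chrom)
        found = None if seq is None else seq[start:start + len(ref)]
        # Compare case-insensitively, since FASTA may be soft-masked.
        if found is None or found.upper() != ref.upper():
            mismatches.append((chrom, int(pos), ref, found))
    return mismatches
```

An empty list from `ref_mismatches` means every REF allele agreed with the FASTA at its stated position. Many mismatches, or contigs reported with `None`, suggest a different assembly or naming scheme.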