mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0

snp.vcf file #233

Closed hfaoro closed 1 year ago

hfaoro commented 1 year ago

Dear,

I'm following a tutorial to analyse 800 genomes with a binary phenotype, and I'm having trouble understanding where the .vcf file comes from. I used snippy to call variants against a reference, which gave me 800 separate .vcf files. Is the "snp.vcf.gz" file simply a compressed version of those 800 VCFs, or do I need to concatenate the 800 VCFs before compressing?

Thank you in advance.

ireneortega commented 1 year ago

I'm not sure what you mean by the "snp.vcf.gz" file being just a compressed version of those 800 VCFs.

You do need to merge the per-genome VCF files into a single VCF file that contains the SNPs snippy called for every genome. I have used this command:

bcftools merge -m none -0 -O z $VCF/*.vcf.gz -o $VCF/merged.vcf.gz

Here $VCF is the directory that contains the 800 .vcf files, and merged.vcf.gz is the merged file (similar to what you call "concatenated") that you pass to pyseer with the --vcf option. The merged file is written to the same directory as the 800 .vcf files, but you can specify a different directory.
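
For anyone starting from uncompressed .vcf files: bcftools merge expects bgzip-compressed, indexed inputs, which is why the command above globs *.vcf.gz. Below is a minimal sketch of the whole sequence, not a tested pipeline. The names merge_vcfs, run and DRY_RUN are made up for this example, and it assumes all per-sample files sit in one directory with a .vcf extension. By default it only prints the commands so they can be checked first.

```shell
#!/bin/sh
# Sketch: bgzip, index and merge per-sample VCFs into one multi-sample VCF.
# merge_vcfs, run and DRY_RUN are hypothetical names used for illustration.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print a command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: merge_vcfs DIR   (DIR holds the uncompressed per-sample .vcf files)
merge_vcfs() {
    dir="$1"
    set --                          # reuse "$@" to collect the input files
    for f in "$dir"/*.vcf; do
        [ -e "$f" ] || continue     # the glob matched nothing
        run bgzip -f "$f"           # replaces FILE.vcf with FILE.vcf.gz
        run tabix -f -p vcf "$f.gz" # writes the FILE.vcf.gz.tbi index
        set -- "$@" "$f.gz"
    done
    if [ "$#" -eq 0 ]; then
        echo "no .vcf files found in $dir" >&2
        return 1
    fi
    # Same options as the command above. The inputs are listed explicitly,
    # so an existing merged.vcf.gz is never picked up as an input on a re-run.
    run bcftools merge -m none -0 -O z -o "$dir/merged.vcf.gz" "$@"
}
```

Calling merge_vcfs /path/to/vcfs prints the bgzip, tabix and bcftools commands in order. Setting DRY_RUN=0 before the call executes them instead, which requires bgzip, tabix and bcftools on the PATH.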

mgalardini commented 1 year ago

Hi, yes, you need to obtain a single VCF file with the variants for all your 800 samples. The suggested bcftools command is a perfectly fine way to do it.

hfaoro commented 1 year ago

Thank you very much. bcftools solved the problem.