sanger-pathogens / snp-sites

Finds SNP sites from a multi-FASTA alignment file
http://sanger-pathogens.github.io/snp-sites/

snp-sites for all ~2 million SARS-CoV-2 genomes in GISAID #104

Open jielab opened 3 years ago

jielab commented 3 years ago

Dear Andrew:

Can I use snp-sites to process the full FASTA file that I downloaded from GISAID? It has about 1,000,000,000 lines in total for ~2 million SARS-CoV-2 genomes.

I ran it on my laptop and the job was killed. I could try running it on a server, but I would like to confirm with you first that this is feasible. I think I only need to run "snp-sites -v -o output " to produce a VCF file. I should NOT specify "-p", because generating a PHYLIP file for ~2 million genomes might take forever.
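To make the memory question concrete, here is a minimal Python sketch of a single streaming pass that finds variable columns in a multi-FASTA alignment while holding only one sequence and one small integer per column in memory. It is not how snp-sites is implemented; the function name `variable_columns` is hypothetical, and the sketch assumes the input is already aligned (every sequence has the same length) and that only A, C, G and T count as alleles.

```python
import io

def variable_columns(handle):
    """Return 0-based alignment columns where more than one of A/C/G/T occurs.

    Assumption: the input is an aligned multi-FASTA, so all sequences
    have the same length. Gaps, N and other symbols are ignored.
    """
    bit = {"A": 1, "C": 2, "G": 4, "T": 8}
    seen = []  # one bitmask per column, sized from the first sequence

    def flush(chunks):
        seq = "".join(chunks).upper()
        if not seq:
            return
        if not seen:
            seen.extend([0] * len(seq))
        elif len(seq) != len(seen):
            raise ValueError("sequences are not aligned to the same length")
        for i, base in enumerate(seq):
            seen[i] |= bit.get(base, 0)

    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            flush(chunks)  # finish the previous record
            chunks = []
        elif line:
            chunks.append(line)
    flush(chunks)  # finish the last record

    # mask & (mask - 1) is non-zero only when two or more bits are set
    return [i for i, mask in enumerate(seen) if mask & (mask - 1)]

demo = io.StringIO(">s1\nACGT\nAC\n>s2\nACGA\nAC\n>s3\nANGT\nAC\n")
print(variable_columns(demo))  # [3]
```

With this approach memory grows with the alignment length rather than with the number of genomes, although a pure-Python loop over a billion lines would still be slow.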

BTW, I did my PhD at the Sanger Institute, from 2012 to 2015.

Best regards, Jie

Salvobioinfo commented 2 years ago

I'd like to use this tool for the same purpose, but I'm concerned that if the tool calls SNPs against an internal pseudo-reference genome (I think it is the consensus sequence), the results make little sense. For example, in the GISAID MSA file, where the D614G variant is present in almost all sequences, G614 would be treated as wild type (WT) rather than reported as a variant.
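The concern can be illustrated with a toy alignment in Python. This is only a sketch of the scenario described above: it assumes the reference is a per-column majority consensus, which is the commenter's guess and is not confirmed here as snp-sites behaviour. The helper names `consensus` and `variants_against` and the four-base sequences are hypothetical.

```python
from collections import Counter

def consensus(seqs):
    """Majority base at each column of equal-length aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def variants_against(ref, seqs):
    """Map each 0-based column to the set of non-reference bases seen there."""
    out = {}
    for seq in seqs:
        for i, (r, b) in enumerate(zip(ref, seq)):
            if b != r:
                out.setdefault(i, set()).add(b)
    return out

# Toy data: the ancestral sequence carries GAT (Asp) and three of the
# four samples carry GGT (Gly), mimicking a variant that has become dominant.
ancestral = "GATG"
samples = ["GGTG", "GGTG", "GGTG", "GATG"]

cons = consensus(samples)
print(cons)                                  # GGTG
print(variants_against(cons, samples))       # {1: {'A'}}
print(variants_against(ancestral, samples))  # {1: {'G'}}
```

Against the consensus, the dominant G is the reference allele and the ancestral A shows up as the alternate; against the ancestral sequence the roles are reversed. The variable site is detected either way, so only the REF/ALT labelling depends on the choice of reference.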