tseemann / snp-dists

Pairwise SNP distance matrix from a FASTA sequence alignment
GNU General Public License v3.0

Best snippy output file to use? #41

Open · pgcudahy opened this issue 4 years ago

pgcudahy commented 4 years ago

Silly question, but is it preferable to use core.aln or core.full.aln from snippy-core as the input to snp-dists, or even some other file output from snippy?

andersgs commented 4 years ago

@pgcudahy depends on what you need. If you want to calculate distances that account for all the N and - characters you will see in core.full.aln, then use that. If you just want a clean core SNP distance, then use core.aln.
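To make the difference concrete, here is a toy pairwise-difference counter. It is only a sketch, not how snp-dists itself computes distances: the file names and the choice to compare only unambiguous A/C/G/T bases are assumptions for illustration.

```python
# Illustrative sketch only: a minimal pairwise-difference counter to show how
# N/- characters change the numbers. It is NOT a reimplementation of snp-dists;
# file names and the handling of ambiguous bases are assumptions.
from itertools import combinations

UNAMBIGUOUS = set("ACGT")

def read_fasta(path):
    """Read an aligned FASTA into {name: sequence} (all sequences the same length)."""
    seqs, name, chunks = {}, None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    seqs[name] = "".join(chunks).upper()
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks).upper()
    return seqs

def distances(seqs, count_ambiguous=False):
    """Pairwise difference counts.

    count_ambiguous=False: only count positions where both bases are A/C/G/T and
    they differ (roughly the 'clean' distance you would want from core.aln).
    count_ambiguous=True : count any mismatching characters, so the N and '-'
    runs in core.full.aln inflate the distances.
    """
    out = {}
    for a, b in combinations(sorted(seqs), 2):
        sa, sb = seqs[a], seqs[b]
        if count_ambiguous:
            d = sum(x != y for x, y in zip(sa, sb))
        else:
            d = sum(x != y for x, y in zip(sa, sb)
                    if x in UNAMBIGUOUS and y in UNAMBIGUOUS)
        out[(a, b)] = d
    return out

if __name__ == "__main__":
    aln = read_fasta("core.aln")   # or "core.full.aln" to see the inflated counts
    for (a, b), d in distances(aln).items():
        print(a, b, d, sep="\t")
```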

pgcudahy commented 4 years ago

Thanks for the considered response. Ultimately I want SNP distances for looking at transmission. It seemed to me that core.aln was the better choice, but the reason I asked is that I've been having trouble creating it: I have hundreds of samples, and the lowest-quality ones either make snippy-core fail to produce the core.aln file or result in a very short core genome. Are there any rules of thumb for filtering the individual aligned FASTA files (snps.aligned.fa) to get a good core.aln? With a reference length of ~4,100,000 bp, roughly what would a good-quality core genome length be?

andersgs commented 4 years ago

@pgcudahy see my reply in your other issue.

Perhaps a better strategy might be to be ruthless and remove everything that is remotely dodgy. Maybe you end up with just 100 samples or even fewer. But, it gives you a starting point to go from end to end with your analysis with confidence in the data.

Once you have a good idea of how things should look and behave with a high quality subset, you can start relaxing your criteria for inclusion. This way, you have a solid baseline to compare to in order to evaluate different inclusion/exclusion criteria. Keep in mind that the core SNPs will change depending on the samples you include.
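One way to make that "be ruthless first" pass concrete is to screen each per-sample snps.aligned.fa by how much of the reference is unambiguously called and drop anything below a cutoff. The sketch below is only an illustration; the 90% cutoff, the directory layout and the metric itself are assumptions on my part, not documented snippy criteria.

```python
# Rough per-sample screen: drop samples whose snps.aligned.fa has too little of
# the reference unambiguously called. The 0.90 cutoff, the directory layout and
# the "fraction of A/C/G/T calls" metric are illustrative assumptions.
import glob
import os

UNAMBIGUOUS = set("ACGT")

def called_fraction(path):
    """Fraction of characters in an aligned FASTA that are unambiguous A/C/G/T."""
    called = total = 0
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                continue
            seq = line.strip().upper()
            total += len(seq)
            called += sum(base in UNAMBIGUOUS for base in seq)
    return called / total if total else 0.0

def screen_samples(snippy_dir="snippy_runs", cutoff=0.90):
    keep, drop = [], []
    # Assumes one snippy output directory per sample: snippy_runs/<sample>/snps.aligned.fa
    for path in sorted(glob.glob(os.path.join(snippy_dir, "*", "snps.aligned.fa"))):
        sample = os.path.basename(os.path.dirname(path))
        frac = called_fraction(path)
        (keep if frac >= cutoff else drop).append((sample, frac))
    return keep, drop

if __name__ == "__main__":
    keep, drop = screen_samples()
    for sample, frac in drop:
        print(f"excluding {sample}: only {frac:.1%} of positions called")
    print(f"{len(keep)} samples pass the screen")
```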

My limited experience with genomic transmission analysis in TB is that every SNP counts. People are ruthless about excluding SNPs using stringent variant-filtering criteria, and the inclusion of each sample needs to be well justified with regard to its quality.

To answer your question, the size of the reference genome does not by itself tell you what core genome size to expect. As I mentioned in the other issue, the diversity of the sample, the quality and evolutionary distance of the reference genome, and the quality and number of samples will all play a role. Additionally, repeat elements, mobile elements, recombination (not a big issue in TB, AFAIK), and any other genomic elements not present in the MRCA of the samples (and therefore not vertically inherited by all samples in your collection) will affect your core genome.

Best of luck.

pgcudahy commented 3 years ago

OK, thanks again for your very patient and thoughtful answers. I did as you suggested: I used fastp to clean up the fastq files and kraken2 to filter out reads not labelled as Mtb. Then I removed the samples with <40x depth and finally ran the remaining files through snippy and snp-dists. Now I get a core genome of 7,325 bp, which still seems short for a reference of >4,000,000 bp. To check this, I ran 100 trials in which I picked ten samples at random, ran them through snp-dists, and recorded the core genome length in base pairs. The distribution was

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1209    2117    2408    2448    2768    4024 

[image: histogram of core genome lengths across the 100 random ten-sample trials]

So it looks like my final result is reasonable for my samples. Does this seem right to you, to have a core genome that short?
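For reference, the subsampling check described above can be scripted roughly as follows. This is a sketch under assumptions: the directory layout, invoking snippy-core per subset, and taking the length of the resulting core.aln as the "core genome length" are my choices, not necessarily exactly what was run here.

```python
# Sketch of the subsampling check: pick 10 samples at random, rebuild the core
# alignment for just those samples, and record its length. The directory layout,
# the snippy-core invocation and using len(core.aln) as the "core genome length"
# are assumptions, not part of snippy's documentation.
import random
import subprocess
from pathlib import Path

def core_aln_length(fasta):
    """Length of the first sequence in an aligned FASTA (all rows are equal length)."""
    seq = []
    with open(fasta) as fh:
        for line in fh:
            if line.startswith(">"):
                if seq:
                    break
                continue
            seq.append(line.strip())
    return len("".join(seq))

def one_trial(sample_dirs, reference, k=10, workdir=Path("subsample_trial")):
    """Run snippy-core on a random subset of k snippy output directories."""
    subset = random.sample(sample_dirs, k)
    workdir.mkdir(exist_ok=True)
    prefix = workdir / "core"
    subprocess.run(
        ["snippy-core", "--ref", reference, "--prefix", str(prefix)]
        + [str(d) for d in subset],
        check=True,
    )
    return core_aln_length(f"{prefix}.aln")

if __name__ == "__main__":
    # Assumes one snippy output directory per sample under snippy_runs/
    sample_dirs = sorted(p for p in Path("snippy_runs").glob("*") if p.is_dir())
    lengths = sorted(one_trial(sample_dirs, "ref.fa") for _ in range(100))
    print("min", lengths[0], "median", lengths[len(lengths) // 2], "max", lengths[-1])
```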