BertBog closed 6 years ago
The problem, as you guessed, is a 'bad' cluster, or more specifically a 'bad' strain. It is this one:
This means that strain lm11_1 is -1.759*(standard deviation of the pairwise distance) from the cluster median pairwise distance. E.g. if the median pairwise distance of strains within this cluster is 10 and the standard deviation is 2, then this strain has a median pairwise distance of about 6.5 SNPs from the other strains (10 - 1.759*2 = 6.48).
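To make the arithmetic concrete, here is a minimal sketch of that calculation. It assumes the z-score is simply (strain median distance - cluster median distance) / standard deviation, which matches the description above but is not copied from the SnapperDB source; the function names are made up for illustration.

```python
# Assumed definition: z = (strain_median - cluster_median) / cluster_sd.
# Function names are hypothetical, not part of SnapperDB.

ZSCORE_CUTOFF = -1.75  # cutoff quoted in this thread


def zscore(strain_median, cluster_median, cluster_sd):
    """Z-score of one strain's median pairwise distance within its cluster."""
    return (strain_median - cluster_median) / cluster_sd


def median_distance_from_zscore(cluster_median, cluster_sd, z):
    """Invert the z-score to recover the strain's median pairwise distance."""
    return cluster_median + z * cluster_sd


# Worked example from the text: cluster median 10, standard deviation 2.
strain_median = median_distance_from_zscore(10, 2, -1.759)
trips_cutoff = zscore(strain_median, 10, 2) < ZSCORE_CUTOFF

print(round(strain_median, 2))  # 6.48
print(trips_cutoff)             # True
```

With these numbers the strain sits only just below the cutoff, which is why it is flagged.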
This is a bad thing because if a strain is too close to all the other strains in your cluster it might mean that you have sequenced a mixed culture (one of the strains in the mix is in your cluster, the other is not). This will lead to 'N's instead of SNPs at variant positions, which will make this strain appear closer than expected to all the other strains in the cluster. This has the potential to 'collapse' clusters and create big headaches.
If you want to troubleshoot, run 'get_the_snps' for the strain that is causing trouble, lm11_1, and 5 or 6 other strains at varying distances from lm11_1. Then visually inspect the alignment. If there are loads of singleton Ns scattered through the lm11_1 sequence, then the strain could be a mix and maybe should be ignored. If there are just a few Ns, or they are clustered (which may mean that part of the ref genome is not well covered/present in lm11_1), then you can allow the strain into your clusters.
You do this by updating the 'zscore_check' field in the strain_stats table to be 'Y', using e.g. pgAdmin. This means you have checked a strain which tripped the zscore, and it's ok.
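If eyeballing the alignment gets tedious, a rough count of isolated versus contiguous Ns can help. The sketch below is only a heuristic of my own, not something SnapperDB provides: it takes one aligned sequence as a string and summarises its runs of N. The function name and the example sequences are made up. Bear in mind that adjacent columns in a variants-only alignment are neighbours in the alignment, not necessarily in the genome.

```python
import re


def n_run_summary(seq):
    """Summarise runs of 'N' in one aligned sequence (hypothetical helper)."""
    # Length of every maximal run of consecutive Ns.
    runs = [m.end() - m.start() for m in re.finditer(r"N+", seq.upper())]
    singletons = sum(1 for length in runs if length == 1)
    return {
        "total_n": sum(runs),
        "singleton_runs": singletons,              # isolated Ns
        "clustered_runs": len(runs) - singletons,  # runs of 2 or more Ns
        "longest_run": max(runs, default=0),
    }


# Made-up sequences: scattered singleton Ns versus one block of Ns.
scattered = n_run_summary("ACNGTNACGNTT")
clustered = n_run_summary("ACNNNNNGTACG")
print(scattered)
print(clustered)
```

Many singleton runs would point towards the mixed-culture explanation, while a few long runs look more like a region that is poorly covered or absent in that strain.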
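For anyone who prefers a script to pgAdmin, the update amounts to a single UPDATE statement. The sketch below runs it against an in-memory SQLite stand-in so that it is self-contained; the real SnapperDB database is PostgreSQL, and the `name` column used to identify the strain is an assumption on my part, so check the actual strain_stats schema first. The second strain name is invented for the demo.

```python
import sqlite3

# Stand-in table for illustration only. The real strain_stats table lives in
# PostgreSQL and has more columns; the 'name' column is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strain_stats (name TEXT PRIMARY KEY, zscore_check TEXT)")
conn.executemany(
    "INSERT INTO strain_stats VALUES (?, ?)",
    [("lm11_1", None), ("other_strain", None)],  # 'other_strain' is made up
)


def mark_zscore_checked(connection, strain):
    """Set zscore_check to 'Y' for one strain and return the rows changed."""
    cur = connection.execute(
        "UPDATE strain_stats SET zscore_check = 'Y' WHERE name = ?",
        (strain,),
    )
    connection.commit()
    return cur.rowcount


updated = mark_zscore_checked(conn, "lm11_1")
print(updated)
print(conn.execute("SELECT name, zscore_check FROM strain_stats ORDER BY name").fetchall())
```

Checking the returned row count is a cheap way to catch a mistyped strain name, since an UPDATE that matches nothing succeeds silently.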
The cutoff is -1.75, so your strain has just tripped it, and is probably fine.
This is based on my slightly rusty memory, @timdallman should probably confirm.
Alright, thanks a lot for this clear explanation! This makes a lot of sense. I've managed to add the remaining strains to the database. There were indeed quite a lot of N's in the SNP matrix for this sample.
Thanks for the perfect explanation Phil - I appreciate this is not explained in the docs yet and will rectify as soon as possible. Tim
Oh, and the final thing to say is that if you decide to ignore it, then you should update the 'ignore' column in strain_stats for this strain. Then, the clustering will run to completion in future if you add more strains that don't have problems (I think).
Hey,
I'm experimenting a bit with SnapperDB and I have a problem adding additional strains to the database. I've successfully created a database with 14 strains. Now when I try to add 6 more, they are not added to the 'strain_clusters' table in the database (they are added to the 'strain_stats' table however).
I used the example commands from the tutorial: fastq_to_db (for all FASTQ files), update_distance_matrix, update_clusters
I get the following output:
I think the clusters are somehow flagged as 'bad' clusters, but the previous clusters were really similar and they were successfully added to the db. Do you know what might be the cause of this?