ukhsa-collaboration / snapperdb

GNU General Public License v3.0
23 stars 5 forks source link

Strains not added to the database #6

Closed BertBog closed 6 years ago

BertBog commented 6 years ago

Hey,

I'm experimenting a bit with SnapperDB and I have a problem adding additional strains to the database. I've successfully created a database with 14 strains. Now when I try to add 6 more, they are not added to the 'strain_clusters' table in the database (they are added to the 'strain_stats' table however).

I used the example commands from the tutorial: fastq_to_db (for all FASTQ files) update_distance_matrix update_clusters

I get the following output:

Namespace(command='update_clusters', config_file='config_listeria.txt', log_dir='/scratch/bebog/snapperdb/snapperdb')
### Cluster level 250 :16:36:12.547415
### Cluster level 100 :16:36:12.547994
### Cluster level 50 :16:36:12.548456
### Cluster level 25 :16:36:12.548889
### Cluster level 10 :16:36:12.549544
### Cluster level 5 :16:36:12.550196
### Cluster level 0 :16:36:12.550756
### Getting previously checked outliers:16:36:12.551623
### Checking Clusters:16:36:12.553243
lm16_1 [1, 1, 1, 1, 1, 2, 17]
### Investigating 250 1
### Average Distance 118.39
### Standard Deviation 128.15204436
### Investigating 100 1
### Average Distance 118.39
### Standard Deviation 128.15204436
### Investigating 50 1
### Average Distance 118.39
### Standard Deviation 128.15204436
### Investigating 25 1
### Average Distance 32.7755102041
### Standard Deviation 7.99150142893
### Outlier Z-Score lm11_1 18.7142857143 -1.75952223932
### Investigating 10 1
### Average Distance 18.48
### Standard Deviation 14.0439310736
### Investigating 5 2
### Average Distance 9.5
### Standard Deviation 8.23356949729
### Investigating 0 17
### Average Distance 0.0
lm20_1 [1, 1, 1, 1, 1, 2, 20]
### Already Investigated 250 1
### Already Investigated 100 1
### Already Investigated 50 1
### Already Investigated 25 1
### Already Investigated 10 1
### Already Investigated 5 2
### Investigating 0 20
### Average Distance 0.0
lm19_1 [1, 1, 1, 2, 2, 3, 16]
### Already Investigated 250 1
### Already Investigated 100 1
### Already Investigated 50 1
### Investigating 25 2
### Average Distance 4.75
### Standard Deviation 2.26384628453
### Investigating 10 2
### Average Distance 4.75
### Standard Deviation 2.26384628453
### Investigating 5 3
### Average Distance 1.33333333333
### Standard Deviation 0.333333333333
### Investigating 0 16
### Average Distance 0.0
lm18_1 [1, 1, 1, 2, 2, 3, 18]
### Already Investigated 250 1
### Already Investigated 100 1
### Already Investigated 50 1
### Already Investigated 25 2
### Already Investigated 10 2
### Already Investigated 5 3
### Investigating 0 18
### Average Distance 0.0
lm17_1 [1, 1, 1, 2, 2, 14, 15]
### Already Investigated 250 1
### Already Investigated 100 1
### Already Investigated 50 1
### Already Investigated 25 2
### Already Investigated 10 2
### Investigating 5 14
### Average Distance 0.0
### Investigating 0 15
### Average Distance 0.0
lm15_1 [1, 1, 1, 4, 12, 15, 19]
### Already Investigated 250 1
### Already Investigated 100 1
### Already Investigated 50 1
### Investigating 25 4
### Average Distance 0.0
### Investigating 10 12
### Average Distance 0.0
### Investigating 5 15
### Average Distance 0.0
### Investigating 0 19
### Average Distance 0.0
### Not Updating Clusters:16:36:12.555282
### Completed 2017-11-07 16:36:12.555548

I think the clusters are somehow flagged as 'bad' clusters, but the previous clusters were really similar and they were successfully added to the db. Do you know what might be the cause of this?

flashton2003 commented 6 years ago

The problem, as you guessed is a 'bad' cluster, or more specifically a 'bad' strain. It is this one

Outlier Z-Score lm11_1 18.7142857143 -1.75952223932

This means that strain lm11_1 is -1.759*(standard deviation of the pairwise distance) from the cluster median pairwise distance distance. e.g. if the median pairwise distance of strains within this cluster is 10 and the standard deviation is 2, then this strain has a median pairwise distance of 6.5 SNPs from the other strains.

This is a bad thing because if a strain is too close to all the other strains in your cluster it might mean that you have sequenced a mixed culture (one of the strains in the mix is in your cluster, the other is not). This will lead to 'N's instead of SNPs at variant positions, which will make this strain appear closer than expected to all the other strains in the cluster. This has the potential to 'collapse' clusters and create big headaches.

If you want to troubleshoot, then 'get_the_snps' for the strain that is causing trouble, lm11_1, and 5 or 6 other strains at varying distances from lm11_1. Then visually inspect the alignment, if there are loads of singleton Ns scattered through the lm11_1 sequence, then the strain could be a mix and maybe should be ignored. If there are just a few Ns, or they are clustered (which may mean that part of the ref genome is not well covered/present in lm11_1), then you can allow the strain into your clusters.

You do this by updating the 'zscore_check' field in strain_stats table to be 'Y', using e.g. pgAdmin. This means you have checked a strain which tripped the zscore, and it's ok.

The cutoff is -1.75, so your strain has just tripped it, and is probably fine.

This is based on my slightly rusty memory, @timdallman should probably confirm.

BertBog commented 6 years ago

Alright, thanks a lot for this clear explanation! This makes a lot of sense. I've managed to add the remaining strains to the database. There were indeed quite a lot of N's in the SNP matrix for this sample.

timdallman commented 6 years ago

Thanks for the perfect explanation Phil - I appreciate this is not explained in the docs yet and will rectify as soon as possible. Tim

flashton2003 commented 6 years ago

Oh, and the final thing to say is that if you decide to ignore it, then you should update the 'ignore' column in strain_stats for this strain. Then, the clustering will run to completion in future if you add more strains that dont have problems (I think).