nickjcroucher / gubbins

Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins
http://nickjcroucher.github.io/gubbins/
GNU General Public License v2.0

How to eliminate false-positive #283

Closed Noornoor440 closed 10 months ago

Noornoor440 commented 4 years ago

Hi,

I ran Gubbins on 35 E. coli samples. Around 14 samples came out with SNP distances <=10. However, classical MLST based on the 7 housekeeping genes is not consistent with the high clonality suggested by the SNPs. Detection of recombination based on SNP density does not seem able to distinguish heterogeneity rate from interspecific recombination. One of the seven genes was missing from the alignment file I obtained from Gubbins, which was not the case in the alignment file I obtained from Snippy.

I tried changing the minimum number of substitutions from the default of 3, hoping this would reduce the stringency of recombination detection, but with no success yet. Masking would not help either, unless there is a way to ask Gubbins to ignore these 7 genes and not filter them out as putative recombination sites!

I wonder whether anyone else has faced a similar issue and figured out how to sort it out.

Thanks,

Hissa

matbeale commented 4 years ago

Hi Hissa,

I'll try to help, but could you please provide a little more information on your exact problem? 14 samples with <10 SNPs is quite low for unrelated E. coli genomes - are these from the same patient or a putative transmission network? What is the SNP distance based on - a reference alignment? Are you running gubbins on a full genome length reference alignment?

When you say the MLST gene is 'missing', exactly what do you mean? It's not unheard of for an MLST locus to be subject to recombination, and if that is the case then gubbins would be expected to remove sites in that gene (leading to them being masked as 'N' in the polymorphic sites output). By default, gubbins doesn't output a full length masked multiple sequence alignment, only masked polymorphic sites, so one would not expect to be able to find 'genes' in that file. If you've reconstructed the full length WGS alignment, how was this done, and could the issue be there?
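
One rough way to check whether a gene has been masked, assuming you do have a full length masked alignment in FASTA format and know the gene's coordinates in the reference, is to measure the fraction of 'N' (or gap) characters in that window for each sample. The sketch below is only an illustration: the sample names, sequences and coordinates are made up, and `read_fasta` and `masked_fraction` are hypothetical helpers, not part of gubbins.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {sequence name: sequence}."""
    parts: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def masked_fraction(seq: str, start: int, end: int) -> float:
    """Fraction of 'N' or '-' characters in a 1-based, inclusive window."""
    window = seq[start - 1:end].upper()
    if not window:
        return 0.0
    return sum(base in "N-" for base in window) / len(window)


# Toy alignment (made-up data); in practice, pass an open file handle
# for the full length alignment to read_fasta instead.
example = [">sampleA", "ACGTNNNNAC", ">sampleB", "ACGTACGTAC"]
alignment = read_fasta(example)

# Hypothetical gene coordinates: positions 5 to 8 of the alignment.
for sample, seq in alignment.items():
    print(sample, masked_fraction(seq, 5, 8))
```

A window that is mostly 'N' in some samples but not others would suggest the region was masked as putative recombination on those branches, rather than being absent from the original alignment.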

Is there a specific reason you are trying to recover MLST genes from a gubbins filtered alignment, or is this just for testing purposes?

Thanks, Mat

Noornoor440 commented 4 years ago

Thanks Mat for responding,

The collection is genetically quite diverse according to the classical MLST data (7 housekeeping genes), which yielded 22 distinct STs. The samples were obtained for surveillance purposes and fully sequenced with Illumina. Initially, I called the variants with Snippy against a complete reference. Snippy did not filter out the MLST genes. Then I ran Gubbins and got a matrix with SNPs-dist. It was sloppy of me to BLAST the Gubbins output; thanks for pointing this out, Mat.

What looks weird to me is having isolates with different STs that are < 10 SNPs apart. I agree that recombination within MLST loci is not a rare event, but it is less frequent in E. coli, and other typing approaches suggest some of the isolates are not genetically close, which supports my assumption that I have false positives. I thought I could fix the issue by altering some of the default values, but that has not worked so far.
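
To illustrate how masking alone can shrink pairwise distances, here is a minimal sketch that counts SNP differences between aligned sequences while skipping any position where either sequence is 'N' or a gap. The isolate names and sequences are made up, and `snp_distance` is a hypothetical helper that is not necessarily how Gubbins or any particular distance tool counts differences.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def snp_distance(a: str, b: str) -> int:
    """Count positions where both sequences have A/C/G/T and differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x in VALID_BASES and y in VALID_BASES and x != y
    )


# Toy data: the same three isolates before and after a region
# (positions 5 to 8) is masked as 'N' in iso1 and iso2.
unmasked = {"iso1": "ACGTACGTAC", "iso2": "ACGTTTTTAC", "iso3": "ACGAACGTAC"}
masked = {"iso1": "ACGTNNNNAC", "iso2": "ACGTNNNNAC", "iso3": "ACGAACGTAC"}

for label, aln in (("unmasked", unmasked), ("masked", masked)):
    for s1, s2 in combinations(sorted(aln), 2):
        print(label, s1, s2, snp_distance(aln[s1], aln[s2]))
```

In this toy data, iso1 and iso2 differ at three sites before masking and at none afterwards, so comparing the distance matrix from the unmasked Snippy alignment with the one from the masked output would show how much of the apparent clonality comes from the masking itself.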

Thanks,

Hissa