MLST vs kraken identification conflict

rajaldebnath commented 4 years ago

Hello Prof, I recently sequenced a genome with Illumina paired-end 2x250 bp reads. The genome size estimated from the trimmed, processed dataset was 4.78 Mb. The SPAdes assembly, run with error correction enabled, produced 25 contigs: 14 contigs had k-mer coverage of ~20 or higher (13 of them >=1000 bp), and the remaining 11 contigs had k-mer coverage below 1.

NODE_4_length_250101_cov_23.115820
NODE_9_length_29209_cov_24.448112
NODE_6_length_142243_cov_23.106251
NODE_3_length_498732_cov_23.575169
NODE_1_length_2818498_cov_22.643412
NODE_2_length_704858_cov_23.240266
NODE_10_length_26310_cov_22.249780
NODE_5_length_173468_cov_24.367282
NODE_8_length_83392_cov_23.495106
NODE_7_length_98826_cov_22.383347
NODE_11_length_2595_cov_167.377229
NODE_12_length_1630_cov_170.069195
NODE_13_length_1025_cov_90.433185
NODE_14_length_929_cov_49.255611
NODE_15_length_484_cov_0.803922
NODE_16_length_460_cov_0.738739
NODE_17_length_418_cov_0.838488
NODE_18_length_416_cov_0.671280
NODE_19_length_412_cov_0.782456
NODE_20_length_412_cov_0.750877
NODE_21_length_411_cov_0.700704
NODE_22_length_407_cov_0.625000
NODE_23_length_402_cov_0.789091
NODE_24_length_395_cov_0.634328
NODE_25_length_378_cov_0.920319
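For readers who want to check the contig summary above, here is a minimal Python sketch that parses the SPAdes-style contig names from the post and counts how many clear a length and a coverage cutoff. The 1000 bp and 1x cutoffs are taken from the post; the variable names are only illustrative.

```python
import re

# Contig names exactly as listed in the post above.
NAMES = """
NODE_4_length_250101_cov_23.115820 NODE_9_length_29209_cov_24.448112
NODE_6_length_142243_cov_23.106251 NODE_3_length_498732_cov_23.575169
NODE_1_length_2818498_cov_22.643412 NODE_2_length_704858_cov_23.240266
NODE_10_length_26310_cov_22.249780 NODE_5_length_173468_cov_24.367282
NODE_8_length_83392_cov_23.495106 NODE_7_length_98826_cov_22.383347
NODE_11_length_2595_cov_167.377229 NODE_12_length_1630_cov_170.069195
NODE_13_length_1025_cov_90.433185 NODE_14_length_929_cov_49.255611
NODE_15_length_484_cov_0.803922 NODE_16_length_460_cov_0.738739
NODE_17_length_418_cov_0.838488 NODE_18_length_416_cov_0.671280
NODE_19_length_412_cov_0.782456 NODE_20_length_412_cov_0.750877
NODE_21_length_411_cov_0.700704 NODE_22_length_407_cov_0.625000
NODE_23_length_402_cov_0.789091 NODE_24_length_395_cov_0.634328
NODE_25_length_378_cov_0.920319
""".split()

HEADER = re.compile(r"NODE_(\d+)_length_(\d+)_cov_([\d.]+)")

def parse_name(name):
    """Return (node, length_bp, kmer_cov) from a SPAdes-style contig name."""
    match = HEADER.fullmatch(name)
    if match is None:
        raise ValueError(f"not a SPAdes-style contig name: {name!r}")
    return int(match.group(1)), int(match.group(2)), float(match.group(3))

records = [parse_name(name) for name in NAMES]
long_contigs = [r for r in records if r[1] >= 1000]   # length cutoff from the post
low_cov = [r for r in records if r[2] < 1.0]          # coverage cutoff from the post

print(len(records), len(long_contigs), len(low_cov))  # prints: 25 13 11
```

Note that NODE_14 (929 bp, coverage ~49) falls just under the length cutoff while being well covered, so the two cutoffs do not select the same set of contigs.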

Taxonomic classification of the reads against the kraken database gave the following report:

80.52  587259  92719   G   547     Enterobacter
63.00  459511  149822  G1  354276  Enterobacter cloacae complex
19.39  141394  128956  S   550     Enterobacter cloacae
. . .
12.83  93585   57450   S   61645   Enterobacter asburiae

But mlst analysis of the assembled genome identified it as E. cloacae:

ecloacae,-,dnaA(244),fusA(25),gyrB(~234),leuS(254),pyrG(204),rplB(104),rpoB(27)
[11:17:29] Found exact allele match ecloacae.pyrG-204
[11:17:29] Found exact allele match ecloacae.leuS-254
[11:17:29] Found exact allele match ecloacae.dnaA-244
[11:17:29] Found exact allele match ecloacae.fusA-25
[11:17:29] Found exact allele match cronobacter.fusA-59
[11:17:29] Found exact allele match ecloacae.rpoB-27
[11:17:29] Found exact allele match ecloacae.rplB-104
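When reading the mlst line above, it can help to separate exact allele matches from inexact ones. The sketch below assumes the allele notation described in the mlst README (`~n` for a novel full-length allele similar to n, `n?` for a partial match, `-` for a missing allele, `n,m` for multiple alleles); the function name and status labels are illustrative.

```python
import re

# Allele calls copied from the mlst output line in the post above.
CALLS = "dnaA(244),fusA(25),gyrB(~234),leuS(254),pyrG(204),rplB(104),rpoB(27)"

CALL = re.compile(r"(\w+)\(([^)]*)\)")

def classify(allele):
    """Map an mlst allele string to a status label (notation assumed from the mlst README)."""
    if allele == "-":
        return "missing"
    if "," in allele:
        return "multiple"
    if allele.startswith("~"):
        return "novel, similar to a known allele"
    if allele.endswith("?"):
        return "partial match"
    return "exact"

statuses = {gene: classify(allele) for gene, allele in CALL.findall(CALLS)}
not_exact = [gene for gene, status in statuses.items() if status != "exact"]

print(not_exact)  # prints: ['gyrB']
```

Six of the seven loci match exactly and gyrB does not, which is consistent with the `-` in the sequence type column: no known ST was assigned to this allele combination.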

I also ran a LASTZ alignment against all the complete Enterobacter genomes available in NCBI, after importing them into Galaxy. Based on the fewest alignment blocks and percent identity above 90, the assembled genome appears closest to Enterobacter asburiae. I also ran a nucmer dotplot analysis between the assembly and the E. asburiae genome, and between the assembly and an E. cloacae genome. After reordering the contigs with Mauve, the blocks were more syntenic with E. asburiae than with E. cloacae.

The 16S fragment, however, was classified as E. cloacae.

I am confused about the identity of the bacterium: is it E. cloacae or E. asburiae, and how should I select the closest matches for downstream comparative analysis?

Looking forward to your suggestions.
Rajal

lskatz commented 4 years ago

ANI is more definitive. I'd want to know the ANI results when comparing against all these Enterobacter species.

tseemann commented 4 years ago

Maybe use https://github.com/ParBLiSS/FastANI

tseemann commented 4 years ago

I built a mashtree of all Enterobacter genomes in Refseq. The taxonomy looks very messy. Recombination is a big factor of course. Look at the attached Newick file in your favourite tree viewing software.

enterobacter.nwk.txt

rajaldebnath commented 4 years ago

Hi @lskatz @tseemann ,

I ran FastANI against all the Enterobacter genomes (150) in the RefSeq database. The closest one is Enterobacter asburiae (AEB30), with an ANI of 98.3303. I am attaching the FastANI output file here.
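A sorted list like the attached one can be produced from raw FastANI output with a few lines of Python. This is a minimal sketch, assuming the standard tab-separated FastANI output (query, reference, ANI, mapped fragments, total query fragments). The demo rows are made up for illustration, apart from the 98.3303 value quoted above, and the file names `query.fasta` and `ref_*.fna` are hypothetical.

```python
import csv
import io

def rank_fastani(handle, top=5):
    """Return up to `top` (reference, ANI) pairs, highest ANI first."""
    rows = csv.reader(handle, delimiter="\t")
    hits = [(row[1], float(row[2])) for row in rows if len(row) >= 3]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)[:top]

# Illustrative rows only: names and most values are NOT from the attached file.
demo = io.StringIO(
    "query.fasta\tref_A.fna\t94.1\t1400\t1590\n"
    "query.fasta\tref_B.fna\t98.3303\t1500\t1590\n"
    "query.fasta\tref_C.fna\t88.7\t1100\t1590\n"
)

for reference, ani in rank_fastani(demo):
    print(f"{ani:8.4f}  {reference}")
```

With a real output file, `rank_fastani(open(path))` would take the place of the in-memory demo.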

FastANI_sorted.txt

rajaldebnath commented 4 years ago

So, can recombination and rearrangements significantly affect the mlst result? Would it help to filter out the contigs shorter than 1000 bp and with k-mer coverage below 1? Can I discard them altogether, proceed with Prokka annotation, and take Enterobacter asburiae as the closest genome for downstream analysis?
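The filtering step asked about here could be sketched as follows. This is a minimal, dependency-free example that assumes SPAdes-style headers (`..._length_N_cov_X`) and drops any contig that fails either cutoff; the function names and the input and output paths are hypothetical.

```python
import re

HEADER = re.compile(r"length_(\d+)_cov_([\d.]+)")

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def keep(header, min_len=1000, min_cov=1.0):
    """True if the contig passes both the length and the coverage cutoff."""
    match = HEADER.search(header)
    if match is None:
        return True  # unrecognised header: keep it rather than drop it silently
    return int(match.group(1)) >= min_len and float(match.group(2)) >= min_cov

def filter_contigs(in_path, out_path, min_len=1000, min_cov=1.0):
    """Write the contigs that pass both cutoffs to out_path; return how many were kept."""
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in read_fasta(src):
            if keep(header, min_len, min_cov):
                dst.write(f">{header}\n")
                for start in range(0, len(seq), 60):
                    dst.write(seq[start:start + 60] + "\n")
                kept += 1
    return kept
```

A call such as `filter_contigs("contigs.fasta", "contigs.filtered.fasta")` would apply the cutoffs from the question. With the contigs listed earlier in the thread, a 1000 bp cutoff also drops NODE_14 (929 bp, coverage ~49), whereas `min_len=500` would keep it and still remove every contig with coverage below 1.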

I really appreciate your suggestions in my analysis.

lskatz commented 4 years ago

If your diversity was good (and it sounds that way), then these look like clear results, with about 98% ANI to the top hit and the next closest around 94%. I think it's that species!
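The reasoning in this reply can be written down as a small check. The sketch below assumes the commonly used ~95% ANI species boundary, which is a convention and not something stated in this thread. The 94.0 value stands in for "around 94%", and the label "next closest reference" is a placeholder because the thread does not name that genome.

```python
def call_species(hits, boundary=95.0):
    """Return (label, ani, margin) for the best hit if it clears `boundary`, else None.

    `hits` is a list of (label, ani_percent) pairs. `margin` is the gap to the
    runner-up, or None when there is only one hit.
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    if not ranked or ranked[0][1] < boundary:
        return None
    margin = ranked[0][1] - ranked[1][1] if len(ranked) > 1 else None
    return ranked[0][0], ranked[0][1], margin

hits = [
    ("Enterobacter asburiae (AEB30)", 98.3303),  # value reported in the thread
    ("next closest reference", 94.0),            # approximate, per "around 94%"
]
result = call_species(hits)
print(result)
```

For the margin to be meaningful, `hits` should hold one best value per species, since several reference genomes of the same species would otherwise crowd the top of the list.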

tseemann / mlst

MLST vs kraken identification conflict #94