Closed sjaenick closed 1 year ago
Hmm.. that's interesting. What DB of kraken1/2 did you use?
We're creating our own databases, see
https://github.com/MGX-metagenomics/databases/blob/master/kraken.build https://github.com/MGX-metagenomics/databases/blob/master/kraken2.build
I just found it strange that both Kraken versions are confident enough to assign at the species level, while at the same time Kovacikia isn't detected by Metabuli at all.
Just to verify - I downloaded the genome FASTA for https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP083582.1
and split it up into short fragments with cut_up_fasta.py
from CONCOCT (https://github.com/BinPro/CONCOCT/blob/develop/scripts/cut_up_fasta.py).
cut_up_fasta.py -c 300 -o 0 Kovacikia_genome.fas > Kovacikia_genome_fragmented.fas
Current git master doesn't compile, so I compiled Metabuli based on commit a449109.
metabuli classify --seq-mode 1 Kovacikia_genome_fragmented.fas /vol/biodb/local_databases/MGX/metabuli/refseq_release217/ Kovacikia_genome_fragmented job --min-sp-score 0.5 --min-score 0.15 --threads 10
(No 'overflow!!!' warnings here)
And there it is:
99.9589 24317 24317 strain 2653194 Kovacikia minuta CCNUW1
So, this is interesting. Is it Metabuli unable to identify Kovacikia in the reads, or is it Kraken 1 and 2 reporting a false positive in high abundance? I don't know.
I checked the Kovacikia.fq.txt
, and most of them seem like very conserved rRNA.
Metabuli found many species matching the reads, so the reads were classified to LCA of the species.
So most of the reads were classified at the superkingdom or class rank.
Kraken2 also classifies the reads to LCA of equally good matching species, so the same situation should occur with the same DB. But genomes included in your Kraken2 DB are different to RefSeq release 217.
I have Kraken2 DB and Metabuli DB that have been built using the same genomes, so I tested using the DBs.
## Metabuli
0.4000 4 4 no rank 0 unclassified
99.7000 997 0 no rank 1 root
99.7000 997 0 superkingdom 8034 d__Bacteria
99.7000 997 0 phylum 154651 p__Cyanobacteria
99.7000 997 128 class 154652 c__Cyanobacteriia
86.9000 869 708 order 163398 o__Cyanobacteriales
15.5000 155 0 family 203425 f__Coleofasciculaceae
15.5000 155 0 genus 287014 g__PCC7113
15.5000 155 0 species 287015 s__PCC7113 sp000317515
15.5000 155 155 subspecies 287016 RS_GCF_000317515.1
0.4000 4 0 family 181590 f__Microcoleaceae
0.2000 2 1 genus 188308 g__Limnospira
0.1000 1 1 species 188309 s__Limnospira fusiformis
0.2000 2 0 genus 262813 g__Trichodesmium
0.2000 2 0 species 283287 s__Trichodesmium erythraeum
0.2000 2 2 subspecies 283288 RS_GCF_000014265.1
0.1000 1 0 family 166780 f__Nostocaceae
0.1000 1 0 genus 316381 g__HT-58-2
0.1000 1 0 species 316382 s__HT-58-2 sp002163975
0.1000 1 1 subspecies 316383 RS_GCF_002163975.1
0.1000 1 0 family 189803 f__Cyanobacteriaceae
0.1000 1 0 genus 189804 g__Limnothrix
0.1000 1 1 species 189805 s__Limnothrix sp001693275
## Kraken2
1.50 15 15 U 0 unclassified
98.50 985 0 R 1 root
98.50 985 0 D 8034 d__Bacteria
98.50 985 0 P 154651 p__Cyanobacteria
98.50 985 0 C 154652 c__Cyanobacteriia
98.50 985 877 O 163398 o__Cyanobacteriales
10.30 103 0 F 203425 f__Coleofasciculaceae
10.30 103 0 G 287014 g__PCC7113
10.30 103 0 S 287015 s__PCC7113 sp000317515
10.30 103 103 S1 287016 RS_GCF_000317515.1
0.30 3 0 F 166780 f__Nostocaceae
0.20 2 0 G 242292 g__Aulosira
0.20 2 0 S 287680 s__Aulosira sp002368175
0.20 2 2 S1 287681 RS_GCF_002368175.1
0.10 1 0 G 316381 g__HT-58-2
0.10 1 0 S 316382 s__HT-58-2 sp002163975
0.10 1 1 S1 316383 RS_GCF_002163975.1
0.20 2 0 F 181590 f__Microcoleaceae
0.20 2 0 G 262813 g__Trichodesmium
0.20 2 0 S 283287 s__Trichodesmium erythraeum
0.20 2 2 S1 283288 RS_GCF_000014265.1
Both Metabuli and Kraken2 classified most reads at the ranks above family. And the RS_GCF_000317515.1 was the most abundant strain in both results. So, I think the difference you found is due to using DBs built from different genomes.
Interesting, thanks for crosschecking.
Hi @jaebeom-kim,
I'm attaching a FASTQ file with sequences from a public dataset; applying Kraken 1 or Kraken 2, these are getting classified as 'Kovacikia minuta' at the species level. The dataset is from a microbial mat obtained from corals, and these are rich in Cyanobacteria, so the assignment to Kovacikia seems to make sense. (With both Kraken versions, Kovacikia is actually the dominant genus in this dataset.)
Using Metabuli with the RefSeq217 database, I don't get any Kovacikia hits at all; according to the
database-report
workflow, this genome is present in the database, though. Instead, there are very few reads classified below 'class' rank, and these are for different Cyanobacteriota.Can you take a look?
Thanks,
Kovacikia.fq.txt