steineggerlab / Metabuli

Metabuli: specific and sensitive metagenomic classification via joint analysis of DNA and amino acid.
GNU General Public License v3.0
118 stars 10 forks source link

Species missing from Metabuli, but detected with Kraken 1/2 #29

Closed sjaenick closed 1 year ago

sjaenick commented 1 year ago

Hi @jaebeom-kim,

I'm attaching a FASTQ file with sequences from a public dataset; applying Kraken 1 or Kraken 2, these are getting classified as 'Kovacikia minuta' at the species level. The dataset is from a microbial mat obtained from corals, and these are rich in Cyanobacteria, so the assignment to Kovacikia seems to make sense. (With both Kraken versions, Kovacikia is actually the dominant genus in this dataset.)

Using Metabuli with the RefSeq217 database, I don't get any Kovacikia hits at all; according to the database-report workflow, this genome is present in the database, though. Instead, there are very few reads classified below 'class' rank, and these are for different Cyanobacteriota.

Can you take a look?

Thanks,

Kovacikia.fq.txt

jaebeom-kim commented 1 year ago

Hmm.. that's interesting. What DB of kraken1/2 did you use?

sjaenick commented 1 year ago

We're creating our own databases, see

https://github.com/MGX-metagenomics/databases/blob/master/kraken.build https://github.com/MGX-metagenomics/databases/blob/master/kraken2.build

I just found it strange that both Kraken versions are confident enough to assign at the species level, while at the same time Kovacikia isn't detected by Metabuli at all.

sjaenick commented 1 year ago

Just to verify - I downloaded the genome FASTA for https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP083582.1 and split it up into short fragments with cut_up_fasta.py from CONCOCT (https://github.com/BinPro/CONCOCT/blob/develop/scripts/cut_up_fasta.py).

cut_up_fasta.py -c 300 -o 0 Kovacikia_genome.fas > Kovacikia_genome_fragmented.fas

Current git master doesn't compile, so I compiled Metabuli based on commit a449109.

metabuli classify --seq-mode 1 Kovacikia_genome_fragmented.fas /vol/biodb/local_databases/MGX/metabuli/refseq_release217/ Kovacikia_genome_fragmented job --min-sp-score 0.5 --min-score 0.15 --threads 10

(No 'overflow!!!' warnings here)

And there it is:

99.9589 24317   24317   strain  2653194                       Kovacikia minuta CCNUW1

So, this is interesting. Is it Metabuli unable to identify Kovacikia in the reads, or is it Kraken 1 and 2 reporting a false positive in high abundance? I don't know.

jaebeom-kim commented 1 year ago

I checked the Kovacikia.fq.txt, and most of them seem like very conserved rRNA. Metabuli found many species matching the reads, so the reads were classified to LCA of the species. So most of the reads were classified at the superkingdom or class rank.

Kraken2 also classifies the reads to LCA of equally good matching species, so the same situation should occur with the same DB. But genomes included in your Kraken2 DB are different to RefSeq release 217.

I have Kraken2 DB and Metabuli DB that have been built using the same genomes, so I tested using the DBs.

## Metabuli
0.4000  4   4   no rank 0   unclassified
99.7000 997 0   no rank 1   root
99.7000 997 0   superkingdom    8034      d__Bacteria
99.7000 997 0   phylum  154651      p__Cyanobacteria
99.7000 997 128 class   154652        c__Cyanobacteriia
86.9000 869 708 order   163398          o__Cyanobacteriales
15.5000 155 0   family  203425            f__Coleofasciculaceae
15.5000 155 0   genus   287014              g__PCC7113
15.5000 155 0   species 287015                s__PCC7113 sp000317515
15.5000 155 155 subspecies  287016                  RS_GCF_000317515.1
0.4000  4   0   family  181590            f__Microcoleaceae
0.2000  2   1   genus   188308              g__Limnospira
0.1000  1   1   species 188309                s__Limnospira fusiformis
0.2000  2   0   genus   262813              g__Trichodesmium
0.2000  2   0   species 283287                s__Trichodesmium erythraeum
0.2000  2   2   subspecies  283288                  RS_GCF_000014265.1
0.1000  1   0   family  166780            f__Nostocaceae
0.1000  1   0   genus   316381              g__HT-58-2
0.1000  1   0   species 316382                s__HT-58-2 sp002163975
0.1000  1   1   subspecies  316383                  RS_GCF_002163975.1
0.1000  1   0   family  189803            f__Cyanobacteriaceae
0.1000  1   0   genus   189804              g__Limnothrix
0.1000  1   1   species 189805                s__Limnothrix sp001693275
## Kraken2
  1.50  15  15  U   0   unclassified
 98.50  985 0   R   1   root
 98.50  985 0   D   8034      d__Bacteria
 98.50  985 0   P   154651      p__Cyanobacteria
 98.50  985 0   C   154652        c__Cyanobacteriia
 98.50  985 877 O   163398          o__Cyanobacteriales
 10.30  103 0   F   203425            f__Coleofasciculaceae
 10.30  103 0   G   287014              g__PCC7113
 10.30  103 0   S   287015                s__PCC7113 sp000317515
 10.30  103 103 S1  287016                  RS_GCF_000317515.1
  0.30  3   0   F   166780            f__Nostocaceae
  0.20  2   0   G   242292              g__Aulosira
  0.20  2   0   S   287680                s__Aulosira sp002368175
  0.20  2   2   S1  287681                  RS_GCF_002368175.1
  0.10  1   0   G   316381              g__HT-58-2
  0.10  1   0   S   316382                s__HT-58-2 sp002163975
  0.10  1   1   S1  316383                  RS_GCF_002163975.1
  0.20  2   0   F   181590            f__Microcoleaceae
  0.20  2   0   G   262813              g__Trichodesmium
  0.20  2   0   S   283287                s__Trichodesmium erythraeum
  0.20  2   2   S1  283288                  RS_GCF_000014265.1

Both Metabuli and Kraken2 classified most reads at the ranks above family. And the RS_GCF_000317515.1 was the most abundant strain in both results. So, I think the difference you found is due to using DBs built from different genomes.

sjaenick commented 1 year ago

Interesting, thanks for crosschecking.