Much lower number of classified reads compared to kraken2

bluenote-1577 commented 1 year ago

Hi there,

Thanks for building this software. I'm quite enjoying using it.

I had a question about benchmarking ganon. I am currently building a custom version of the GTDB-R89 database with ganon, and running simulated paired reads against this database. I get the following log

ganon classify -d {params[0]} -p {rf} {rb} -o {params[1]} -t 50 
- - - - - - - - - -
   _  _  _  _  _   
  (_|(_|| |(_)| |  
   _|   v. 1.9.0
- - - - - - - - - -
Classifying reads

ganon-classify processed 9991993 sequences (2997.6 Mbp) in 128.605 seconds (1398.52 Mbp/m)
 - 1128970 reads classified (11.2987%)
   - 926469 with unique matches (9.27211%)
   - 202501 with multiple matches (2.02663%)
 - 1768307 matches (avg. 1.5663 match/read classified)
 - 8863023 reads unclassified (88.7013%)
- - - - - - - - - -
Generating report(s)

Total valid files: 1

 - 2180 entries removed not in --ranks [superkingdom,phylum,class,order,family,genus,species,assembly]
 - 2798 entries removed with --min-count 5e-05
 - 226 entries reported
 - report saved to ganon_out/species96/species96_9.ganon.tre

Total elapsed time: 138.43 seconds.

As you can see, only 11% of my reads are assigned. The genomes that I am simulating the reads from are about 96% ANI on average compared to the database. Kraken2, on the other hand, gives me > 70% assigned reads

Loading database information... done.
9995339 sequences (2998.60 Mbp) processed in 33.726s (17782.0 Kseq/m, 5334.59 Mbp/m).
  7146061 sequences classified (71.49%)
  2849278 sequences unclassified (28.51%)

My questions:

Is this expected behavior?
Do you think I could be building the database incorrectly or something like that?

For reference, here is my --input-file for the custom build command:

kraken_genomes/RS_GCF_000970205.1.fa    genomes_0   23088
kraken_genomes/RS_GCF_000012285.1.fa    genomes_1   31455
kraken_genomes/GB_GCA_003162175.1.fa    genomes_2   14742
kraken_genomes/RS_GCF_000022385.1.fa    genomes_3   31457
kraken_genomes/GB_GCA_900083515.1.fa    genomes_4   16898
kraken_genomes/GB_GCA_002498385.1.fa    genomes_5   22923
kraken_genomes/GB_GCA_003139855.1.fa    genomes_6   14740
kraken_genomes/GB_GCA_002498845.1.fa    genomes_7   34310
kraken_genomes/GB_GCA_002499445.1.fa    genomes_8   23094
kraken_genomes/RS_GCF_000016525.1.fa    genomes_9   22936

And I'm using a custom nodes.dmp names.dmp generated for my GTDB-R89 database with the following build command:

ganon build-custom --input-file ganon-input.tsv --taxonomy-files nodes.dmp names.dmp --db-prefix r89-ganon --threads 60

pirovc commented 1 year ago

Glad you enjoy using ganon :) the default parameters prioritize precision over sensitivity, since it provided the best overall results for taxonomic profiling and abundance estimation in our benchmark datasets - that's why the lower number of matched reads.

I'm not sure what would be the exact value to match kraken2 results, but to increase number of matching reads, you'll need to increase the --rel-cutoff value in ganon classify. --rel-cutoff 0.25 is a good start, but you can get a better understanding of its effects in the documentation.

If you still have some issues or something is unclear, let me know.

bluenote-1577 commented 1 year ago

As long as it's just a parameter choice issue rather than a software issue, that sounds good. I'll play around with parameter settings a bit.

Thank you!

bluenote-1577 commented 1 year ago

Hi, I'm reopening this issue because I found another weird result in my output:

species 12559   1||110|282|619|1732|4254|12559  Acidovorax_D sp002754495    0   0   12207   12207   10.42414
species 32688   1||148|304|1017|3033|9703|32688 UBA1067 sp002449695 0   0   750 750 0.82632
species 31596   1||38|225|1105|1871|9320|31596  Synechococcus_C sp002698505 0   0   225 225 0.19444
species 31588   1||38|225|1105|1871|9320|31588  Synechococcus_C sp002171995 0   0   85  85  0.07345
species 12631   1||110|282|997|2493|4257|12631  Acinetobacter sp000369565   0   0   1351    1351    0.06641
species 32680   1||148|304|1017|3033|9703|32680 UBA1067 sp002351765 0   0   57  57  0.06280
species 14378   1||17|189|596|1690|4751|14378   Bacteroides_B massiliensis  0   0   58447   58447   0.05849
species 14381   1||17|189|596|1690|4751|14381   Bacteroides_B vulgatus  0   0   51146   51146   0.05118

As you can see, the Acidovorax_D species has 10.424 abundance, whereas the Bacteroides_B massiliensis has 0.05849 abundance. But it appears the second column means that 12207 "assignments" is designed to Acidovorax_D and 58447 to Bacteroides_D...

I checked and the genome sizes for these two genomes and they shouldn't be that different, so I'm not sure what is going on... It seems something suspicious is happening?

pirovc commented 1 year ago

ganon uses genome sizes from the database .tax file, did you check there (5th column)? That indeed should explain the difference in abundances. To test if that's actually the case you could also re-run ganon report with --report-type dist option, that would skip the genome size correction.

You could also send me the .tax from your database and the .rep from your run so I can further investigate.

bluenote-1577 commented 1 year ago

I'm looking at the .tax file right now, and you're right, the genome sizes differ greatly from my fasta files. Some of them are reporting 200 megabases, even though I'm using bacterial genomes. I can fix this myself... but any reason why this might be?

pirovc commented 1 year ago

the genome sizes are calculated during the ganon build process, more infos here. Are you using standard NCBI taxonomic ids in your custom taxonomy nodes.dmp and names.dmp files?

bluenote-1577 commented 1 year ago

I'm using non-standard nodes.dmp and names.dmp files. These are obtained from here https://github.com/rrwick/Metagenomics-Index-Correction. I think these have arbitrary taxids so if ganon used them for genome sizes, I can see why that'd be an issue.

Thanks for your help. I'll just manually change the .tax file myself and hopefully it'll work out... let me know if there are other pitfalls you can think of to using non-standard nodes.dmp and names.dmp files

pirovc commented 1 year ago

ganon supports GTDB taxonomy natively, so you could build your database using GTDB taxids in the 3rd row of your --input-file (GTDB taxids are species names, e.g. s__Acidovorax sp002754495 instead of your integer taxid) and also set --taxonomy gtdb. That way the genome sizes will be calculated accordingly.

Alternatively you could keep using the NCBI-like taxonomy and provide your own --genome-size-files in the same shape as the NCBI file but with your custom taxids and respective genome sizes.

I can't think of any other pitfall using a custom taxonomy, everything else downstream will be calculated based on the .tax file.

I also recommend using the --reassign option (or ganon reassign procedure), it seems to improve every result for me so far and it will be the default option in the next release.

bluenote-1577 commented 1 year ago

Thanks for the help! I'll close this comment for now since it seems to work OK now

pirovc / ganon

Much lower number of classified reads compared to kraken2 #261

My questions: