Closed bluenote-1577 closed 1 year ago
Glad you enjoy using ganon :) The default parameters prioritize precision over sensitivity, since that gave the best overall results for taxonomic profiling and abundance estimation in our benchmark datasets. That's why the number of matched reads is lower.
I'm not sure what the exact value would be to match the kraken2 results, but to increase the number of matching reads, you'll need to adjust the --rel-cutoff value in ganon classify. --rel-cutoff 0.25 is a good start, but you can get a better understanding of its effects in the documentation.
If you still have some issues or something is unclear, let me know.
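A quick way to explore the suggestion above is to run the same reads at a few cutoff values and compare the number of matched reads. The following is only a dry-run sketch: it prints the commands rather than running them. The database prefix and thread count are taken from the build command in this thread; the read file names, the output prefixes, and the helper name print_cutoff_sweep are hypothetical placeholders.

```shell
# Dry-run sketch: print one "ganon classify" command per --rel-cutoff value.
# Assumptions: reads_1.fq / reads_2.fq and the sweep_* output prefixes are
# placeholders; r89-ganon is the database prefix used elsewhere in this thread.
print_cutoff_sweep() {
    for cutoff in "$@"; do
        echo "ganon classify --db-prefix r89-ganon --paired-reads reads_1.fq reads_2.fq --rel-cutoff $cutoff --output-prefix sweep_$cutoff --threads 60"
    done
}

# Print the commands for three illustrative cutoff values.
print_cutoff_sweep 0.5 0.25 0.1
```

Piping the output to sh would execute the runs one after another.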
As long as it's just a parameter choice issue rather than a software issue, that sounds good. I'll play around with parameter settings a bit.
Thank you!
Hi, I'm reopening this issue because I found another weird result in my output:
species 12559 1||110|282|619|1732|4254|12559 Acidovorax_D sp002754495 0 0 12207 12207 10.42414
species 32688 1||148|304|1017|3033|9703|32688 UBA1067 sp002449695 0 0 750 750 0.82632
species 31596 1||38|225|1105|1871|9320|31596 Synechococcus_C sp002698505 0 0 225 225 0.19444
species 31588 1||38|225|1105|1871|9320|31588 Synechococcus_C sp002171995 0 0 85 85 0.07345
species 12631 1||110|282|997|2493|4257|12631 Acinetobacter sp000369565 0 0 1351 1351 0.06641
species 32680 1||148|304|1017|3033|9703|32680 UBA1067 sp002351765 0 0 57 57 0.06280
species 14378 1||17|189|596|1690|4751|14378 Bacteroides_B massiliensis 0 0 58447 58447 0.05849
species 14381 1||17|189|596|1690|4751|14381 Bacteroides_B vulgatus 0 0 51146 51146 0.05118
As you can see, Acidovorax_D sp002754495 has an abundance of 10.424, whereas Bacteroides_B massiliensis has an abundance of 0.05849. But the second-to-last column seems to say that 12207 "assignments" went to Acidovorax_D and 58447 to Bacteroides_B...
I checked the genome sizes for these two genomes, and they shouldn't be that different, so I'm not sure what is going on... It seems something suspicious is happening?
ganon uses genome sizes from the database .tax file, did you check there (5th column)? That indeed should explain the difference in abundances. To test if that's actually the case you could also re-run ganon report with the --report-type dist option, that would skip the genome size correction.
You could also send me the .tax from your database and the .rep from your run so I can further investigate.
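To check the .tax file quickly, something like the following sketch could list entries whose genome size looks implausible. The only thing taken from this thread is that the genome size sits in the 5th column. The tab-separated layout, the node id in column 1, the name in column 4, the 20 Mbp threshold, and the function name are all assumptions for illustration.

```shell
# Sketch: print node, name and genome size for .tax entries above a size limit.
# Assumptions: tab-separated file, node id in column 1, name in column 4,
# genome size in column 5 (the 5th column is stated in this thread).
flag_large_genomes() {
    tax_file="$1"
    limit="${2:-20000000}"   # arbitrary illustrative bound for bacterial genomes
    awk -F'\t' -v limit="$limit" \
        '($5 + 0) > (limit + 0) { print $1 "\t" $4 "\t" $5 }' "$tax_file"
}

# Hypothetical usage:
# flag_large_genomes r89-ganon.tax 20000000
```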
I'm looking at the .tax file right now, and you're right, the genome sizes differ greatly from my fasta files. Some of them are reporting 200 megabases, even though I'm using bacterial genomes. I can fix this myself... but any reason why this might be?
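For comparing against the FASTA files, a minimal sketch like this could compute the total sequence length of one genome. It assumes an uncompressed FASTA file and counts every non-header character (including N), so it is only a rough stand-in for however ganon derives its sizes. The function name is hypothetical.

```shell
# Sketch: sum the lengths of all sequence lines in an uncompressed FASTA file.
# Header lines (starting with ">") are skipped; stray whitespace is removed.
fasta_total_length() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); total += length($0) }
         END   { printf "%.0f\n", total }' "$1"
}

# Hypothetical usage:
# fasta_total_length genome.fna
```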
The genome sizes are calculated during the ganon build process, more info here. Are you using standard NCBI taxonomic ids in your custom taxonomy nodes.dmp and names.dmp files?
I'm using non-standard nodes.dmp and names.dmp files. These are obtained from here https://github.com/rrwick/Metagenomics-Index-Correction. I think these have arbitrary taxids so if ganon used them for genome sizes, I can see why that'd be an issue.
Thanks for your help. I'll just manually change the .tax file myself and hopefully it'll work out... let me know if there are other pitfalls you can think of to using non-standard nodes.dmp and names.dmp files
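The manual fix could be scripted along these lines. This is a sketch under assumptions, not ganon's own procedure: it assumes a tab-separated .tax file with the node id in column 1 and the genome size in column 5, plus a hand-made two-column map file (node, size). It leaves nodes that are missing from the map untouched, so sizes on parent nodes would still need separate attention. The function and file names are hypothetical.

```shell
# Sketch: overwrite the genome size (5th column) of a .tax file from a
# two-column "node<TAB>size" map. Output goes to stdout; the input is not modified.
# Caveat: the map file must not be empty, or the NR == FNR test misfires.
patch_tax_sizes() {
    size_map="$1"
    tax_file="$2"
    awk -F'\t' -v OFS='\t' '
        NR == FNR    { size[$1] = $2; next }  # first file: load node -> size
        ($1 in size) { $5 = size[$1] }        # second file: replace column 5
                     { print }
    ' "$size_map" "$tax_file"
}

# Hypothetical usage:
# patch_tax_sizes sizes.tsv r89-ganon.tax > r89-ganon.fixed.tax
```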
ganon supports GTDB taxonomy natively, so you could build your database using GTDB taxids in the 3rd column of your --input-file (GTDB taxids are species names, e.g. s__Acidovorax sp002754495 instead of your integer taxid) and also set --taxonomy gtdb. That way the genome sizes will be calculated accordingly.
Alternatively you could keep using the NCBI-like taxonomy and provide your own --genome-size-files in the same shape as the NCBI file but with your custom taxids and respective genome sizes.
I can't think of any other pitfall using a custom taxonomy, everything else downstream will be calculated based on the .tax file.
I also recommend using the --reassign option (or ganon reassign procedure), it seems to improve every result for me so far and it will be the default option in the next release.
Thanks for the help! I'll close this comment for now since it seems to work OK now
Hi there,
Thanks for building this software. I'm quite enjoying using it.
I had a question about benchmarking ganon. I am currently building a custom version of the GTDB-R89 database with ganon, and running simulated paired reads against this database. I get the following log
As you can see, only 11% of my reads are assigned. The genomes that I am simulating the reads from are about 96% ANI on average compared to the database. Kraken2, on the other hand, gives me > 70% assigned reads
My questions:
For reference, here is my --input-file for the custom build command:
And I'm using a custom nodes.dmp names.dmp generated for my GTDB-R89 database with the following build command:
ganon build-custom --input-file ganon-input.tsv --taxonomy-files nodes.dmp names.dmp --db-prefix r89-ganon --threads 60