Consider default parameters for profilers

Midnighter commented 1 year ago

Description of feature

I don't have a lot of experience with all the tools so I can only speak for kraken2 and Bracken:

kraken2

It can be disastrous to not set the --confidence parameter, see this discussion and other issues on the repo. So my vote is to use a rather strict value.
A newly introduced parameter --minimum-hit-groups should also be set, see here.

Bracken

Bracken has the parameter -t which sets the minimum number of reads that kraken2 must have assigned for the taxon to be considered in the redistribution of reads. This is set to 10 by default which is well within the typical numbers of reads that I see as false positives assigned by kraken2.
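To make the effect of the threshold concrete, here is a minimal Python sketch of the filtering step described above. It is not Bracken's actual code: the function name, the example counts, and the assumption that a taxon with exactly the threshold number of reads is kept are all illustrative.

```python
def taxa_passing_threshold(read_counts, threshold=10):
    """Keep only taxa whose assigned read count reaches the threshold.

    read_counts: dict mapping a taxon name to the number of reads that
    kraken2 assigned to it (hypothetical input format).
    threshold: plays the role of Bracken's -t (10 by default per the
    description above). The inclusive comparison is an assumption.
    """
    return {taxon: n for taxon, n in read_counts.items() if n >= threshold}


# Made-up counts: one abundant taxon and two low-count candidates.
example_counts = {"taxon_a": 5000, "taxon_b": 12, "taxon_c": 4}

default_kept = taxa_passing_threshold(example_counts)       # -t 10
strict_kept = taxa_passing_threshold(example_counts, 50)    # a stricter -t
```

With the default of 10, the 12-read taxon still takes part in the redistribution; a stricter value removes it.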

Please add your points of view and expand for other tools where you have experience.

jfy133 commented 1 year ago

For MALT (the only one I really have much experience with), which is presumably only going to be used by people in aDNA, I would use the settings here:

The maximal E-value was set to 1.0. The maximal number of alignments for each query was set to 100. The minimal percent identity was set to 85. The number of threads was set to 32. The alignment type of MALT was set to Local in order to be comparable to the other programs.

But possibly increase % identity to 90

https://www.biorxiv.org/content/10.1101/050559v1.full

sofstam commented 1 year ago

I do not have experience with KrakenUniq, but I came across this discussion in their GitHub repo:

https://github.com/fbreitwieser/krakenuniq/issues/112

Midnighter commented 1 year ago

Just saw this in our config:

    shortread_qc_minlength           = 15

Considering that the k-mer-based profilers use a k-mer length of around 35 by default, this is way too short. Maybe a default of 45 or so?
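A quick way to see the problem, as a rough Python sketch: a read of length L contains L - k + 1 overlapping k-mers, so anything shorter than k contributes none. The function name is made up for illustration, and k = 35 is taken from the "around 35" figure above; real tools may differ in details such as minimizers or spaced seeds.

```python
def kmers_per_read(read_length, k=35):
    """Number of overlapping k-mers in a read of the given length.

    A read shorter than k contains no complete k-mer, so the result
    is clamped at zero.
    """
    return max(0, read_length - k + 1)


# Compare the current minimum length, k itself, and the proposed 45.
for length in (15, 35, 45):
    print(length, kmers_per_read(length))
```

Under these assumptions a 15 bp read yields no k-mers at all, a 35 bp read yields exactly one, and a 45 bp read yields 11.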

jfy133 commented 1 year ago

I'm not sure about this; some of the tools are alignment-based, for which reads that short could still be valid (I think I based 15 on the default of one of the tools, but unfortunately I can't remember which one now).

sofstam commented 1 year ago

Regarding kaiju:

https://github.com/bioinformatics-centre/kaiju/issues/209#issuecomment-1020019707

Midnighter commented 1 year ago

The conclusion from your link to Kaiju would be to run it in non-greedy mode? Or at least play around with that to see what difference it makes.

sofstam commented 1 year ago

I was thinking of playing around with it first and then seeing whether we should consider a default parameter for kaiju.

LilyAnderssonLee commented 1 year ago

For kraken2

Confidence

I have tested kraken2 with a set of confidence values (0, 0.1, 0.3, 0.5, 0.7, 0.9) for a validated dataset in which we know the true number of reads of the target virus.

The results indicate that confidence=0.1 would be the best choice for samples with low concentration; confidence>0.1 usually fails to identify the target virus; confidence=0 also performs reasonably well at identification, but with more false positives.

For viruses with high concentration, confidence values of 0, 0.1, 0.3, 0.5 and even 0.7 perform quite well. Lower confidence values assigned more reads to the target viruses, but the counts were still lower than the true value.

So I would like to go for confidence=0.1 if the research question is just to identify certain viruses.
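For context on what the threshold does, here is a simplified Python sketch of the confidence scoring described in the Kraken 2 manual: a label's score is the fraction of queried k-mers that hit the clade rooted at that label, and the label moves up to its parent until the score meets the threshold (or the read becomes unclassified). The taxonomy, taxon IDs, and hit counts below are hypothetical, and this is not kraken2's actual implementation.

```python
def clade_hits(label, hits, parent):
    """Count k-mers whose taxon lies in the clade rooted at `label`."""
    total = 0
    for taxon, n in hits.items():
        node = taxon
        while node is not None:
            if node == label:
                total += n
                break
            node = parent.get(node)
    return total


def apply_confidence(label, hits, parent, threshold):
    """Move `label` up the tree until its score reaches `threshold`.

    hits: k-mer counts per taxon, with 0 meaning "no database hit".
    parent: child -> parent map; the root maps to None.
    Returns 0 (unclassified) if no ancestor reaches the threshold.
    """
    queried = sum(hits.values())
    node = label
    while node is not None:
        if clade_hits(node, hits, parent) / queried >= threshold:
            return node
        node = parent.get(node)
    return 0


# Hypothetical lineage: 1 (root) -> 10 -> 11 -> 12
parent = {1: None, 10: 1, 11: 10, 12: 11}
# Hypothetical k-mer hits for one read: most hit 10, one hits 12.
hits = {0: 19, 10: 101, 11: 13, 12: 1}

for threshold in (0.0, 0.1, 0.5, 0.9):
    print(threshold, apply_confidence(12, hits, parent, threshold))
```

In this toy case the initial label 12 is supported by a single k-mer. It survives only at threshold 0, moves up to 11 at 0.1, to 10 at 0.5, and the read ends up unclassified at 0.9.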

--minimum-hit-groups

The default value is 2 in kraken2 v2.1.2.

This value could vary quite a lot depending on the research question. If you have a highly diverse dataset or are interested in detecting rare taxa, you may want to use a lower value for --minimum-hit-groups to increase sensitivity. On the other hand, if you have a relatively simple dataset or are interested in detecting only highly abundant taxa, you may want to choose a higher value to increase specificity.

I think default=2 is a good option in general.

sofstam commented 1 year ago

From Slack: "From defaults to databases: parameter and database choice dramatically impact the performance of metagenomic taxonomic classification tools" https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000949

Midnighter commented 1 year ago

I'm not quite sure about the minimum hit groups right now but I just stumbled across an example read assignment that makes me even more sure that a low confidence value is a bad idea.

    C   A01136:446:H7J2YDSX5:4:2362:26078:14246 1173    101|101 18:46 22:8 18:7 22:5 1173:1 |:| 0:13 18:5 0:6 18:43

You can see here that a single k-mer is assigned to taxon 1173 whereas the majority is clearly taxon 18.
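The per-taxon totals can be checked with a few lines of Python. This is only a sketch for tallying the last column of kraken2's per-read output (space-separated `taxon:count` pairs, with `|:|` separating the two mates); the function name is made up for illustration.

```python
from collections import Counter


def tally_kmer_hits(kmer_field):
    """Sum k-mer counts per taxon from the last kraken2 output column."""
    counts = Counter()
    for token in kmer_field.split():
        if token == "|:|":  # separator between the two mates
            continue
        taxon, n = token.split(":")
        counts[taxon] += int(n)
    return counts


field = "18:46 22:8 18:7 22:5 1173:1 |:| 0:13 18:5 0:6 18:43"
kmer_totals = tally_kmer_hits(field)

# Print taxa from most to least supported.
for taxon, n in kmer_totals.most_common():
    print(taxon, n)
```

For the read above this gives 101 k-mers for taxon 18, 19 with no hit (taxon 0), 13 for taxon 22, and a single k-mer for taxon 1173, out of 134 in total.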

LilyAnderssonLee commented 1 year ago

I agree that --minlength = 15 is too short. A value of around 30-35 could be a reasonable choice. It was set to 50 for Illumina data in nf-core/viralrecon. There are also a few cases that used 30, for instance: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1927-y https://www.diva-portal.org/smash/get/diva2:1607219/FULLTEXT01.pdf

Midnighter commented 1 year ago

I'm somewhat confused by your comment since --minlength is yet another option that we didn't discuss so far.

LilyAnderssonLee commented 1 year ago

I apologize for any confusion. In your previous comments, you mentioned the parameter shortread_qc_minlength = 15, which is used in both Fastp and AdapterRemoval to specify the minimum length of reads to be retained after quality control filtering. I have read some articles about how this value is chosen.

Midnighter commented 1 year ago

Okay, now I see which comment you're referring to 🙂

nf-core / taxprofiler