Closed nick-youngblut closed 4 years ago
StrainPro is more accurate when the read depth is deep enough. The default sensitivity threshold is 20X: StrainPro only reports a taxid if its read depth is above that threshold. You can use "-depth" to lower it; for example, running StrainPro with "-depth 10" makes it more sensitive to low-coverage data. Another option, "-freq", sets the minimal read count for reporting a candidate taxid (default: 50). The benchmark datasets in our manuscript were generated with 50X read coverage.
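As a rough illustration of how the two thresholds interact, the reporting rule described above can be sketched as a simple filter (a hypothetical helper, not StrainPro's actual code; the function name and signature are assumptions):

```python
# Sketch of the reporting rule: a candidate taxid is reported only if both
# its mean read depth and its read count clear the thresholds
# (defaults corresponding to "-depth 20" and "-freq 50").
def report_taxid(mean_depth, read_count, depth_threshold=20, freq_threshold=50):
    """Return True if the candidate taxid passes both filters."""
    return mean_depth >= depth_threshold and read_count >= freq_threshold

# With the defaults, a taxid at 10X depth is filtered out even with many
# reads; lowering the depth threshold (as with "-depth 10") keeps it.
```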
I forgot to include the two options in the help message; I've released an updated version. I also corrected the typo. Thank you very much!
The reason you did not get any hits at the kingdom and phylum levels with the 0.5 mil read sample is that read counts are tallied at the subspecies or species level. If StrainPro cannot identify any taxid at the subspecies/species level, it will not output any result. StrainPro was designed for identifying taxids at the subspecies/species level.
Thanks for the explanation! So in the StrainPro bioRxiv paper, the recall is ~1 for all simulated datasets. I guess that means all taxa in each simulation had >= 20X coverage, correct? I might have missed it in the paper: is there a plot/table showing StrainPro precision/recall as a function of coverage, including coverages below 20X?
Yes, all taxa in each simulation had 50X coverage. Sorry, we did not evaluate the performance in terms of read coverage. We'll generate simulated data with different depths and evaluate the precision/recall accordingly. Thank you for the suggestion.
@nick-youngblut I generated another 4 datasets with different read depths. For the dataset with N-fold coverage, I ran StrainPro with the arguments "-depth N -freq N", then evaluated the performance on each dataset. Here are the results; the performance is identical on all datasets.
Coverage | Subspecies_Precision | Subspecies_Recall | Species_Precision | Species_Recall | Genus_Precision | Genus_Recall |
---|---|---|---|---|---|---|
10 | 0.9756 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
20 | 0.9756 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
30 | 0.9756 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
40 | 0.9756 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
50 | 0.9756 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
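For reference, precision and recall at a given rank reduce to set arithmetic over the predicted vs. true taxid sets (a generic sketch, not the actual evaluation script; the example counts are chosen only to be consistent with the table above):

```python
def precision_recall(predicted, truth):
    """Compute (precision, recall) from sets of predicted and true taxids."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# Example: 40 true taxa all recovered plus one false positive gives
# precision 40/41 ≈ 0.9756 and recall 1.0, matching the table's
# subspecies row (the count of 40 taxa is illustrative only).
```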
That's great! Thanks for running the test! Any idea where accuracy drops off and how quickly? For instance, does accuracy drop sharply at coverages below 10X or is accuracy still OK at 5X or even lower? I'm interested in the abundances of relatively rare species in human gut metagenomes, so it will be hard to get even 10X coverage.
Since StrainPro identifies taxa based on the mapping density across each representative sequence, it is very effective at ignoring false alignments. Suppose G1 is present and G2 is absent: we expect a substantial number of reads mapped across each representative sequence of G1, but only a few reads mapped to the representative sequences of G2.
True alignments: [figure]
False alignments: [figure]
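The true-vs-false distinction can be sketched as a coverage-spread check over a representative sequence (an illustrative toy, not StrainPro's actual statistic; the function names and the 0.5 cutoff are assumptions):

```python
def covered_fraction(read_starts, seq_length, read_length=150):
    """Fraction of a representative sequence covered by mapped reads."""
    covered = set()
    for start in read_starts:
        covered.update(range(start, min(start + read_length, seq_length)))
    return len(covered) / seq_length

# A truly present genome yields reads spread across the whole representative
# sequence (high covered fraction); false alignments cluster in a few
# conserved regions (low covered fraction).
def looks_present(read_starts, seq_length, min_fraction=0.5):
    return covered_fraction(read_starts, seq_length) >= min_fraction
```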
@nick-youngblut I generated a dataset at 5X depth and ran StrainPro again; its performance is identical to that on the other datasets. However, I also ran StrainPro with "-depth 5 -freq 5" on the 50X dataset. The performance is shown below. It implies that using low depth and read-count thresholds on a deeply sequenced dataset hurts subspecies-level precision compared to the higher default thresholds.
Coverage | Subspecies_Precision | Subspecies_Recall | Species_Precision | Species_Recall | Genus_Precision | Genus_Recall |
---|---|---|---|---|---|---|
50X (-depth 5 -freq 5) | 0.6452 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
I forgot to mention that the database I used contains all bacterial genomes in the NCBI RefSeq database, and the test set is SimData_1 from our manuscript.
Thanks for running the additional test! So sub-species accuracy declines somewhere between 5X and 10X? What about the accuracy in estimating abundances (like the last table in your bioRxiv paper)?
Taxonomic prediction is more reliable at higher read coverage. The previous experiment suggests that lowering the minimal read depth and read count thresholds will likely introduce more false predictions. Regarding abundance estimation, we have shown that StrainPro produces better estimates than existing tools, since we estimate abundances from mapping density rather than read counts.
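The difference between count-based and density-based estimation can be illustrated with a toy sketch (not StrainPro's implementation; the normalization by representative-sequence length is the key idea being shown):

```python
def abundance_by_read_count(read_counts):
    """Naive relative abundance: each taxon's share of total mapped reads."""
    total = sum(read_counts.values())
    return {taxid: n / total for taxid, n in read_counts.items()}

def abundance_by_density(read_counts, rep_lengths):
    """Density-based: normalize read counts by representative-sequence
    length first, so taxa with longer representatives are not inflated."""
    density = {t: read_counts[t] / rep_lengths[t] for t in read_counts}
    total = sum(density.values())
    return {t: d / total for t, d in density.items()}
```

For two taxa at equal true abundance whose representatives differ 2x in length, the count-based estimate skews 1:2 while the density-based estimate correctly returns 1:1.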
It is noteworthy that we have not tested StrainPro on any real datasets at the strain level, since we could not find real datasets with highly confident annotations. We did test on some real datasets at the phylum level, and the predictions were identical to other methods'. If you have suitable datasets, please let me know. Thank you!
Thanks for clarifying. We have previously run krakenuniq on some simulated datasets containing varying levels of intra-species diversity. The main goal of these simulations was to determine whether krakenuniq can accurately predict genome-level abundances for all genomes in a species/genus (e.g., all Pseudomonas genomes). Our simulations suggest that krakenuniq can predict genome-level abundances with pretty good accuracy, but it depends on the amount of intra-species diversity. We will compare StrainPro to krakenuniq.
Regarding taxIDs: StrainPro can handle duplicate taxIDs in the reference dataset, correct? We can only provide taxIDs for our genomes at the species level, so multiple genomes have the same taxID. krakenuniq deals with this issue by creating pseudo-taxIDs.
Yes, StrainPro can handle duplicate taxIDs in the reference dataset; however, genomes sharing a taxID are treated as one entity. Abundance estimation at the species level or above is based on read counts rather than mapping density.
So if we want genome-level abundances, then we would have to create pseudo-taxIDs and add them to the taxdump?
If two or more genomes in the sample community belong to the same species and are labelled with the same taxID, their abundances cannot be estimated separately, since everything is measured per taxID. Technically, StrainPro could estimate genome-level abundances by tracking sequence IDs rather than taxIDs. If you create a pseudo-taxID for each genome, StrainPro can produce genome-level abundances. However, whenever the database is modified, the representative sequences must be re-built.
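A minimal way to mint such pseudo-taxIDs, in the spirit of the krakenuniq workaround mentioned above, might look like this (the `nodes.dmp`-style column layout and the starting ID range are assumptions; verify against your taxdump before use):

```python
def make_pseudo_taxids(genomes_by_species, start_id=1_000_000_000):
    """Assign each genome its own pseudo-taxID, parented to its species taxid.

    genomes_by_species: {species_taxid: [genome_accession, ...]}
    Returns ({accession: pseudo_taxid}, [nodes.dmp-style rows]).
    start_id is an arbitrary range assumed not to collide with NCBI taxids.
    """
    mapping, nodes = {}, []
    next_id = start_id
    for species_taxid, genomes in sorted(genomes_by_species.items()):
        for acc in genomes:
            mapping[acc] = next_id
            # Assumed nodes.dmp columns: taxid | parent taxid | rank |
            nodes.append(f"{next_id}\t|\t{species_taxid}\t|\tsubspecies\t|")
            next_id += 1
    return mapping, nodes
```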
Will StrainPro then ignore those pseudo-taxIDs when aggregating abundances at the species, genus, ..., kingdom levels?
When estimating abundances at the species level or above, we use all read counts belonging to the sub-ranks. Since all pseudo-taxIDs are treated as subspecies-level, their abundances will only appear at the subspecies level.
Why not aggregate sub-species to the species level (and coarser levels)?
Given the lack of support for genome-level abundances, I'm currently inclined to think that krakenuniq is more flexible than StrainPro.
StrainPro predicts abundances at all levels. Since pseudo-taxIDs are assigned at the sub-species level only, their abundances are combined to estimate the abundances of the upper levels.
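That roll-up of sub-species (pseudo-taxID) abundances to upper ranks can be sketched as a walk up a parent map (a generic illustration, not StrainPro code; the taxids in the example are arbitrary):

```python
def aggregate_abundances(leaf_abundance, parent):
    """Sum leaf (sub-species / pseudo-taxID) abundances into every
    ancestor taxid, walking the parent map up to the root.

    leaf_abundance: {leaf_taxid: abundance}
    parent: {taxid: parent_taxid} (root taxids absent from the map)
    """
    totals = dict(leaf_abundance)
    for taxid, abund in leaf_abundance.items():
        node = parent.get(taxid)
        while node is not None:
            totals[node] = totals.get(node, 0.0) + abund
            node = parent.get(node)
    return totals
```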
I just mapped 0.5 mil 150bp HiSeq reads from a human gut metagenome (previously profiled with kraken2; the profile looked very normal for the sample type) against RefSeq-bacteria, and I'm getting the following output:
Steps to reproduce
If I map the reads with kraken2 against RefSeq, I get a taxonomic distribution that looks normal. I tried mapping another sample with 10 mil reads instead of 0.5 mil, and then I did get some output:
For the 0.5 mil read sample, I'm wondering why I didn't at least get some hits at the kingdom and phylum levels. Even for the 10 mil read file, I'm only getting 4 phyla. Is StrainPro just not very sensitive?
FYI: `@TaxRank:calss` is spelled wrong.