hsinnan75 / StrainPro

MIT License
7 stars 3 forks source link

nothing is mapping to RefSeq-bacteria #3

Closed nick-youngblut closed 4 years ago

nick-youngblut commented 4 years ago

I just mapped 0.5mil 150bp HiSeq reads from a human gut metagenome (previously profiled with kraken2; looked very normal for the sample time) to RefSeq-bacteria, and I'm getting the following output:

#TaxID                  #Read_count             #Depth                  #Relative_abundance     #Confidence_score
@TaxRank:subspecies
@TaxRank:species
@TaxRank:genus
@TaxRank:family
@TaxRank:order
@TaxRank:calss
@TaxRank:phylum
@TaxRank:kingdom

Steps to reproduce

git clone https://github.com/hsinnan75/StrainPro.git
cd StrainPro
./download_taxonomy.sh
./download_genomic_library.sh library bacteria  
# WARNING: the following cmd takes many hours to complete (even with the default 16 threads)!
./bin/StrainPro-build -r database/bacteria/ -o database/bacteria/ref_idx
./bin/StrainPro-map -i database/bacteria/ref_idx -f /path/to/metagenome/read1.fq.gz -o output

If I map the reads with kraken2 versus refseq, then I get a taxonomic distribution that looks normal. I tried mapping another sample with 10 mil reads instead of the 0.5 mil reads, and then I did get some output:

#TaxID                  #Read_count             #Depth                  #Relative_abundance     #Confidence_score
@TaxRank:subspecies
99822                   2395                    41                      7.43                    0.410877
411470                  5785                    28                      5.07                    0.282264
435591                  218                     25                      4.53                    0.255269
479831                  1269                    21                      3.80                    0.211359
499175                  52                      25                      4.53                    0.252427
536231                  965                     28                      5.07                    0.287117
553973                  50                      46                      8.33                    0.462963
679935                  2386                    30                      5.43                    0.297136
...
@TaxRank:phylum
976                     1162258                 35                      89.80                   1.000000
1224                    5506                    92                      0.43                    1.000000
1239                    122790                  38                      9.49                    1.000000
201174                  3682                    30                      0.28                    1.000000
@TaxRank:kingdom
2                       1294236                 36                      100.00                  1.000000

For the 0.5 mil read sample, I'm wondering why I didn't at least get some hits at the kingdom and phylum level. Even for this 10mil read file, I'm only getting 4 phyla. Is StrainPro just not very sensitive?

FYI: @TaxRank:calss is spelled wrong

hsinnan75 commented 4 years ago

StrainPro is more accurate when the read depth is deep enough. The default sensitivity is at least 20X. StrainPro only reports a taxid if its read depth is above the threshold. You may try to use "-depth" to specify the read depth. For example, if you run StrainPro with "-depth 10", StrainPro will be more sensitive to low coverage data. Another option, "-freq" is used to set the minimal read count for reporting a candidate taxid (the default is 50). The benchmark data sets we used in our manuscript were generated with 50X of read coverage.

I forgot to put the two options in the help message. I've updated a new version. I also corrected the typo. Thank you very much!

hsinnan75 commented 4 years ago

The reason that you did not get any hits at the kingdom and phylum level when using 0.5 mil read sample is that the read counts are based on the subspecies or species level. If StrainPro cannot identify any taxid at subspecies/species, it will not output any result. StrainPro was designed for identifying taxids at subspecies/species level.

nick-youngblut commented 4 years ago

Thanks for the explanation! So in the StrainPro biorxiv paper, the recall is ~1 for all simulated datasets. I guess that means that all taxa in each simulation had >= 20X coverage, correct? I might have missed it in the paper: is their a plot/table showing StrainPro precision/recall as a function of coverage, including coverages below 20X?

hsinnan75 commented 4 years ago

Yes, all taxa in each simulation had 50X coverage. Sorry, we did not evaluate the performance in terms of read coverage. We'll generate simulated data with different depths and evaluate the precision/recall accordingly. Thank you for the suggestion.

hsinnan75 commented 4 years ago

@nick-youngblut I generated another 4 datasets with different read depths. For dataset with N-fold, I ran GSAlign with arguments of "-depth N -freq N". Then I evaluated the performance of each dataset. Here is the result. The performance on each dataset is all identical.

Coverage Subspecies_Precision Subspecies_Recall Species_Precision Species_Recall Genus_Precision Genus_Recall
10 0.9756 1.0000 1.0000 1.0000 1.0000 1.0000
20 0.9756 1.0000 1.0000 1.0000 1.0000 1.0000
30 0.9756 1.0000 1.0000 1.0000 1.0000 1.0000
40 0.9756 1.0000 1.0000 1.0000 1.0000 1.0000
50 0.9756 1.0000 1.0000 1.0000 1.0000 1.0000
nick-youngblut commented 4 years ago

That's great! Thanks for running the test! Any idea where accuracy drops off and how quickly? For instance, does accuracy drop sharply at coverages below 10X or is accuracy still OK at 5X or even lower? I'm interested in the abundances of relatively rare species in human gut metagenomes, so it will be hard to get even 10X coverage.

hsinnan75 commented 4 years ago

Since StrainPro identifies taxa based on the mapping density in each representative sequence, it is very efficient to ignore false alignments. Suppose G1 is present and G2 is absent, then we can expect to see a certain number of reads will be mapped to each representative sequence of G1. And we can expect to see only a few reads mapped to representative sequences of G2.

True alignments: image

False alignments: image

hsinnan75 commented 4 years ago

@nick-youngblut I generated a dataset of depth 5X and ran StrainPro again. Its performance is identical to those on other datasets. However, I ran StrainPro with "-depth 5 -freq 5" on the dataset of depth 50X. The performance is shown below. It implies that if you use a low threshold for depth and read count on a dataset of deep depth, the precision on subspecies level detection would be not as good as high threshold.

Coverage Subspecies_Precision Subspecies_Recall Species_Precision Species_Recall Genus_Precision Genus_Recall
50X (-depth 5 -freq 5) 0.6452 1.0000 1.0000 1.0000 1.0000 1.0000
hsinnan75 commented 4 years ago

I forgot to mention that the database I used is all bacterial genomes in NCBI RefSeq database, and the testset is SimData_1 in our manuscript.

nick-youngblut commented 4 years ago

Thanks for running the additional test! So sub-species accuracy declines between 5-10X? What about the accuracy in estimating abundances (like the last table in your biorxiv paper)?

hsinnan75 commented 4 years ago

The taxonomic prediction is more reliable if the read coverage is higher. The previous experiment suggests that if you lower the threshold of minimal read depth and read count, it would probably introduce more false predictions. Regarding the estimating abundances, we have shown StrainPro is able to produce better estimation than existing tools since we estimate the abundances based on the mapping density rather than read counts.

It is noteworthy that we did not test StrainPro on any real datasets at strain level since we could not find any real datasets with highly confident annotation. We did test on some real datasets at phylum level and the prediction is identical to other method's. If you have suitable datasets, please let me know. Thank you!

nick-youngblut commented 4 years ago

Thanks for clarifying. We have previously run krakenuniq on some simulated datasets containing varying levels of intra-species diversity. The main goal of these simulations was to determine whether krakenuniq can accurately predict genome-level abundances for all genomes in a species/genus (eg., all pseudomonas genomes). Our simulations suggest that krakenuniq can predict genome-level abundances with pretty good accuracy, but it depends on the amount of intra-species diversity. We will compare StrainPro to krakenuniq.

In regards to the taxIDs, StrainPro can handle duplicate taxIDs in the reference dataset, correct? We can only provide taxIDs for our genomes at the species level, so multiple genomes have the same taxIDs. krakenuniq deals with this issue by creating pseudo-taxIDs.

hsinnan75 commented 4 years ago

Yes, StrainPro can handle duplicate taxIDs in the reference dataset. However, they will be considered as one entity. The abundance estimation at species level or above is based on the read count instead of mapping density.

nick-youngblut commented 4 years ago

So if we want genome-level abundances, then we would have to create pseudo-taxIDs and add them to the taxdump?

hsinnan75 commented 4 years ago

If two or more genomes belong to the same species in the sample community and are labelled with the same taxID, their abundance levels cannot be estimated separately since they are measured by taxIDs. Technically StrainPro can estimate genome-level abundances if we consider sequence ID rather than taxID. If you can create pseudo-taxIDs for each genome, then StrainPro can produce genome-level abundances. However, if the database is modified, it is necessary to re-build representative sequences.

nick-youngblut commented 4 years ago

Will StrainPro then ignore those pseudo-taxIDs when aggregating abundances at the species, genus, ..., kingdom levels?

hsinnan75 commented 4 years ago

When estimating abundances at species-level or above, we use all read counts that belong to its sub-ranks. Since all pseudo-taxIDs are considered to be at subspecies-level, their abundances will only appear at subspecies-level.

nick-youngblut commented 4 years ago

Since all pseudo-taxIDs are considered to be at subspecies-level, their abundances will only appear at subspecies-level.

Why not aggregate sub-species to the species level (and courser levels)?

Given the lack of support for genome-level abundances, I'm currently inclined to think that krakenuniq is more flexible than StrainPro.

hsinnan75 commented 4 years ago

StrainPro predicts all-level abundances. Since pseudo-taxIDs are only assigned to sub-species, the abundance for pseudo-taxIDs will combine together to estimate the abundance of upper-levels.