vrmarcelino / CCMetagen

Microbiome classification pipeline
GNU General Public License v3.0
64 stars 19 forks source link

Underestimation possibility of abundance while using db containing subsequences? #23

Closed mihkelvaher closed 3 years ago

mihkelvaher commented 3 years ago

Hi!

This is a follow-up to the abundance calculation question (#16).

1. If the abundance of a taxon is calculated by summing all aligned nucleotides from the corresponding reads divided by the sum of the templates AND the templates overlap, is the overall abundance underestimated since the sum of the templates could, in theory, be even longer than the genome?

I think the question is relevant when using the 'nt' database. For example if we take Mus musculus - it has both the whole chromosomes and the separate genes in the database. The reads can be mapped on a chromosome OR on a gene that is a subsequence of the chromosome and by summing their lengths be get more than the initial chromosome size. Is my reasoning correct or is there something to mitigate this effect?

A cumbersome solution would be to look at the database one species at a time and see if any template is (almost wholly) a subsequence of another. I'm hoping that I missed something and it's not really needed.

2. In my case, I'm trying to find the organism abundance in the sample i.e. relative number of organisms not anything related to read counts or genome size (== the default abundance reported?). Moreover, in addition to the previous problem described, there are many cases where some species have whole genomes whereas some only have a few genes. Aside from more unclassified reads, could the abundance estimations be askew towards one or the other species?

3. Would it make sense all of the abundancies in the output .tsv file (input for ktImportText) and divide them by their total sum * 100% so that the units for "Total" and "Unassigned" in the Krona plots would be "% of the number of all organisms in the sample"? I'm having a bit of a hard time finding useful interpretation for the default "depth" since it's dependent of the sequencing depth.

Sorry for the wall of text, but I'm a bit excited to use it because it's really hard to get any abundance numbers from competing programs other than read counts. Short answers are fine.

Thanks, Mihkel

vrmarcelino commented 3 years ago

Hi Mihkel,

You are right – correcting for gene length can bias the estimate of species abundance.

As you suspected, this is especially problematic when, for example, species A has a complete genome in the NT database, while species B has only a single gene. The default abundance calculation is: number of bases overlapping the reference / length of the reference. So, if we have 1000 reads matching genome A and 10 reads matching gene B – the abundance of A will appear to be a lot lower than B (because the reference template for A is many orders of magnitude larger than the reference template for B).

If we do the analyses using read abundance as a proxy for taxonomic abundance (with the flag --depth_unit = fr), then B will have a lower abundance than A – which is also wrong. I think this paper covers some of these issues: https://www.nature.com/articles/s41592-021-01141-3 Running CCMetagen with these different depth-unit settings helps to detect suspicious abundances.

One way of mitigating this bias is by building a database that contains only complete genomes, or a has a normalised number/size of reference sequences. RefSeq is likely better than the nt db if you really care about relative abundances.

Another possibility may be to use a single-copy gene as control, and normalize the read counts based on how many reads map to this single-copy gene.

Not sure I fully understand what you want to do – do you want to get the relative abundance of taxa, or the % of taxa (e.g. species "A" represents 10% of the diversity in these samples, regardless of its abundance)?

If it is the earlier – the krona plots should already give you relative abundances. If you want something more programmatic – I would use the csv file to calculate the relative abundances (in R for example).

mihkelvaher commented 3 years ago

Thank you for the quick response!

The paper nicely summarizes my frustration about most of the existing tools.

Failure to distinguish between read and taxonomic abundance is probably the reason that LEMMI (https://lemmi.ezlab.org/#/ https://dx.doi.org/10.1101%2Fgr.260398.119) scores CCMetagen poorly?

As you suspected, this is especially problematic when, for example, species A has a complete genome in the NT database, while species B has only a single gene. The default abundance calculation is: number of bases overlapping the reference / length of the reference. So, if we have 1000 reads matching genome A and 10 reads matching gene B – the abundance of A will appear to be a lot lower than B (because the reference template for A is many orders of magnitude larger than the reference template for B). Taxonomic abundance should still work if the reads land on the whole genome and on the single gene randomly and both are covered with reads uniformly. The estimations are probably more erratic for the gene with lower sequencing depth because sequencing might or might not hit the gene that is in the database.

Reading the paper and your suggestion, single-copy genes are the best for taxonomic abundance estimation but they are not an option in my case since I'm performing a wide search across all kingdoms with an extended not database.

The overall picture clearer now and I'm moving on with the taxonomic abundance that CCMetagen provides while keeping an eye out for issues generated from overlapping templates. Hopefully filtering out redundant subsequences is not needed.

Thanks!

vrmarcelino commented 3 years ago

Hi!

I believe this issue affects all taxonomic classifiers (if the same database is taken into consideration).

CCMetagen only performed poorly in the LEMMI paper when using different databases (NCBI nt for CCMetagen, while others often use RefSeq by default). The cool thing about the LEMMI platform is that you can change the settings and see how the rankings change. If you select 'Evaluation using an identical reference', include low abundance taxa and give precision more importance (which is what we were looking for when designing our program), the CCMetagen ranking is actually pretty good :)

Sounds like a good plan. Another thing to keep an eye on is the number of unclassified sequences (which you can get using the flag "-ef" in kma, and then "-ef y" in ccmetagen.py). In the more unknown and complex habitats, CCMetagen only manages to classify a tiny proportion of reads. If that is the case, then the relative taxa abundance is not representative of the community and CCMetagen is probably not the way to go. I'd be keen to hear potential ways forward.

mihkelvaher commented 3 years ago

Unclassified reads are definitely something I'm constantly monitoring and I was a bit disappointed when they didn't show up on Krona. Then again, showing them makes sense only in sequence abundance context so best to keep it as a separate quality metric. I'll probably extract and analyse them further (should be possible with CCMetagen_extract_seqs.py?)

Isn't taxonomic abundance still OK for complex samples if you report them as "Of the identified organisms there is 24% of species X"?

vrmarcelino commented 3 years ago

That is right, we can't incorporate them in the default krona plots because we don't know the 'unknown' genome sizes. Yes, that is how we usually report them. And in many cases, the identified organisms are a good proxy for the real community composition - we get plenty of unclassified reads simply because we don't have full genomes for most taxa yet.