statdivlab / radEmu


Length normalized metagenomic data #55

Closed anderson-fsep closed 5 months ago

anderson-fsep commented 5 months ago

Thanks for your work on this promising tool! Regarding its application to metagenomics, can radEmu handle length-normalized data as input? In metagenomics it is common to normalize counts by the length of the genome or gene (reads per kilobase, etc.) as well as by sequencing depth, though it seems radEmu already handles the depth issue. The examples of using radEmu that I have seen use integer count data (typical for 16S amplicon workflows). Can radEmu handle length-normalized data, and is there anything for a user to worry about here?

For instance, I am interested in using radEmu with the output of singleM (https://www.biorxiv.org/content/10.1101/2024.01.30.578060v1.full), a newer tool for profiling species in metagenomic data.

"singleM uses OTU coverage, defined as coverage = (nL) / (L - k + 1)

image

The ’coverage’ of each OTU is calculated using the established relationship between kmer coverage and read coverage as set out by Velvet

Where n is the number of reads with the OTU sequence, L is the length of the read and k is the length of the OTU sequence including inserts but excluding gaps (usually 60 bp). In practice, each read may have a different length and/or aligned length within the 20 amino acids, so the coverage contribution of each read is calculated separately according to the formula above. The coverage assigned to an OTU is the sum of each read’s contribution."

Thanks! Chris

adw96 commented 5 months ago

Hi @anderson-fsep! Great questions!

radEmu does not require integer counts, so length-normalized input such as singleM's coverage values can be used. Note that in practice the normalization may change the estimates slightly (because of the weighting that comes in through the estimating equations being solved), but in my opinion this is not a problem.

I hope that helps! Feel free to reopen or open a new issue if I didn't fully answer your question.