statdivlab / radEmu


Length normalized metagenomic data #55

Closed anderson-fsep closed 5 months ago

anderson-fsep commented 5 months ago

Thanks for your work on this promising tool! Regarding its application to metagenomics, can radEmu handle length-normalized data as input? In metagenomics it is common to normalize counts by the length of the genome or gene (reads per kilobase, etc.) as well as by sequencing depth, though it seems radEmu already handles the depth issue. The examples of using radEmu that I have seen use integer count data (typical for 16S amplicon workflows). Can radEmu handle length-normalized data, and is there anything for a user to worry about here?

For instance, I am interested in using radEmu with the output of singleM (https://www.biorxiv.org/content/10.1101/2024.01.30.578060v1.full), a newer tool for profiling species in metagenomic data.

"singleM uses OTU coverage, defined as coverage = (nL) / (L - k + 1)

image

The ’coverage’ of each OTU is calculated using the established relationship between kmer coverage and read coverage as set out by Velvet

Where n is the number of reads with the OTU sequence, L is the length of the read and k is the length of the OTU sequence including inserts but excluding gaps (usually 60 bp). In practice, each read may have a different length and/or aligned length within the 20 amino acids, so the coverage contribution of each read is calculated separately according to the formula above. The coverage assigned to an OTU is the sum of each read’s contribution."

Thanks! Chris

adw96 commented 5 months ago

Hi @anderson-fsep! Great questions!

radEmu does not require integer counts, so length-normalized input such as singleM's coverage values can be used. Note that in practice the normalization may change the estimates slightly (because of the weighting that comes in through the estimating equations being solved), but in my opinion this is not a problem.

I hope that helps! Feel free to reopen or open a new issue if I didn't fully answer your question.