lmrodriguezr / nonpareil

Estimate metagenomic coverage and sequence diversity
http://enve-omics.ce.gatech.edu/nonpareil/

Using np to target subsampling #46

Closed: andrewjmc closed this issue 2 years ago

andrewjmc commented 3 years ago

Hello,

This is a question rather than an issue; apologies if this is a bad place to post.

Thanks for such a useful tool. I'd be interested in your views on using Nonpareil curves to guide subsampling.

Let's say we want to coassemble a large number of samples, and the combined data size exceeds what can be assembled within our resource limits. One solution is to randomly subsample a proportion of the reads from each sample. At the cost of less information, and less depth for minority organisms, this may allow the assembly to proceed.

However, some samples may be more informative for assembly than others: e.g. it may be better to subsample a high-coverage, low-diversity sample more aggressively.

I am considering an approach in which, for each sample, I estimate the total bp required to reach (say) 0.95 coverage and express it as a proportion of the bp actually sequenced. Samples with a proportion >= 1 are not subsampled, and those with a proportion < 1 are subsampled to that proportion of reads.
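To make the rule concrete, here is a minimal Python sketch. It assumes the bp required for the target coverage has already been estimated from each sample's Nonpareil curve and is simply passed in as a number; the function name and the sample values are hypothetical, for illustration only.

```python
def subsample_fraction(sequenced_bp: float, required_bp: float) -> float:
    """Return the fraction of reads to keep for one sample.

    sequenced_bp: total bp actually sequenced for the sample.
    required_bp:  bp estimated as necessary to reach the target coverage
                  (assumed to come from the sample's Nonpareil curve).
    """
    if sequenced_bp <= 0 or required_bp <= 0:
        raise ValueError("both values must be positive")
    # A proportion >= 1 means the sample is kept whole (no subsampling).
    return min(1.0, required_bp / sequenced_bp)


# Hypothetical samples: (bp sequenced, bp required for the target coverage).
samples = {
    "sample_A": (12e9, 3e9),   # sequenced beyond the target: subsample
    "sample_B": (5e9, 20e9),   # below the target: keep every read
}

fractions = {
    name: subsample_fraction(sequenced, required)
    for name, (sequenced, required) in samples.items()
}

for name, fraction in fractions.items():
    print(f"{name}: keep {fraction:.2%} of reads")
```

The resulting fraction could then be passed to whichever read-subsampling tool is already in use.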

In my head, it feels like this might provide an efficient way of subsampling for assembly (although I appreciate it will not save as much memory, since it will preserve more unique k-mers). But it would be great to sense-check the idea!

All thoughts welcome!

Thanks,

Andrew

lmrodriguezr commented 2 years ago

Hello Andrew,

I'm very glad Nonpareil is working for you!

Yes, I think the approach you describe could work, but please note that this would only address "higher-boundary" issues. More often than not, subsampling is useful to target the most abundant populations (not the entire community, for which more data is often better). If you have a target (abundant) population and you know its relative abundance, a better approach might be to subsample as much as necessary to reach a sequencing depth of around 10X for that target population. However, in the absence of that information, I think your approach would work well, but I would target a coverage much lower than 95% (e.g., 60%).
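As a rough illustration of the 10X idea, the sketch below computes the fraction of reads to keep for a target population. It makes assumptions that are not stated above: the population's genome size is known, and "relative abundance" means the fraction of sequenced bp that come from that population. The function name and all numbers are hypothetical.

```python
def fraction_for_target_depth(
    total_bp: float,
    relative_abundance: float,
    genome_size_bp: float,
    target_depth: float = 10.0,
) -> float:
    """Return the fraction of reads to keep so the target population
    ends up at roughly target_depth (e.g. 10X).

    Assumes relative_abundance is the fraction of sequenced bp that
    belong to the target population (an assumption of this sketch).
    """
    if not 0 < relative_abundance <= 1:
        raise ValueError("relative_abundance must be in (0, 1]")
    if total_bp <= 0 or genome_size_bp <= 0 or target_depth <= 0:
        raise ValueError("total_bp, genome_size_bp and target_depth must be positive")
    # Current depth of the target population in the full dataset.
    current_depth = total_bp * relative_abundance / genome_size_bp
    # Never keep more than all reads if the depth is already below target.
    return min(1.0, target_depth / current_depth)


# Hypothetical example: 10 Gbp metagenome, 5% of bp from a 5 Mbp genome,
# so the population currently sits at about 100X.
keep = fraction_for_target_depth(
    total_bp=10e9, relative_abundance=0.05, genome_size_bp=5e6
)
print(f"keep about {keep:.1%} of reads")
```

If the population is already below the target depth, the function returns 1.0 and the sample is left as-is.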

I hope this helps! Feel free to reopen the issue if you have any follow-ups.

Best wishes,

Miguel