snayfach / MicrobeCensus

MicrobeCensus estimates the average genome size of microbial communities from metagenomic data
http://genomebiology.com/2015/16/1/51
GNU General Public License v3.0

Feature Request: command line parameter to control subsampling #7

Closed by taltman 9 years ago

taltman commented 9 years ago

I want to sample more of my reads to see if it adjusts MC's results. Considering that it's already so fast (processing all of my data in a minute), I'm willing to splurge and sample more to see if the accuracy improves slightly. For example, in Additional Figure 4 from the paper, the GI Tract has more divergence than its peers at 500k reads sampled. In my runs, it is only sampling 344k reads, and it is totally unclear to me where that number comes from.

On a related note, if -n is not an option controlling sampling, then I'm not quite sure what it is for. If I simply want to limit my input to N entries, I'd use head or awk. Is it the first N, or is it a sampling of the full input?
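To make the distinction concrete, here is a minimal Python sketch of the two behaviors in question. It is not taken from MicrobeCensus and makes no claim about what -n actually does; the function names `first_n` and `reservoir_sample` are hypothetical, and reservoir sampling (Algorithm R) is just one common way to draw a uniform sample from a stream in a single pass.

```python
import itertools
import random


def first_n(records, n):
    """Return the first n records of a stream, like `head -n`."""
    return list(itertools.islice(records, n))


def reservoir_sample(records, n, seed=None):
    """Draw up to n records uniformly at random from a stream (Algorithm R).

    Hypothetical helper for illustration only; not MicrobeCensus code.
    """
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < n:
            # Fill the reservoir with the first n records.
            sample.append(record)
        else:
            # Replace a random slot with probability n / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < n:
                sample[j] = record
    return sample


if __name__ == "__main__":
    reads = [f"read_{i}" for i in range(1000)]
    print(first_n(iter(reads), 5))                    # always the first five reads
    print(reservoir_sample(iter(reads), 5, seed=0))   # five reads drawn from the whole input
```

If the input is ordered in any meaningful way (for example, by sequencing run or by sample), the first approach can be biased while the second is not, which is why the difference matters here.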

I think that this should be better documented.

taltman commented 9 years ago

Never mind, I think I misread the output in an earlier run. The output and sampling with '-n' make sense.