I want to sample more of my reads to see whether it changes MC's results. Considering that it's already so fast (processing all of my data in about a minute), I'm willing to splurge on a larger sample to see if accuracy improves slightly. For example, in Additional Figure 4 from the paper, the GI Tract shows more divergence than its peers at 500k reads sampled. In my runs, only 344k reads are sampled, and it is unclear to me where that number comes from.
On a related note, if `-n` is not the option controlling sampling, then I'm not sure what it is for. If I simply wanted to limit my input to N entries, I'd use `head` or `awk`. Does it take the first N entries, or a random sample drawn from the full input?
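For reference, here is the kind of workaround I mean — a minimal sketch, assuming FASTQ input (where each record spans exactly 4 lines), of taking the first N reads with `head`. Note this gives the *first* N records, not a random sample, which is exactly the distinction I'm asking about:

```shell
# Build a tiny 5-read FASTQ just for demonstration.
printf '@read%d\nACGT\n+\nIIII\n' 1 2 3 4 5 > demo.fastq

# Take the first N reads: one FASTQ record = 4 lines, so N reads = 4*N lines.
N=3
head -n $((4 * N)) demo.fastq > first_N.fastq

# Count the records that survived.
echo $(( $(wc -l < first_N.fastq) / 4 ))   # prints 3
```

A truly random subsample would need something like `seqtk sample` instead; if `-n` behaves like the `head` version above, that would be worth stating in the docs.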
Either way, I think this behavior should be better documented.