Closed: jolespin closed this issue 4 years ago
My guess here is that not all of your reads are at least 200 bp long. By default, the program trims reads to this length (the average read length of your sample) and discards any read shorter than it. If you want to use all of the reads in your sample, you need to set the read length to a value no larger than the length of your shortest read, though this is not recommended.
To answer your second question: the reads are not randomly sampled; they are taken in the order they appear in the input file.
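In rough terms, the sampling logic is something like the sketch below (illustrative Python only, not the actual MicrobeCensus source; `parse_fasta` and `sample_reads` are hypothetical names):

```python
# Illustrative sketch of the behavior described above, NOT MicrobeCensus code:
# reads are taken in file order, trimmed to a fixed length, and discarded
# entirely if they are shorter than that length.

def parse_fasta(path):
    """Yield (header, sequence) pairs from a plain FASTA file."""
    header, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def sample_reads(path, read_length, nreads):
    """Return up to nreads sequences, each trimmed to read_length."""
    sampled = []
    for _, seq in parse_fasta(path):
        if len(seq) < read_length:
            continue  # too short: discarded outright, not padded
        sampled.append(seq[:read_length])
        if len(sampled) == nreads:
            break  # sequential, not random: first n qualifying reads win
    return sampled
```

This is why the sampled count can come out below the requested `-n`: short reads are skipped rather than padded.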
Thanks! That makes a lot of sense. I have a script where I preprocess the reads first, so this is a great option. Also, thanks again for creating this tool. I tried reproducing some other methods from old papers and it was a nightmare. The fact that you packaged this up and made it accessible was a lifesaver for that project.
Glad to hear it! Methods papers should clearly go beyond just describing a method :)
I have a test set of 10 million reads.
I ran the following command:
```
(microbecensus_env) -bash-4.1$ run_microbe_census.py -n 100000000 -t 4 ./test.fasta testing/microbeconsensus.txt
```
Here are the results. There are fewer reads than the initial input: 10,000,000 - 7,576,069 = 2,423,931 reads were dropped.
Do you know what could be happening here? Is it possible to set the value to `-1` or something similar so that no subsampling is done at all? Also, if subsampling is done, is it possible to add a `seed` argument so we can get reproducible results?
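For example, a seeded reservoir sampler along these lines would make random subsampling reproducible (purely a hypothetical sketch of the requested feature; `seeded_subsample` is not an existing MicrobeCensus function):

```python
import random

def seeded_subsample(reads, nreads, seed=42):
    """Reservoir-sample up to nreads items from an iterable, reproducibly."""
    rng = random.Random(seed)  # fixed seed -> identical subsample every run
    reservoir = []
    for i, read in enumerate(reads):
        if i < nreads:
            reservoir.append(read)
        else:
            # Replace a reservoir slot with probability nreads / (i + 1),
            # which keeps every read equally likely to be selected.
            j = rng.randint(0, i)
            if j < nreads:
                reservoir[j] = read
    return reservoir
```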