Closed: jolespin closed this issue 4 years ago
My guess here is that not all of your reads are at least 200 bp long. By default, the program trims reads to this length (the average read length of your sample) and discards any read shorter than it. If you want to use all of the reads in your sample, you need to set the read length to a value no larger than the length of your shortest read, though this is not recommended.
To answer your second question: the reads are not randomly sampled; they are taken in the order they appear in the input file.
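In rough terms, the sampling logic is something like the sketch below (illustrative Python only, not the actual MicrobeCensus source; `parse_fasta` and `sample_reads` are hypothetical names):

```python
# Illustrative sketch of the behavior described above, NOT MicrobeCensus code:
# reads are taken in file order, trimmed to a fixed length, and discarded
# entirely if they are shorter than that length.

def parse_fasta(path):
    """Yield (header, sequence) pairs from a plain FASTA file."""
    header, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def sample_reads(path, read_length, nreads):
    """Return up to nreads sequences, each trimmed to read_length."""
    sampled = []
    for _, seq in parse_fasta(path):
        if len(seq) < read_length:
            continue  # too short: discarded outright, not padded
        sampled.append(seq[:read_length])
        if len(sampled) == nreads:
            break  # sequential, not random: first n qualifying reads win
    return sampled
```

This is why the sampled count can come out below the requested `-n`: short reads are skipped rather than padded.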
Thanks! That makes a lot of sense. I have a script where I preprocess the reads first, so this is a great option. Also, thanks again for creating this tool. I tried reproducing some other methods from old papers and it was a nightmare. The fact that you packaged this up and made it accessible was a lifesaver for that project.
Glad to hear it! Methods papers should clearly go beyond just describing a method :)
I have a test set of 10 million reads.
I ran the following command:
```
(microbecensus_env) -bash-4.1$ run_microbe_census.py -n 100000000 -t 4 ./test.fasta testing/microbeconsensus.txt
```
Here are the results. There are fewer reads than the initial input: 10,000,000 - 7,576,069 = 2,423,931 reads were dropped.
Do you know what could be happening here? Is it possible to set the value to `-1` or something similar so that no subsampling is done at all? Also, if subsampling is done, is it possible to add a `seed` argument so we can get reproducible results?
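For example, a seeded reservoir sampler along these lines would make random subsampling reproducible (purely a hypothetical sketch of the requested feature; `seeded_subsample` is not an existing MicrobeCensus function):

```python
import random

def seeded_subsample(reads, nreads, seed=42):
    """Reservoir-sample up to nreads items from an iterable, reproducibly."""
    rng = random.Random(seed)  # fixed seed -> identical subsample every run
    reservoir = []
    for i, read in enumerate(reads):
        if i < nreads:
            reservoir.append(read)
        else:
            # Replace a reservoir slot with probability nreads / (i + 1),
            # which keeps every read equally likely to be selected.
            j = rng.randint(0, i)
            if j < nreads:
                reservoir[j] = read
    return reservoir
```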