lmrodriguezr / nonpareil

Estimate metagenomic coverage and sequence diversity
http://enve-omics.ce.gatech.edu/nonpareil/

Not reproducible results #48

Closed by VGalata 3 weeks ago

VGalata commented 2 years ago

Hello,

I have a question regarding the reproducibility of the results: I ran nonpareil on the same input using the same command line and got slightly different results for both runs. Is that something to be expected? Do you know what the source of this randomness is and whether the analysis could be made deterministic in the future?

Used version: nonpareil=3.3.3=r341h470a237_0 installed via conda

Thank you in advance!

Best, Valentina

cjfields commented 2 years ago

Coming into this a bit late, but one of the parameters is a random seed, and the documentation mentions sampling. So I think this is both completely expected and possible to make reproducible by setting -r to the same seed between runs:

-r <int> | Random generator seed. By default current time.

VGalata commented 2 years ago

Dear @cjfields,

Could you clarify how you use the -r option and with which version of the tool? Also, I do not see this option listed when running nonpareil -h - neither in the mentioned version 3.3.3 nor in the latest one (3.3.4).

I ran version v3.303 (3.3.3, r341h470a237_0) twice with the same command and the -r option set. The output files from the two runs have different md5sums and different content, except for the *.npl files, which don't contain any relevant output anyway.

Here are the commands I executed:

nonpareil -s some.reads.fq -T kmer -f fastq -r 23 -b test.1
nonpareil -s some.reads.fq -T kmer -f fastq -r 23 -b test.2
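The check described above (run twice with the same seed, then compare the outputs) can be sketched as a small shell helper. This is only an illustration: `same_content` is a hypothetical function name, a byte-wise `cmp` stands in for comparing md5sums, and the file names in the usage comment are assumptions derived from the `-b` prefixes in the commands above.

```shell
# Hypothetical helper: succeeds (exit status 0) only if the two files
# are byte-for-byte identical. cmp -s compares silently.
same_content() {
    cmp -s "$1" "$2"
}

# Assumed usage after the two runs (file names are guesses, not confirmed):
#   if same_content test.1.npo test.2.npo; then
#       echo "identical"
#   else
#       echo "differ"
#   fi
```

A byte-wise comparison gives the same verdict as matching md5sums and avoids depending on a particular checksum tool being installed.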

lmrodriguezr commented 2 years ago

Thanks for bringing this to our attention! I have now implemented consistency with -r when using -T alignment. Note that it may still produce slightly different results with different numbers of threads (-t).

For -T kmer, we use an implementation of random_device, so it needs a little more work.

@gunturus Do you think the kmer kernel could be migrated to a deterministic implementation instead?
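As a rough shell analogy (not Nonpareil's actual code) for why a seed cannot affect an entropy source such as `random_device`: a seeded pseudo-random generator replays the same sequence for the same seed, whereas values read from the operating system's entropy pool have no seed to set. The function names below are hypothetical, and awk's `srand`/`rand` merely stand in for a seedable generator.

```shell
# Seedable generator: awk's srand(seed) fixes the sequence rand() returns.
seeded_draws() {
    awk -v seed="$1" 'BEGIN { srand(seed); for (i = 0; i < 3; i++) print rand() }'
}

# Entropy source: four bytes from /dev/urandom printed as an unsigned integer.
# There is no seed parameter, so repeated calls are not reproducible.
entropy_draw() {
    od -An -N4 -tu4 /dev/urandom | tr -d ' \n'
}

# Same seed, same sequence:
#   [ "$(seeded_draws 23)" = "$(seeded_draws 23)" ] && echo "reproducible"
```

Under this analogy, a deterministic kmer kernel would draw from something like `seeded_draws` with the value passed to -r, rather than from something like `entropy_draw`.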

cjfields commented 2 years ago

Happy to see @lmrodriguezr's answer (and I agree it's good that you raised this); I had planned to reply that this sounds like a definite bug.

VGalata commented 2 years ago

Dear @lmrodriguezr and @cjfields,

Thank you both for looking into this!