sfrenk closed this issue 5 years ago
I looked at this issue some time ago, when I ported the assembler.cpp code to Java for no particular reason. I could not replicate the .cpp results exactly, because that would have meant duplicating the iterator order of some C++ data structures I didn't want to port. You might look at the maxContigs = 5000; constant in NativeAssembler.java.
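To illustrate the kind of effect described above, here is a minimal sketch of how an unordered collection combined with a size cap can yield run-to-run differences. It is not code from abra2: the Contig class, the firstN and firstNSorted helpers, and the limit of 2 are hypothetical, and whether this mechanism is the actual cause in abra2 is an assumption.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class IterationOrderSketch {
    // Hypothetical stand-in for a contig; NOT a class from abra2.
    static final class Contig {
        final String sequence;
        Contig(String sequence) { this.sequence = sequence; }
        // No equals/hashCode override: a HashSet falls back to identity
        // hash codes, which are not guaranteed to match between JVM runs.
    }

    // Keep at most `limit` contigs, in whatever order the collection iterates.
    static List<String> firstN(Collection<Contig> contigs, int limit) {
        List<String> kept = new ArrayList<>();
        for (Contig c : contigs) {
            if (kept.size() >= limit) break;
            kept.add(c.sequence);
        }
        return kept;
    }

    // Deterministic variant: impose a total order before applying the cap.
    static List<String> firstNSorted(Collection<Contig> contigs, int limit) {
        List<Contig> sorted = new ArrayList<>(contigs);
        sorted.sort(Comparator.comparing((Contig c) -> c.sequence));
        return firstN(sorted, limit);
    }

    public static void main(String[] args) {
        Set<Contig> unordered = new HashSet<>();
        for (String s : new String[] {"ACGT", "TTGA", "GGCC", "CATG"}) {
            unordered.add(new Contig(s));
        }
        // May change from run to run, because the set's order can change.
        System.out.println("hash order:   " + firstN(unordered, 2));
        // Stable: the cap is applied to a sorted list.
        System.out.println("sorted order: " + firstNSorted(unordered, 2));
    }
}
```

If a cap such as maxContigs truncates whatever happens to come first, sorting (or using an insertion-ordered collection such as LinkedHashSet) before truncating is one way to make the selection reproducible.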
I've run some quick tests and do see some variability in the assembledContigCount, but not the nonAssembledContigCount
Are you seeing variability in both?
I still see some variation in nonAssembledContigCount, albeit not as much as in assembledContigCount.
OK, I'll see what I can track down. However, this may take some time.
Hi @mozack, thanks for looking into this. To help with the investigation, I've put together a reproducible example of the variability issues we've been having:
https://bitbucket.org/achillestx/abra-reproducibility/src/master/
Thanks for sharing this and apologies for the delay. I'll start digging into this in earnest shortly.
This should be fixed in version 2.20. Please let me know if you continue to see problems.
Thanks a lot for fixing this - it definitely seems to be working now. I just ran the example 50 times and got identical results each time.
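For anyone who wants to run a similar repeated-run check, here is a minimal sketch that compares output files by SHA-256 digest. The RunDigestCheck class and its methods are hypothetical and not part of abra2 or the linked example. It also assumes that byte-identical files are the right criterion, which may be stricter than necessary if the output embeds run-specific details such as file paths in its header.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RunDigestCheck {
    // Stream the file through a SHA-256 digest and return it as lowercase hex.
    static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // Reading is enough; DigestInputStream updates the digest.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // True when every file has the same digest (or the list is empty).
    static boolean allIdentical(List<Path> files) throws IOException, NoSuchAlgorithmException {
        Set<String> digests = new HashSet<>();
        for (Path f : files) {
            digests.add(sha256(f));
        }
        return digests.size() <= 1;
    }

    public static void main(String[] args) throws Exception {
        // Pass the output file of each run as an argument.
        List<Path> files = new ArrayList<>();
        for (String a : args) {
            files.add(Paths.get(a));
        }
        System.out.println(allIdentical(files) ? "identical" : "DIFFERENT");
    }
}
```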
My team recently performed two separate, identical runs of abra2 (version 2.15) on copies of the same set of BAM files. We expected the resulting processed BAMs to be identical in size between the two sets, but their file sizes differed slightly. I looked at the log files (from stdout/stderr) and retrieved the values corresponding to assembledContigCount and nonAssembledContigCount in src/main/java/abra/ReAligner.java for each region. These values varied between the two runs (see the attached plot for the assembledContigCount variable; note that the values have been log10 transformed). This suggests that abra2 is non-deterministic. We also found that downstream variant calling differed between the two runs, possibly as a result of the varying output from abra2.
Do you know what might be causing this non-determinism, and is there any way to make abra2 deterministic?