Closed DougTownsend closed 2 months ago
Thank you for your excellent contribution. It took me 8 hours to go through your commits, but it was great, as each step was perfectly described and the reasoning was clear. The runtime of ./paranoid_tests.sh went down from 4:50 to 2:55 on my machine. This is impressive.
Best, Andreas
Note
This PR includes the changes from #76. If those changes are not accepted I can modify this PR to be based on the current release of primer3. The changes specific to this PR can be seen here.
What These Changes Do
The main performance gains come from removing a lot of redundant work within `thal`. These changes provide a roughly 2x speedup when calling `thal`. The output of `thal` is not changed, so it calculates the exact same result in about half the time.

For dimer alignments, `thal` spent about half of its runtime in `RSH` called by `calc_bulge_internal`, but the entropy and enthalpy that `RSH` was calculating were already calculated in `fill_matrix` before calling `calc_bulge_internal`. Now the precomputed entropy and enthalpy are passed to `calc_bulge_internal`, removing the need to call `RSH`.

For monomer alignments,
`thal` spent most of its runtime in `calc_terminal_bp`, which calculated entropy and enthalpy using the `END5_1-4` functions. These functions compute both the entropy and enthalpy of a stack, but then only return one of those values. So in order to get both the entropy and the enthalpy, `END5_X` would need to be called twice, and both values would be computed twice. I removed the `END5_X` functions entirely and inlined their functionality into `calc_terminal_bp`.

Additionally,
`RSH` and `LSH` took an argument `double *EntropyEnthalpy` that was used to pass the computed entropy and enthalpy values back to the caller. For some reason the functions that called `RSH` and `LSH` would declare a `double *SH` and then use `malloc` to allocate space for two `double` values. Since `SH` was being used as a local variable in these functions, I changed it to just use a local array `double SH[2]`. This removes a lot of unnecessary calls to `malloc` in deeply nested loops.

In order to find these problems I needed to make a lot of structural changes to the code. The original code used a lot of macros and global variables, which made it hard to tell where state was being modified. I removed almost all of the macros and reduced the scope of all non-`const` global variables except for the thermodynamic parameter tables. With these changes, any state variables are passed as arguments to the functions that need them. Any state variables that are passed by reference but are not modified by the callee are now passed as `const`. This makes it clearer where the state is being modified, and also means that, as long as all threads are using the same thermodynamic parameters, `thal` is now thread safe. This allows users to call `thal` from within their own code by just pasting `thal.c`, `thal.h`, and `thal_default_params.h` into their project.

Additional small changes:
- `fill_matrix` and `fill_matrix2` are now `fill_matrix_dimer` and `fill_matrix_monomer`
- [...] `CBI` and `calc_bulge_internal2` to make it more similar to how the corresponding dimer function is structured
- Removed the `maxTm` and `maxTm2` functions and inlined their functionality into their corresponding `fill_matrix` functions to make it clearer what is happening in `fill_matrix`
- [...] `Sstack`) and just directly access those tables where needed.
- [...] `SMALL_NON_ZERO` to `SH` values. Comments say that it was done so the compiler knows `SH` was changed, but the compiler already knows that
- [...] `isFinite`. In many places `G` is calculated with `H`, then there are checks for `G > 0` followed by `!isFinite(H)`, but if `H` is infinite `G` will also be infinite, so `G > 0` is all that is needed
- Added `-march=native` to the compiler flags to allow for the use of vector instructions if available
- Added `dg`, `ds`, and `dh` fields to the `thal_results` struct

Tests
All `make test` and valgrind tests pass. On my computer, the valgrind tests went from taking 8660 seconds to 4524 seconds with these changes. So these changes will make testing future PRs much faster.

Performance gains were measured by running `thal` on 1 million randomly generated sequences or pairs of sequences for the following conditions:

All three of these inputs had a ~2x speedup. These performance gains do not include any of the gains from #76 since I initialized the thermodynamic parameters before starting timing, and timed calls to `thal` with the generated sequences only.

Closing
Please let me know if you need me to make any modifications in order to accept these changes. I can provide further explanation of the changes I made if necessary. I understand that this PR makes a lot of changes, but I think that it will make future improvements much easier.