Major Performance Improvements for thal

Note

This PR includes the changes from #76. If those changes are not accepted, I can rebase this PR onto the current release of primer3. The changes specific to this PR can be seen here.

What These Changes Do

The main performance gains come from removing a lot of redundant work within thal. These changes provide a roughly 2x speedup when calling thal. The output of thal is not changed, so it calculates the exact same result in half the time.

For dimer alignments, thal spent about half of its runtime in RSH, called by calc_bulge_internal, but the entropy and enthalpy that RSH was calculating had already been calculated in fill_matrix before it called calc_bulge_internal. Now the precomputed entropy and enthalpy are passed to calc_bulge_internal, removing the need to call RSH.

For monomer alignments, thal spent most of its runtime in calc_terminal_bp, which calculated entropy and enthalpy using the END5_1-4 functions. These functions compute both the entropy and enthalpy of a stack, but then return only one of those values. So to get both the entropy and the enthalpy, END5_X had to be called twice, and both values were computed twice. I removed the END5_X functions entirely and inlined their functionality into calc_terminal_bp.
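To illustrate the redundancy, here is a minimal sketch rather than the actual thal code. The tables, the selector flag, and the names demo_end5 and demo_terminal_bp are all hypothetical stand-ins; the real END5_X functions do more work per call.

```c
#include <assert.h>

/* Hypothetical parameter tables indexed by a base code 0..3.
 * These are NOT primer3's thermodynamic tables. */
static const double demo_dh[4] = { -7.9, -8.4, -7.8, -7.2 };
static const double demo_ds[4] = { -22.2, -22.4, -21.0, -20.4 };

/* Old pattern: both values are computed, but only the one selected by
 * want_enthalpy is returned, so callers need two calls to get both. */
static double demo_end5(int base, int want_enthalpy)
{
    assert(base >= 0 && base < 4);
    double h = demo_dh[base];
    double s = demo_ds[base];
    return want_enthalpy ? h : s;
}

/* New pattern: one pass, both results written through out-parameters. */
static void demo_terminal_bp(int base, double *h_out, double *s_out)
{
    assert(base >= 0 && base < 4);
    *h_out = demo_dh[base];
    *s_out = demo_ds[base];
}
```

A caller that previously did demo_end5(b, 1) followed by demo_end5(b, 0) now makes a single call to demo_terminal_bp and gets the same two numbers.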

Additionally, RSH and LSH took an argument double *EntropyEnthalpy that was used to pass the computed entropy and enthalpy values back to the caller. For some reason, the functions that called RSH and LSH would declare a double *SH and then use malloc to allocate space for two doubles. Since SH was only used as a local variable in these functions, I changed it to a local array, double SH[2]. This removes a lot of unnecessary calls to malloc in deeply nested loops.
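The before and after shapes look roughly like the sketch below. demo_sh is a hypothetical stand-in for RSH/LSH with placeholder arithmetic, and the index layout (entropy at 0, enthalpy at 1) is an assumption made for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for RSH/LSH: writes two results into the caller's buffer.
 * The values are placeholders, not real thermodynamics. */
static void demo_sh(int i, int j, double *EntropyEnthalpy)
{
    EntropyEnthalpy[0] = -0.1 * (i + j); /* placeholder entropy */
    EntropyEnthalpy[1] = -1.0 * (i + j); /* placeholder enthalpy */
}

/* Before: a heap allocation and a free on every loop iteration. */
static double demo_sum_heap(int n)
{
    assert(n >= 0);
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        double *SH = malloc(2 * sizeof *SH);
        if (SH == NULL)
            abort();
        demo_sh(i, i, SH);
        total += SH[1];
        free(SH);
    }
    return total;
}

/* After: a local array, no allocator traffic in the loop. */
static double demo_sum_stack(int n)
{
    assert(n >= 0);
    double total = 0.0;
    double SH[2];
    for (int i = 0; i < n; i++) {
        demo_sh(i, i, SH);
        total += SH[1];
    }
    return total;
}
```

Both versions produce the same result; the second simply avoids calling malloc and free inside the loop.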

To find these problems I needed to make a lot of structural changes to the code. The original code used a lot of macros and global variables, which made it hard to tell where state was being modified. I removed almost all of the macros and reduced the scope of all non-const global variables except for the thermodynamic parameter tables. With these changes, any state variables are passed as arguments to the functions that need them. Any state variables that are passed by reference but are not modified by the callee are now passed as const. This makes it clearer where the state is being modified, and it also means that, as long as all threads are using the same thermodynamic parameters, thal is now thread safe. This allows users to call thal from within their own code by just pasting thal.c, thal.h, and thal_default_params.h into their project.
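The calling convention described above can be sketched as follows. struct demo_state and demo_scan are hypothetical names invented for this example; they only show the pattern of const inputs plus caller-owned mutable state, not thal's real signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-call state, owned by the caller instead of living
 * in a global variable. */
struct demo_state {
    double best_score;
    int best_index;
};

/* Read-only inputs are const; the only thing modified is *st.
 * Because no global is written, two threads can call this at the same
 * time as long as each passes its own demo_state. */
static void demo_scan(const double *scores, size_t n, struct demo_state *st)
{
    assert(scores != NULL && st != NULL);
    st->best_index = -1;
    st->best_score = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (st->best_index < 0 || scores[i] > st->best_score) {
            st->best_score = scores[i];
            st->best_index = (int)i;
        }
    }
}
```

The signature alone tells a reader which arguments can change, which is the property that made the redundant work in thal easier to spot.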

Additional small changes:

Renamed functions to make it clear whether they are used for monomer or dimer alignments e.g. fill_matrix and fill_matrix2 are now fill_matrix_dimer and fill_matrix_monomer
Combined CBI and calc_bulge_internal2 to make it more similar to how the corresponding dimer function is structured
Removed maxTm and maxTm2 functions and inlined their functionality into their corresponding fill_matrix functions to make it more clear what is happening in fill_matrix
Removed functions for accessing the thermodynamic parameter tables (e.g. Sstack) and now access those tables directly where needed
Removed the addition of SMALL_NON_ZERO to SH values. Comments say that it was done so the compiler knows SH was changed, but the compiler already knows that
Removed a lot of redundant calls to isFinite. In many places G is calculated with H, then there are checks for G > 0 followed by !isFinite(H), but if H is infinite G will also be infinite, so G > 0 is all that is needed
Added -march=native to the compiler flags to allow for the use of vector instructions if available
Added dg, ds, and dh fields to the thal_results struct
Reorganized the ordering of the function prototypes and definitions to group together functions that are used together
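For the isFinite item in the list above, the reasoning can be checked with a small sketch. It assumes that G is computed as H - T*S, that a non-finite H is always positive infinity, and that S is finite; demo_gibbs, demo_reject_old, and demo_reject_new are hypothetical names, not functions from thal.

```c
#include <assert.h>
#include <math.h>

/* Free energy from enthalpy H, entropy S, and temperature T (kelvin). */
static double demo_gibbs(double H, double S, double T)
{
    assert(T > 0.0);
    return H - T * S;
}

/* Old check: tests G and then also tests whether H is finite. */
static int demo_reject_old(double H, double S, double T)
{
    double G = demo_gibbs(H, S, T);
    return G > 0 || !isfinite(H);
}

/* New check: if H is +infinity and S is finite, G is +infinity too,
 * so G > 0 already covers that case. */
static int demo_reject_new(double H, double S, double T)
{
    double G = demo_gibbs(H, S, T);
    return G > 0;
}
```

Under those assumptions the two checks agree on every input. They would differ if H could be negative infinity or if S could also be infinite, so the simplification depends on how infinity is used in the surrounding code.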

Tests

All make test and valgrind tests pass. On my computer, the valgrind tests went from taking 8660 seconds to 4524 seconds with these changes. So these changes will make testing future PRs much faster.

Performance gains were measured by running thal on 1 million randomly generated sequences or pairs of sequences for the following conditions:

Length 20 hairpin
Length 20/20 dimer
Length 20/40 dimer

All three of these inputs had a ~2x speedup. These performance gains do not include any of the gains from #76 since I initialized the thermodynamic parameters before starting timing, and timed calls to thal with the generated sequences only.
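A timing harness along these lines could look like the sketch below. It is not the harness used for the numbers above: demo_random_seq, demo_workload, and demo_time_calls are hypothetical names, and demo_workload is a trivial placeholder where a wrapper around a thal call would go. The sequences are generated before the clock starts so that only the calls themselves are timed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Fills buf with len random bases and a terminating NUL.
 * buf must have room for len + 1 chars. */
static void demo_random_seq(char *buf, size_t len)
{
    static const char bases[] = "ACGT";
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        buf[i] = bases[rand() % 4];
    buf[len] = '\0';
}

static volatile unsigned long demo_sink; /* keeps the work from being optimized out */

/* Placeholder workload: counts G and C bases. */
static void demo_workload(const char *seq)
{
    unsigned long gc = 0;
    for (; *seq != '\0'; seq++)
        gc += (*seq == 'G' || *seq == 'C');
    demo_sink += gc;
}

/* Generates n sequences up front, then times only the calls to fn.
 * Returns elapsed CPU seconds, or -1.0 if allocation fails. */
static double demo_time_calls(void (*fn)(const char *), size_t n, size_t len)
{
    char *seqs = malloc(n * (len + 1));
    if (seqs == NULL)
        return -1.0;
    for (size_t i = 0; i < n; i++)
        demo_random_seq(seqs + i * (len + 1), len);

    clock_t start = clock();
    for (size_t i = 0; i < n; i++)
        fn(seqs + i * (len + 1));
    clock_t end = clock();

    free(seqs);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling demo_time_calls(demo_workload, 1000000, 20) would time one million length-20 inputs; for dimer cases the callback would need a pair of sequences instead of one.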

Closing

Please let me know if you need me to make any modifications in order to accept these changes. I can provide further explanation of the changes I made if necessary. I understand that this PR makes a lot of changes, but I think that it will make future improvements much easier.
