rcedgar / muscle

Multiple sequence and structure alignment with top benchmark scores scalable to thousands of sequences. Generates replicate alignments, enabling assessment of downstream analyses such as trees and predicted structures.
https://drive5.com/muscle
GNU General Public License v3.0
186 stars 21 forks source link

Move the random number generation to a dedicated object #35

Closed althonos closed 1 week ago

althonos commented 2 years ago

Hello Robert!

A little introduction first: I'm a PhD student in bioinformatics and I spend my (little) spare time bringing more biological software into Python, to make it easier and safer for newbie bioinformaticians to build more complex pipelines. I developed Pyrodigal, PyHMMER and PyFastANI, which are Python bindings to a statically compiled version of respectively Prodigal, HMMER3 and FastANI.

I was successful in building and wrapping MUSCLE v5 in an Python extension (see althonos/pymuscle5). At the moment, it can replicate results from a muscle -align invocation and interface with Sequence and MSA types through Python code. I also replaced the OpenMP-based parallelization with the Python one (using a ThreadPool from the standard library), which should be easier for porting across platforms.

However, because the random number generation in the current MUSCLE code is done globally, this makes parallel runs inconsistent because they will affect the global generator. The solution proposed here, also done in HMMER, is to have a dedicated Random Number Generator class, and local objects for each alignment. This way, you can run several MPCFlat instances in parallel in the same process. I've tested muscle -align on some test files and got similar results, but maybe you'd want to setup a more comprehensive test case.

I'll probably make similar PRs later on with other aspects like alphabet and HMM params if you're okay with it, so that it's easier to use from the library code.

rcedgar commented 2 years ago

Hi Martin -- thanks for this, I can see you did a lot of work on it! However, I'm not sure this update is really needed. If I understood correctly, the issue is reproducibility with > 1 threads. If so, I would consider that nice to have but not essential. Given that the code was not designed with multi-threading reproducibility in mind, it's hard for me to check whether the updates you did are going to cover all use-cases.

Your PR breaks the Windows build with Microsoft Visual Studio because getpid() is not supported. I didn't test on MacOS because I don't have access to a Mac and I don't know how to run a github action on the code that would result from accepting a pull request (maybe this is not possible). Mac is particularly tricky for porting because the gcc command is not really gcc, it's an alias for clang, plus Apple hacked many system headers in non-compatible ways, so for future reference please check builds on all OSs before submitting a PR. See notes here: https://github.com/rcedgar/muscle/wiki/Design-of-Makefiles-and-Visual-Studio-project-build.

This idiom is repeated many places, would be better to embed this in the rngclass:

    if (Seed == 0)
        Seed = (uint32_t) (time(0)*getpid());
    M.m_rng.srand(Seed);
althonos commented 2 years ago

Hi Robert,

If I understood correctly, the issue is reproducibility with > 1 threads

Yes, this is for multithreading support, at the moment there is no way to run the code from Python in parallel, even with the same parameters (which are globals right now), because of the global RNG. But even without multithreading in mind, it forces the Python wrapper to reseed on every alignment.

Your PR breaks the Windows build with Microsoft Visual Studio because getpid() is not supported.

Actually, getpid was already used before my PR, but directly in myutils.cpp, I just copied the code from there. I think the build breaks because of missing Windows-specific header inclusion in rng.cpp

I didn't test on MacOS because I don't have access to a Mac and I don't know how to run a github action on the code that would result from accepting a pull request (maybe this is not possible). Mac is particularly tricky for porting because the gcc command is not really gcc, it's an alias for clang, plus Apple hacked many system headers in non-compatible ways, so for future reference please check builds on all OSs before submitting a PR.

I also don't have access to a Mac to be honest. But I can help setting up GitHub Actions to test on the three platforms; I actually have tests for Windows/Linux/OSX for the pyMUSCLE wrapper (and noticed some differences in the results between Unix and Windows).

This idiom is repeated many places, would be better to embed this in the rng class:

  if (Seed == 0)
      Seed = (uint32_t) (time(0)*getpid());
  M.m_rng.srand(Seed);

I agree, I'll make a new constructor.

rcedgar commented 2 years ago

Thanks for the offer re. github Actions, actually they are already implemented in https://github.com/rcedgar/muscle/tree/main/.github/workflows,, the problem is I don't know how to run them on PR code without merging into the main branch. You can run an Action on a non-main branch, but AFAIK you can't run on action on a PR or merge a PR into a non-main branch.

Correction actually it is possible: https://github.blog/2016-08-15-change-the-base-branch-of-a-pull-request/.

rcedgar commented 2 years ago

I guess you could try the Actions on your fork, that would enabling testing that the build completes successfully.

althonos commented 2 years ago

I fixed the headers inclusion in rng.cpp, and also added the new source files to the VisualCode project files. I also enable the Actions in my fork and I can confirm it now that the changes compile on Windows.

Is there any particular reason not to trigger the CI on every push? Since the repository is public GitHub won't charge anything, and the build time is reasonable enough (<5m) that it can be enabled IMO.

rcedgar commented 2 years ago

Hi Martin -- I'm still not understanding the issue you are trying to solve here. How exactly does the global RNG prevent python from running in parallel? Regardless, the side effects of your proposed changes are too broad. It is an unreasonable burden on the rest of the code that it must now create rng objects and explicitly pass them around and down to every place where an RNG is needed. The fix should be isolated to the internals of the existing RNG implementation, randu32() plus maybe one or two more other functions e.g. setting the seed. If you're using OMP you can get a thread identifier (tid) from within randu32() using omp_get_thread_num() and use the tid to maintain a per-thread RNG state. If you're using some other multi-threading system, then there must be a similar solution. Then the RNG implementation remains isolated within RNG code where it belongs and the rest of the code is unaffected.