rupertnash opened this issue 9 years ago
@rupertnash, it might be best to open a pull request for that branch. That makes it easier to see what the changes are, whether the tests pass, etc.
I think it's great that we finally got around to merging these improvements, @rupertnash.
WRT 3, I agree with you. I would go as far as deleting the PhasedBroadcast infrastructure altogether. I don't see much value in maintaining both implementations.
@mobernabeu good! The only sad thing is that we can't entirely remove the horror, because of the visualisation stuff: that can't be replaced by an MPI collective, since the inputs and the output of the reduction function are different sizes.
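For context (this is illustrative, not HemeLB code): MPI constrains user-defined reduction operators to combine equal-length buffers elementwise, which is exactly the constraint the vis reduction violates. A minimal sketch of the signature `MPI_Op_create` requires:

```c++
#include <mpi.h>

// MPI_Op_create only accepts functions with this signature: `len` elements
// of invec are folded into the same number of elements of inoutvec, so the
// output of the reduction is necessarily the same size as each input.
void ElementwiseSum(void* invec, void* inoutvec, int* len,
                    MPI_Datatype* /*datatype*/)
{
  const double* in = static_cast<const double*>(invec);
  double* inout = static_cast<double*>(inoutvec);
  for (int i = 0; i < *len; ++i)
    inout[i] += in[i];  // in-place and elementwise: sizes cannot differ
}
// Usage: MPI_Op op; MPI_Op_create(&ElementwiseSum, /*commute=*/1, &op);
```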
And we're definitely OK with requiring MPI 3.0? I think it's reasonable.
What a great reason to deprecate the runtime visualisation stuff! :smile: I would be happy to put that code in a separate branch and move on without it.
I remember there were some attempts at decoupling runtime visualisation from HemeLB during CRESTA. Any comments @djgroen?
I'm fine with moving to MPI 3.0. Is it safe to assume that most supercomputers currently support this version of the standard?
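Whatever we conclude about support, the requirement can at least fail loudly at build time. A minimal sketch, using the `MPI_VERSION` macro that the standard requires `mpi.h` to define:

```c++
// Reject pre-3.0 MPI implementations at compile time, rather than failing
// at runtime inside MPI_Iallreduce.
#include <mpi.h>

#if MPI_VERSION < 3
#error "The non-blocking collectives branch requires an MPI 3.0 implementation"
#endif
```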
I have just had an idea: I can propose a data science MSc project to sort out the vis for HemeLB. Rip out the hand-made stuff and put something real in (e.g. VTK/ParaView/the DLR thing). And then get rid of it.
:+1:
Since I was about to start some MPI work, I figured I would try and see if I could get the nbc branch to work with master.
So nbc_rebase is a rebase of this branch onto master (I think rebasing will make it easier to then merge nbc with the red-blood-cell branch). The tests pass, but the regression test still segfaults when run with more than one processor.
I may or may not have time to figure out what is going wrong, but the rebase is there if you want to use it, @rupertnash.
Thanks, I started looking into this. There was a problem with the debugger which I've fixed on this branch.
I guess the PR #591 is now defunct?
I've also chased up the OpenMPI bug. It is fixed on GitHub and will be released in v1.10.1.
This commit is the fix from which you may be able to patch your install: https://github.com/open-mpi/ompi/commit/c99a8a55babe6d260c96332bbe85ce8ac496edbb
Hi chaps,
So I just had a think about this, and I have the following design for the `WhateverTesters`, apropos of my point 2 above (switch from `MPI_Wait` to `MPI_Test` for the monitoring parts). I'd appreciate your thoughts before I implement it. The table below shows the action for each `Step` of a `Phase`, depending on whether a collective is in flight at that moment; a sketch in code follows the table.
| Step | running = false | running = true |
|---|---|---|
| BeginAll | Start here | |
| BeginPhase / RequestComms | NOP | NOP |
| PreSend | Check local stability | NOP |
| Send | `MPI_Iallreduce`; set `running = true` | NOP |
| PreWait / PreRecv | NOP | NOP |
| Wait | NOP | `MPI_Test`; if done, set `running = false` |
| EndPhase / PostReceive | SetState | NOP |
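To make the table concrete, here's a rough sketch of how a tester could drive it. The class and member names (`StabilityTester`, `CheckLocalStability`, `SetState`) are placeholders, not the real HemeLB ones:

```c++
#include <mpi.h>

// Placeholder tester implementing the table above: each Step is a no-op
// unless the `running` flag allows it to act.
class StabilityTester
{
  public:
    explicit StabilityTester(MPI_Comm comm) : comm(comm) {}

    void PreSend()
    {
      if (!running)
        localStability = CheckLocalStability() ? 1 : 0;
    }

    void Send()
    {
      if (!running)
      {
        // Start the non-blocking reduction and mark it as in flight.
        // MPI_MIN means one unstable rank (0) makes the global answer 0.
        MPI_Iallreduce(&localStability, &globalStability, 1, MPI_INT,
                       MPI_MIN, comm, &request);
        running = true;
      }
    }

    void Wait()
    {
      if (running)
      {
        // Poll rather than block, so the main loop keeps making progress.
        int done = 0;
        MPI_Test(&request, &done, MPI_STATUS_IGNORE);
        if (done)
          running = false;
      }
    }

    void PostReceive()
    {
      if (!running)
        SetState(globalStability);
    }

  private:
    bool CheckLocalStability() { return true; }           // placeholder for the real check
    void SetState(int stability) { state = stability; }   // placeholder publish

    MPI_Comm comm;
    MPI_Request request = MPI_REQUEST_NULL;
    int localStability = 1;
    int globalStability = 1;
    int state = 1;
    bool running = false;
};
```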
The `SteeringComponent` is trickier, because ideally it should only execute on demand, but I don't really know how MPI feels about having all the ranks except the steering one waiting for the collective to run. I'm going to talk to one of the EPiGRAM folks about this. My current position is that I'll refactor the current `CollectiveAction` into a waiting and a testing version, along the lines of the sketch below.
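A rough sketch of what I mean by the split; the names are illustrative, not the existing `CollectiveAction` interface:

```c++
#include <mpi.h>

// The shared base owns the outstanding request; subclasses differ only in
// how Wait() completes it.
class CollectiveActionBase
{
  public:
    virtual ~CollectiveActionBase() = default;
    // Returns true once the collective has completed.
    virtual bool Wait() = 0;

  protected:
    MPI_Request request = MPI_REQUEST_NULL;
};

// Blocking version: suitable for ranks with nothing else to do.
class WaitingCollectiveAction : public CollectiveActionBase
{
  public:
    bool Wait() override
    {
      MPI_Wait(&request, MPI_STATUS_IGNORE);
      return true;
    }
};

// Polling version: tests once per call so the caller can keep working.
class TestingCollectiveAction : public CollectiveActionBase
{
  public:
    bool Wait() override
    {
      int done = 0;
      MPI_Test(&request, &done, MPI_STATUS_IGNORE);
      return done != 0;
    }
};
```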
As discussed on Slack, point 4 is to be resolved by removing the vis, and hence the need for the tangle of horrors that is the phased broadcast.
With these decisions, this ticket can be closed.
I was just about to submit a PR, but I noticed that there's an odd performance issue: the diffTest simulation takes a few seconds to run, as normal, but over 20 s to shut down. There's clearly something odd going on. I will investigate next week.
The work I did as part of CRESTA to use MPI 3.0 NBCs instead of the PhasedBroadcast machinery should be reexamined and merged into master.
Things to do before merge:
@mobernabeu: I would appreciate your opinion on point 3.
@mdavezac / @garymacindoe: I would appreciate your comments on any problems this may cause for the RBC branch merge.