timothygrant80 / cisTEM


[Fix] Increase MaxJobWaitTime for match_template #430

Closed · jojoelfe closed this 2 years ago

jojoelfe commented 2 years ago

Description

I have been trying to run match_template through the GUI on 96 GPUs. This very quickly results in many workers disconnecting because they have not received a job. I poked around a bit, and it seems the main_thread in the master becomes bogged down with combining the partial results (which all come in around the same time), so it does not send out new jobs fast enough. This change partially fixes that, though not in the most satisfying way.
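
To illustrate the failure mode, here is a minimal sketch (not cisTEM code; the names are made up) of a worker giving up if the master does not hand it a job within MaxJobWaitTime, which is what the larger value in this PR avoids while the master is busy combining results:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

std::mutex              job_mutex;
std::condition_variable job_arrived;
bool                    have_job = false;

// Returns false if no job arrives within max_job_wait_time, at which point
// the worker gives up and disconnects from the master.
bool WaitForFirstJob(std::chrono::seconds max_job_wait_time) {
    std::unique_lock<std::mutex> lock(job_mutex);
    return job_arrived.wait_for(lock, max_job_wait_time, [] { return have_job; });
}
```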

After this change, 96 GPUs still do not work (but 64 do). Workers now disconnect because they complain that the master thread did not read all bytes of their DefinedResult. This might also be related to the master having to deal with too many results at a time. There might be another timeout we can tweak, but at that point it seems like we should think about better ways to do this. Maybe there could be a pyramid of leaders and workers (for example, every 10 workers report to an intermediate leader, which then reports to the master).
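
As a rough illustration of that pyramid idea (purely hypothetical, none of these types exist in cisTEM, and it assumes the partial results are combined by a per-pixel max as for a maximum-intensity projection), each intermediate leader would merge its group's results locally and forward a single message upstream:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Partial result from one worker; the real payload is richer than this.
struct PartialResult {
    std::vector<float> mip;
};

// An intermediate leader merges the results of its ~10 workers locally...
PartialResult MergeGroup(const std::vector<PartialResult>& group) {
    PartialResult merged = group.front();
    for (std::size_t r = 1; r < group.size(); ++r)
        for (std::size_t i = 0; i < merged.mip.size(); ++i)
            merged.mip[i] = std::max(merged.mip[i], group[r].mip[i]);
    return merged;
}

// ...and forwards only the merged result upstream, so the master handles
// roughly N/10 messages instead of N individual results arriving at once.
```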

I have rebased my feature branch to be current with the master branch to minimize conflicts and headaches

Which compilers were tested

These changes are isolated to the

How has the functionality been tested?

Please describe the tests that you ran to verify your changes. Please also note any relevant details for your test configuration.

Checklist:

jojoelfe commented 2 years ago

Mhm, I profiled that function and it only takes about 3s for 64 GPUs. Not great, but not responsible for the slowdown.

I think the main slowdown is the master thread having to read all that data coming from the workers. I'll profile that part a bit more.
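
One generic way to time just that receive path, independent of cisTEM's StopWatch class (the function name in the usage comment is hypothetical):

```cpp
#include <chrono>
#include <cstdio>

// Prints how long the enclosing scope took when the timer goes out of scope.
struct ScopedTimer {
    explicit ScopedTimer(const char* label)
        : label_(label), start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                            std::chrono::steady_clock::now() - start_)
                            .count();
        std::printf("%s: %lld ms\n", label_, static_cast<long long>(ms));
    }

    const char*                           label_;
    std::chrono::steady_clock::time_point start_;
};

// Usage (hypothetical function; drop the timer at the top of the receive path):
// void MasterPanel::ReadWorkerResult() {
//     ScopedTimer timer("read worker result");
//     ...
// }
```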

bHimes commented 2 years ago

> Mhm, I profiled that function and it only takes about 3s for 64 GPUs. Not great, but not responsible for the slowdown.
>
> I think the main slowdown is the master thread having to read all that data coming from the workers. I'll profile that part a bit more.

I don't think the network should be rate-limiting if set up appropriately. If it is, then I guess using the half-precision outputs would give you 2x, and an overload that does not deal with pixel_size would grab another ~1.1x.

What are you using to profile? The StopWatch class?

bHimes commented 2 years ago

@jojoelfe as an alternative: making a cistem::main_thread_timeout:: namespace that defines the timeout for the various panels would be an okay workaround for the time being, with a FIXME on the one in this case that is unusually long.
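
Something like the following sketch (names and values are illustrative, not from the cisTEM source):

```cpp
namespace cistem {
namespace main_thread_timeout {

// Seconds a worker waits on the main thread before giving up (placeholder values).
constexpr int default_panel  = 30;

// FIXME: unusually long because the master is busy combining partial results
// when many GPUs finish at nearly the same time; revisit once result handling
// is moved off the main thread.
constexpr int match_template = 600;

} // namespace main_thread_timeout
} // namespace cistem
```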

jojoelfe commented 2 years ago

I'll merge this for now. I think I have an idea of how to make sure the master does not block while receiving data, but it will take some more testing.
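
One possible shape of that idea, purely as a sketch (none of this is existing cisTEM code): the socket handler only moves the received bytes into a queue, and a separate thread does the expensive combining, so job dispatch on the main thread never waits on result processing.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

// Thread-safe hand-off between the master's socket handler and a combiner thread.
class ResultQueue {
  public:
    void Push(std::vector<char> bytes) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push(std::move(bytes));
        }
        cv_.notify_one();
    }

    std::vector<char> Pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !pending_.empty(); });
        std::vector<char> bytes = std::move(pending_.front());
        pending_.pop();
        return bytes;
    }

  private:
    std::mutex                    mutex_;
    std::condition_variable       cv_;
    std::queue<std::vector<char>> pending_;
};

// Master's socket handler:  queue.Push(bytes_from_socket);       // cheap, returns immediately
// Combiner thread loop:     CombinePartialResult(queue.Pop());   // expensive work off the main thread
```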