Mhm, I profiled that function and it only takes about 3s for 64 GPUs. Not great, but not responsible for the slowdown.
I think the main slowdown is the master thread having to read all that data coming from the workers. I'll profile that part a bit more.
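A minimal sketch (not cisTEM's actual profiling code) of how the receive loop on the master thread could be timed to confirm where the time goes; `TimeResultCollection`, `ReceiveResultFromWorker`, and `n_workers` are hypothetical stand-ins for whatever the real code uses.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical receive loop on the master thread, timed per worker and overall.
void TimeResultCollection(int n_workers) {
    using clock = std::chrono::steady_clock;

    auto total_start = clock::now();
    for (int worker = 0; worker < n_workers; worker++) {
        auto start = clock::now();
        // ReceiveResultFromWorker(worker); // placeholder for the real socket read
        auto elapsed = std::chrono::duration<double>(clock::now() - start).count();
        std::printf("worker %d: %.3f s\n", worker, elapsed);
    }
    auto total = std::chrono::duration<double>(clock::now() - total_start).count();
    std::printf("total receive time: %.3f s\n", total);
}
```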
I don't think the network should be rate-limiting if it's set up appropriately. If it is, then I guess using the half-precision outputs would give you 2x, and an overload that doesn't deal with pixel_size would grab another ~1.1x.
What are you using to profile? The StopWatch class?
@jojoelfe as an alternative: making a cistem::main_thread_timeout:: namespace that defines the timeout for the various panels would be an okay workaround for the time being, with a FIXME on the one that needs to be long in this case.
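A rough sketch of that workaround; the constant names and values below are illustrative placeholders, not taken from the cisTEM codebase.

```cpp
// Sketch only: per-panel timeouts collected in one place so the long one is
// easy to find and revisit. Names and values are hypothetical.
namespace cistem {
namespace main_thread_timeout {

// Per-panel job/socket timeouts, in seconds.
constexpr int refine_3d      = 30;
constexpr int classify_2d    = 30;
// FIXME: match_template needs a much longer timeout because the master
// thread is busy combining partial results when many workers finish at once.
constexpr int match_template = 600;

} // namespace main_thread_timeout
} // namespace cistem
```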
I'll merge this for now. I think I have an idea of how to make sure the master is not blocking while receiving data, but it will take some more tests.
Description
I have been trying to run match_template through the GUI on 96 GPUs. This very quickly results in many workers disconnecting because they have not received a job. I poked around a bit, and it seems like the main_thread in the master becomes bogged down with combining the partial results (which all come in around the same time), so it does not send out new jobs fast enough. This PR partially fixes that, though not in the most satisfying way.
After this change, 96 GPUs still do not work (but 64 do). Now workers disconnect because they complain that the master thread did not read all bytes of their DefinedResult. This might also be related to the master having to deal with too many results at a time. There might be another timeout we can tweak, but at that point it seems like we should think about better ways to do this. Maybe there could be a pyramid of leaders and workers (for example, every 10 workers have an intermediate leader, which then reports to the master); see the sketch below.
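One way to picture the pyramid idea, purely as an illustration and not an implementation for cisTEM: workers report to an intermediate leader, and only the leaders talk to the master, so the master has to read and combine roughly N/10 result streams instead of N. The `PartialResult` struct and the combination step are dummies.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Illustrative only: a two-level reduction where each group of 10 workers
// sends its partial result to an intermediate leader, and the master only
// combines the leaders' pre-combined results.
struct PartialResult {
    double sum = 0.0; // stand-in for whatever match_template accumulates
};

PartialResult CombineGroup(const std::vector<PartialResult>& group) {
    PartialResult combined;
    for (const auto& r : group)
        combined.sum += r.sum;
    return combined;
}

int main() {
    const int n_workers         = 96;
    const int workers_per_lead  = 10;

    // Each worker produces a partial result (dummy values here).
    std::vector<PartialResult> worker_results(n_workers);
    for (int i = 0; i < n_workers; i++)
        worker_results[i].sum = 1.0;

    // Intermediate leaders each combine their group of workers.
    std::vector<PartialResult> leader_results;
    for (int start = 0; start < n_workers; start += workers_per_lead) {
        int end = std::min(start + workers_per_lead, n_workers);
        std::vector<PartialResult> group(worker_results.begin() + start,
                                         worker_results.begin() + end);
        leader_results.push_back(CombineGroup(group));
    }

    // The master now only has to read ~n_workers/10 streams.
    PartialResult final_result = CombineGroup(leader_results);
    std::printf("combined %zu leader results, sum = %.1f\n",
                leader_results.size(), final_result.sum);
    return 0;
}
```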
I have rebased my feature branch to be current with the master branch to minimize conflicts and headaches
Which compilers were tested
These changes are isolated to the
How has the functionality been tested?
Please describe the tests that you ran to verify your changes. Please also note any relevant details for your test configuration.
Checklist: