j-woz / exm-issues

Automatically exported from code.google.com/p/exm-issues

Congestion management for ADLB servers #667

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Recent benchmarking has revealed that ADLB servers can suffer congestion 
collapse in certain circumstances.  I don't fully understand the exact causes, 
but it appears to be a combination of busy servers building up full work 
queues and the deadlock-avoidance algorithm then creating long chains of 
servers waiting on each other.

Some ideas:
- Non-blocking stealing probes (i.e. PROBE -> RESPONSE -> CONFIRM -> RECEIVE 
WORK rather than PROBE -> RECEIVE WORK) to avoid delays spreading too much
- Preallocating sync buffers and using ADLB_IRecv to prevent syncs queuing up 
and blocking other servers
- Adaptive algorithms that shift load away from congested servers (by 
putting data in different places) or that avoid sending work-stealing probes 
to congested servers.
- Work-stealing algorithms that use fewer probes, such as lifelines.
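
To make the first idea concrete, here is a minimal single-process sketch of the PROBE -> RESPONSE -> CONFIRM -> RECEIVE WORK exchange. It is only an illustration, not ADLB code: the `Server` class, the message tags, the `send`/`handle_one` helpers and the "offer half the queue" policy are all assumptions made for the example, and in-memory inboxes stand in for MPI messages. The point it tries to show is that each step handles one message and returns, so a server with an outstanding probe can keep serving other requests instead of blocking.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical stand-in for an ADLB server (not the real data structure)."""
    rank: int
    work: deque = field(default_factory=deque)
    inbox: deque = field(default_factory=deque)  # queued (tag, src, payload) messages
    probing: bool = False  # True while a steal probe is outstanding


def send(servers, dst, tag, src, payload=None):
    """Queue a message for dst; stands in for a non-blocking send."""
    servers[dst].inbox.append((tag, src, payload))


def start_steal(servers, thief, victim):
    """Send a probe and return immediately rather than waiting for work."""
    servers[thief].probing = True
    send(servers, victim, "PROBE", thief)


def handle_one(servers, rank):
    """Process at most one queued message; never waits on another server."""
    me = servers[rank]
    if not me.inbox:
        return
    tag, src, payload = me.inbox.popleft()
    if tag == "PROBE":
        # Assumed policy: offer half of the current queue.
        send(servers, src, "RESPONSE", rank, len(me.work) // 2)
    elif tag == "RESPONSE":
        if payload > 0:
            send(servers, src, "CONFIRM", rank, payload)
        else:
            me.probing = False  # nothing on offer; free to probe elsewhere
    elif tag == "CONFIRM":
        # The queue may have shrunk since RESPONSE, so re-check the count.
        count = min(payload, len(me.work))
        stolen = [me.work.pop() for _ in range(count)]
        send(servers, src, "WORK", rank, stolen)
    elif tag == "WORK":
        me.work.extend(payload)
        me.probing = False


def run_until_idle(servers):
    """Round-robin over servers until no messages remain."""
    while any(s.inbox for s in servers):
        for s in servers:
            handle_one(servers, s.rank)


servers = [Server(0), Server(1, work=deque(range(8)))]
start_steal(servers, thief=0, victim=1)
run_until_idle(servers)
```

The extra round trip costs latency on each successful steal, but the victim only hands over work after CONFIRM, so a busy victim can delay or decline without stalling the thief.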

Original issue reported on code.google.com by tim.g.ar...@gmail.com on 19 Apr 2014 at 3:24

GoogleCodeExporter commented 9 years ago
I've been working on some of the lower-hanging fruit here on the 
issue-586-engine branch: non-blocking steal probes and preallocated sync 
buffers are implemented, and I've also been looking at adding a cache for 
closed variables to reduce the number of subscribes.
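
As a rough illustration of the closed-variable cache idea, the sketch below keeps a bounded LRU set of variable ids already known to be closed and skips the remote subscribe on a hit. It is not taken from the branch: `ClosedCache`, `subscribe` and the `remote_subscribe` callback are hypothetical names, and the sketch assumes that a variable never reopens once closed, so a cached "closed" answer cannot go stale.

```python
from collections import OrderedDict


class ClosedCache:
    """Bounded LRU set of variable ids known to be closed (hypothetical)."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._ids = OrderedDict()

    def mark_closed(self, var_id):
        self._ids[var_id] = True
        self._ids.move_to_end(var_id)
        if len(self._ids) > self.capacity:
            self._ids.popitem(last=False)  # evict the least recently used id

    def is_closed(self, var_id):
        if var_id in self._ids:
            self._ids.move_to_end(var_id)  # refresh recency on a hit
            return True
        return False


def subscribe(cache, var_id, remote_subscribe):
    """Return True if var_id is closed, contacting the server only on a miss.

    remote_subscribe is a stand-in for the real subscribe request and is
    assumed to return True when the variable is already closed.
    """
    if cache.is_closed(var_id):
        return True
    closed = remote_subscribe(var_id)
    if closed:
        cache.mark_closed(var_id)
    return closed


calls = []


def fake_remote(var_id):
    """Test double that records each remote request."""
    calls.append(var_id)
    return True


cache = ClosedCache(capacity=2)
first = subscribe(cache, 1, fake_remote)
second = subscribe(cache, 1, fake_remote)  # served from the cache
```

Only positive results are cached here; an open variable still needs a real subscribe so the caller is notified when it closes.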

Original comment by tim.g.ar...@gmail.com on 2 May 2014 at 6:10

GoogleCodeExporter commented 9 years ago
Issue 666 has been merged into this issue.

Original comment by tim.g.ar...@gmail.com on 7 May 2014 at 10:31