joapolarbear / dl_notes

Optimizing Large-scale Deep Learning by Minimizing Resource Contention for Data Processing #32

Open joapolarbear opened 3 years ago

joapolarbear commented 3 years ago

Contains a good explanation of the Horovod workflow.
Addresses the problem that Horovod's background threads must synchronize with each other to check which tensors are ready, which can be time-consuming.
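
To make the readiness check concrete, here is a minimal toy model, not Horovod's actual code. It assumes each rank reports the set of tensor names it has finished computing, and that a tensor can only be reduced once every rank has reported it. The function name `globally_ready` and the tensor names are hypothetical.

```python
from typing import Dict, List, Set


def globally_ready(ready_by_rank: Dict[int, Set[str]]) -> List[str]:
    """Return the tensors that every rank has reported as ready.

    Toy model (assumption): a tensor may be reduced only when all
    ranks have it, so the answer is the intersection of the reports.
    """
    if not ready_by_rank:
        return []
    common = set.intersection(*ready_by_rank.values())
    # Sort so that every rank would derive the same order.
    return sorted(common)


# Hypothetical reports from three ranks in one coordination round.
reports = {
    0: {"grad_a", "grad_b"},
    1: {"grad_b"},
    2: {"grad_b", "grad_c"},
}
print(globally_ready(reports))
```

Every round of this check needs input from all ranks, which is why it can become a bottleneck as the number of workers grows.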

Solution

Use a global sleep time instead of a local cycle time to avoid oversleeping
Nonblocking Cache Synchronization
Static CPU Resource Partitioning
Graph Topology Exploitation, to ensure the tensor order.
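
One possible reading of the first item, sketched under assumptions of my own rather than taken from the paper: with a local cycle time, each worker sleeps a fixed interval after it finishes its own work, so workers that finish at different times also wake at different times. With a global schedule, each worker sleeps only until the next shared tick. The helper names and the millisecond values are hypothetical.

```python
def local_wake_time(now_ms: int, cycle_ms: int) -> int:
    """Local cycle (assumed behavior): sleep a full cycle from 'now'."""
    return now_ms + cycle_ms


def global_wake_time(now_ms: int, epoch_ms: int, cycle_ms: int) -> int:
    """Global schedule (assumed behavior): sleep until the next tick
    of a clock that all workers share, strictly after 'now'."""
    sleep_ms = cycle_ms - (now_ms - epoch_ms) % cycle_ms
    return now_ms + sleep_ms


CYCLE_MS = 5
EPOCH_MS = 0
finish_times = [12, 14]  # two workers finish their work at different times

local = [local_wake_time(t, CYCLE_MS) for t in finish_times]
shared = [global_wake_time(t, EPOCH_MS, CYCLE_MS) for t in finish_times]
print(local)   # wake times drift apart
print(shared)  # wake times coincide
```

In this toy setup the worker that finished first no longer waits past the point where the other worker is ready, which is the kind of oversleep the note refers to.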