sql-machine-learning / elasticdl

Kubernetes-native Deep Learning Framework
https://elasticdl.org
MIT License

a little question about embedding parameter synchronization #2519

Closed jasperzhong closed 3 years ago

jasperzhong commented 3 years ago

Great work! The PS design doc says embedding parameters are replicated across servers and that synchronization is needed for consistency. I have some concerns about the synchronization overhead, since embedding layers are usually huge. For example, Facebook has mentioned that their production embedding tables can be terabytes in size (paper link).
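
Since the concern turns on scale, here is a rough back-of-envelope sketch of why replicating such a table across PS instances is costly. All of the numbers (vocabulary size, embedding width, replica count, touched fraction) are illustrative assumptions, not figures from ElasticDL or the Facebook paper:

```python
# Back-of-envelope estimate of embedding table size and the traffic
# incurred by replicating it across PS instances. All numbers below
# are illustrative assumptions, not measurements from any system.

vocab_size = 1_000_000_000   # 1B ids, plausible for large-scale models
embedding_dim = 64           # per-id vector width
bytes_per_float = 4          # float32

table_bytes = vocab_size * embedding_dim * bytes_per_float
print(f"table size: {table_bytes / 2**40:.2f} TiB")  # ~0.23 TiB

# Full replication across N servers means each update must eventually
# reach N-1 replicas; even syncing 1% of the rows per step moves:
num_replicas = 4
touched_fraction = 0.01
sync_bytes = table_bytes * touched_fraction * (num_replicas - 1)
print(f"per-step sync traffic: {sync_bytes / 2**30:.2f} GiB")  # ~7.15 GiB
```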

workingloong commented 3 years ago

> Great work! The PS design doc says embedding parameters are replicated across servers and that synchronization is needed for consistency. I have some concerns about the synchronization overhead, since embedding layers are usually huge. For example, Facebook has mentioned that their production embedding tables can be terabytes in size (paper link).

Yes, there is synchronization overhead for huge embedding layers. To improve efficiency, we can relax the staleness requirement on gradient updates. For example, the PS can drop a gradient whose model version is too stale.
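
As a minimal sketch of that idea, the PS could compare the worker's model version against its own and discard gradients whose lag exceeds a staleness bound. The `EmbeddingPS` class, the integer version counter, and the bound value are hypothetical illustrations, not ElasticDL's actual API:

```python
# Minimal sketch of staleness-aware gradient filtering on the PS side.
# The class, version bookkeeping, and threshold are hypothetical and
# only illustrate the idea described above, not ElasticDL's real code.
from dataclasses import dataclass, field

@dataclass
class EmbeddingPS:
    staleness_bound: int = 8                   # max tolerated version lag
    model_version: int = 0
    table: dict = field(default_factory=dict)  # id -> embedding vector
    lr: float = 0.1

    def push_gradient(self, grads: dict, worker_version: int) -> bool:
        """Apply sparse embedding gradients unless they are too stale."""
        staleness = self.model_version - worker_version
        if staleness > self.staleness_bound:
            return False                       # drop the stale gradient
        for idx, grad in grads.items():
            vec = self.table.setdefault(idx, [0.0] * len(grad))
            for d, g in enumerate(grad):
                vec[d] -= self.lr * g          # plain SGD update
        self.model_version += 1
        return True

ps = EmbeddingPS()
accepted = ps.push_gradient({42: [0.5, -0.1]}, worker_version=0)
print(accepted, ps.model_version)  # True 1
```

One design note: dropping stale gradients trades a little statistical efficiency for much less coordination, since the PS never blocks waiting for slow workers; the bound controls how asynchronous the system is allowed to get.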