a little question about embedding parameters synchronization

sql-machine-learning / elasticdl

Kubernetes-native Deep Learning Framework

MIT License

733 stars 113 forks

Great work! The PS (parameter server) design doc says that embedding parameters are replicated across servers and that synchronization is needed to keep the replicas consistent. I have some concerns about the synchronization overhead, since embedding layers are usually huge. For example, Facebook mentioned that their production embedding tables may be terabytes in size (paper link).

Yes, synchronizing huge embedding layers does add overhead. To improve efficiency, we can adjust how much staleness is tolerated in gradient updates. For example, the PS can drop a gradient whose version is stale.
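To make the staleness idea concrete, here is a minimal sketch of a PS shard that drops gradients whose version lags too far behind its own. This is not ElasticDL's actual API: `EmbeddingShard`, `pull`, `push`, and `max_staleness` are hypothetical names chosen for illustration, and plain SGD on sparse rows is assumed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class EmbeddingShard:
    """Hypothetical PS shard holding part of an embedding table."""
    dim: int
    lr: float = 0.1
    max_staleness: int = 2  # assumed knob: how many versions a gradient may lag
    version: int = 0
    rows: Dict[int, List[float]] = field(default_factory=dict)

    def pull(self, ids: List[int]) -> Tuple[int, Dict[int, List[float]]]:
        # Return the current version together with copies of the requested rows.
        vectors = {i: list(self.rows.setdefault(i, [0.0] * self.dim)) for i in ids}
        return self.version, vectors

    def push(self, grad_version: int, sparse_grads: Dict[int, List[float]]) -> bool:
        # Drop the gradient if it was computed against parameters that are too old.
        if self.version - grad_version > self.max_staleness:
            return False
        # Only the rows touched by this batch are updated (sparse SGD step).
        for row_id, grad in sparse_grads.items():
            row = self.rows.setdefault(row_id, [0.0] * self.dim)
            for k in range(self.dim):
                row[k] -= self.lr * grad[k]
        self.version += 1
        return True


shard = EmbeddingShard(dim=2, max_staleness=1)
v0, _ = shard.pull([7])                       # worker reads at version 0
first = shard.push(v0, {7: [1.0, 1.0]})       # staleness 0: applied
second = shard.push(v0, {7: [1.0, 1.0]})      # staleness 1: still applied
third = shard.push(v0, {7: [1.0, 1.0]})       # staleness 2 > 1: dropped
```

A larger `max_staleness` lets more gradients through without waiting for workers to re-pull, at the cost of applying updates computed from older parameters. A value of 0 accepts only gradients computed against the current version.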

a little question about embedding parameters synchronization #2519