This is a Spark Summit EU 2016 talk by Nick Pentreath from IBM.

Slides and project
Notes
Distributed machine learning algorithms are usually implemented in Spark MLlib this way:

Driver broadcasts the current weights to tasks distributed across machines
Tasks update their local copies of the weights on a batch of data
Driver gathers and aggregates (treeAggregate) all the weights from the tasks
Loop (see the sketch below)

The driver can easily become the bottleneck and be taken down by a big model. That's where parameter servers come in. Despite early investigations and prototypes from the community, Spark does not yet support parameter servers, or even long-running services.
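A minimal sketch of one iteration of this loop, assuming a plain linear model with squared error (the gradient, learning rate, and function name are illustrative, not the talk's code):

```scala
import org.apache.spark.rdd.RDD

// One iteration of the typical MLlib-style loop: broadcast the full model,
// compute gradients on each task's data, treeAggregate back to the driver.
def step(data: RDD[(Double, Array[Double])], weights: Array[Double]): Array[Double] = {
  val sc = data.sparkContext
  val bcWeights = sc.broadcast(weights) // 1. driver -> all tasks
  val dim = weights.length

  // 2 + 3. per-partition gradient sums, merged tree-wise so the driver
  // combines a few partial results instead of one per partition
  val (gradSum, count) = data.treeAggregate((new Array[Double](dim), 0L))(
    seqOp = { case ((grad, n), (label, features)) =>
      val w = bcWeights.value
      var pred = 0.0
      var i = 0
      while (i < dim) { pred += w(i) * features(i); i += 1 }
      val err = pred - label // illustrative squared-error gradient
      i = 0
      while (i < dim) { grad(i) += err * features(i); i += 1 }
      (grad, n + 1)
    },
    combOp = { case ((g1, n1), (g2, n2)) =>
      var i = 0
      while (i < dim) { g1(i) += g2(i); i += 1 }
      (g1, n1 + n2)
    }
  )

  bcWeights.destroy()
  val lr = 0.1 // illustrative learning rate
  Array.tabulate(dim)(i => weights(i) - lr * gradSum(i) / count) // 4. loop
}
```

Note that the whole model travels driver -> executors (broadcast) and executors -> driver (treeAggregate) on every iteration, which is exactly what stops scaling once the model gets large.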
Unlike MLlib, this FM implementation is built with Glint, a high-performance parameter server built on Akka.
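With a parameter server, the model lives on dedicated server nodes and workers pull/push only the keys their current batch touches. A minimal sketch of that interaction, based on Glint's client API as its README presents it (Client(), client.vector[Double], asynchronous pull/push); the exact signatures are assumptions here:

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import glint.Client

// Connect to a running Glint master (default configuration assumed).
val client = Client()

// A large model vector partitioned across the parameter servers;
// it no longer has to fit on the driver.
val weights = client.vector[Double](100000000L)

// Each worker touches only the keys its mini-batch needs. Both calls
// are asynchronous and return Futures.
val keys = Array(0L, 42L, 99L)
weights.pull(keys).foreach { values =>
  val deltas = values.map(v => -0.1 * v) // illustrative update, not FM's gradient
  weights.push(keys, deltas)             // add deltas to the server-side values
}
```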
Although it has shown better performance and scalability than the MLlib-style implementation, there are remaining challenges (Akka frame size, backpressure), and future improvements can be drawn from DMLC's ps-lite and difacto.
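On the Akka frame size point: Akka remoting rejects messages larger than its configured maximum frame size, so big pulls/pushes must be chunked or the cap raised. A hypothetical mitigation (the setting name is from Akka 2.x netty.tcp remoting; the value and the wiring into Glint are assumptions):

```scala
import com.typesafe.config.ConfigFactory

// Raise the remoting frame-size cap so larger pull/push responses fit in
// a single message; 10 MiB is illustrative. Chunking requests with
// backpressure is the more robust long-term fix.
val config = ConfigFactory.parseString(
  "akka.remote.netty.tcp.maximum-frame-size = 10 MiB")
// Pass this config to the ActorSystem/Glint client at startup (hypothetical
// wiring; the exact entry point depends on Glint's setup).
```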