lyst / lightfm

A Python implementation of LightFM, a hybrid recommendation algorithm.
Apache License 2.0

Distributed training #503

Open PeterZaidel opened 4 years ago

PeterZaidel commented 4 years ago

Hello! I have a huge dataset that cannot fit on a single machine, and the data has many more users than items. I'm now thinking about training LightFM on a cluster. How can I do that? Can I train LightFM on each node using only the interactions for a subset of users, and then update the vector representations as the mean of the nodes' vectors? Is there a better way to do this?

maciejkula commented 4 years ago

How huge is your dataset? < 100M observations should be very doable on a single machine.

PeterZaidel commented 4 years ago

Thanks for the reply! My dataset has more than 10 billion observations, but the number of items is much smaller than the number of users: nearly 100M users and 7M items. That means all of the item vectors can be stored on one machine (a node in the cluster). I'm now trying to train LightFM with map-reduce and compare it with PySpark's implicit ALS. On a small portion of the data, LightFM gives better metrics and predictions.
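
(For reference, the PySpark implicit-ALS baseline can be set up roughly as below; the column names, input path, and hyperparameters are assumptions, not values from this thread.)

```python
# Minimal sketch of an implicit-feedback ALS baseline in Spark ML.
# "user_id", "item_id", "count" and the hyperparameters are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("implicit-als-baseline").getOrCreate()

# Expects a DataFrame of (user_id, item_id, count) interaction rows.
interactions = spark.read.parquet("hdfs:///path/to/interactions")  # assumed path

als = ALS(
    userCol="user_id",
    itemCol="item_id",
    ratingCol="count",
    implicitPrefs=True,  # the implicit-feedback ALS variant
    rank=64,
    regParam=0.01,
    alpha=40.0,
    maxIter=10,
)
model = als.fit(interactions)
user_factors = model.userFactors  # DataFrame with columns: id, features
item_factors = model.itemFactors
```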

maciejkula commented 4 years ago

That is quite a bit of data. I would recommend:

  1. Not training on all historical data, but selecting a sample of more recent data. This might be useful not only from an efficiency perspective, but also to account for changing tastes and item distribution.
  2. Training on a large machine. Assuming your data is sufficiently sparse, you may get good performance even with 64 threads (see the sketch below).
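
A minimal sketch of option 2, assuming `interactions` is a scipy.sparse COO matrix and using illustrative hyperparameters:

```python
# Single-machine, multi-threaded LightFM training.
# Component count, loss, epochs, and thread count are illustrative choices.
from lightfm import LightFM

model = LightFM(no_components=64, loss="warp")

# `interactions` is a (num_users, num_items) scipy.sparse COO matrix.
# LightFM's asynchronous SGD scales across threads when the data is sparse.
model.fit(
    interactions,
    epochs=30,
    num_threads=64,  # roughly one thread per core on a large machine
    verbose=True,
)
```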

How are you doing map-reduce with LightFM? I am not quite sure that is possible.

PeterZaidel commented 4 years ago

First I train LightFM on a sample of the data, split by users, and save the item and user vectors. Then I pass these as initial parameters to the map-reduce job. Every mapper receives the interactions for some users plus all of the item vectors, and runs one or two LightFM epochs. The user vectors are saved to HDFS. In the reduce stage I average all of the incoming item vectors and save them for the next round. I repeat this several times until the vectors stabilize, then save the final user and item vectors.
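
(Roughly, one map/reduce round of this scheme could look like the sketch below. Warm-starting by overwriting the embedding arrays is not an official LightFM API; it relies on the parameters being plain NumPy arrays that `fit_partial(..., epochs=0)` merely allocates, so treat this as an assumption-laden sketch rather than exact production code.)

```python
# Sketch of one map step and the reduce step of the scheme described above.
# The warm start via attribute assignment is an implementation-detail hack,
# not a documented LightFM API.
import numpy as np
from lightfm import LightFM


def map_step(shard_interactions, init_item_emb, init_item_bias, epochs=2):
    """Train on one user shard.

    `shard_interactions` is a scipy.sparse COO matrix covering the full item
    set but only this shard's users; `init_item_emb` has shape
    (n_items, no_components).
    """
    model = LightFM(no_components=init_item_emb.shape[1], loss="warp")

    # Allocate the parameter arrays without training, then overwrite the
    # item-side parameters with the shared initial values.
    model.fit_partial(shard_interactions, epochs=0)
    model.item_embeddings[:] = init_item_emb
    model.item_biases[:] = init_item_bias

    model.fit_partial(shard_interactions, epochs=epochs, num_threads=16)

    # User vectors are kept per shard; item vectors go to the reducer.
    return model.user_embeddings, model.item_embeddings, model.item_biases


def reduce_step(item_emb_list, item_bias_list):
    """Average the item vectors returned by all mappers."""
    return np.mean(item_emb_list, axis=0), np.mean(item_bias_list, axis=0)
```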

maciejkula commented 4 years ago

That sounds very sensible; it looks like the sparsity of your dataset makes it amenable to this form of distributed training.

ctivanovich commented 4 years ago

I'm also looking into a possible speed-up, perhaps by distributing operations. I have a much smaller dataset that includes some ~250 user features and item features (items here being other users). The input is some 8,000 users and items with some 10,000,000 interactions, and the runtime was around 13 hours on a fairly powerful VM. Should I be running into this kind of severe slowdown with only 150 epochs of training?
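
(For anyone debugging this kind of slowdown: LightFM's cost per interaction grows with the number of non-zero features per user/item row, so dense feature rows are much more expensive than plain identity features. A rough diagnostic is to check feature-matrix density and time a single epoch before committing to 150; the variable names below are placeholders from the usual `lightfm.data.Dataset` workflow.)

```python
# Diagnostic sketch: check feature density and extrapolate from one epoch.
# `interactions`, `user_features`, `item_features` are assumed to be the
# scipy.sparse matrices produced by lightfm.data.Dataset.
import time
from lightfm import LightFM

print("non-zeros per user row:", user_features.getnnz() / user_features.shape[0])
print("non-zeros per item row:", item_features.getnnz() / item_features.shape[0])

model = LightFM(no_components=30, loss="warp")
start = time.perf_counter()
model.fit_partial(
    interactions,
    user_features=user_features,
    item_features=item_features,
    epochs=1,
    num_threads=8,  # illustrative thread count
)
print("seconds per epoch:", time.perf_counter() - start)
```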