PeterZaidel opened this issue 4 years ago
How huge is your dataset? < 100M observations should be very doable on a single machine.
Thanks for the reply! My dataset has more than 10 billion observations, but the number of items is much smaller than the number of users: nearly 100M users and 7M items. That means all item vectors can be stored on one machine (a node in the cluster). Right now I'm trying to train LightFM with map-reduce and compare it against PySpark's implicit ALS. On a small portion of the data, LightFM gives better metrics and predictions.
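For context, the PySpark baseline I'm comparing against is essentially the standard implicit-feedback ALS setup; a minimal sketch (the column names, HDFS path, and hyperparameters here are placeholders rather than my exact configuration):

```python
# Minimal PySpark implicit-feedback ALS baseline sketch.
# Column names, the HDFS path, and hyperparameters are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("ials-baseline").getOrCreate()

interactions = spark.read.parquet("hdfs:///path/to/interactions")

als = ALS(
    userCol="user_id",
    itemCol="item_id",
    ratingCol="weight",
    implicitPrefs=True,       # the implicit-feedback (I-ALS) variant
    rank=64,
    regParam=0.01,
    alpha=10.0,
    maxIter=10,
    coldStartStrategy="drop",
)
model = als.fit(interactions)
```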
That is quite a bit of data. I would recommend:
How are you doing map-reduce with LightFM? I am not quite sure that is possible.
First I train LightFM on a sample of the data, split by users, and save the item and user vectors. Then I use them as initial parameters for the map-reduce job. Every mapper receives the interactions for a subset of users plus all item vectors and performs one or two epochs of LightFM; the updated user vectors are saved to HDFS. In the reduce stage I average all incoming item vectors and save them for the next iteration. I repeat this several times until the vectors stabilize, then save the final user and item vectors.
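In code, one iteration looks roughly like this (a minimal sketch, assuming identity user/item features so that the rows of LightFM's `user_embeddings` / `item_embeddings` map directly to users and items; the function names, shard handling, I/O, and hyperparameters are placeholders):

```python
# Sketch of one map step and the reduce step of the scheme described above.
# Assumes identity features, so LightFM's user_embeddings / item_embeddings
# rows correspond directly to users / items. Shard handling, I/O, and
# hyperparameters are hypothetical placeholders.
import numpy as np
from lightfm import LightFM

NO_COMPONENTS = 64


def map_step(shard_interactions, global_item_embeddings, global_item_biases):
    """Run a couple of LightFM epochs on one shard of users,
    warm-started from the globally averaged item vectors."""
    model = LightFM(no_components=NO_COMPONENTS, loss="warp")

    # epochs=0 initializes the parameter arrays without training
    # (if your LightFM version rejects this, a single throwaway epoch
    # before the overwrite works as well).
    model.fit_partial(shard_interactions, epochs=0)

    # Warm-start the item side from the broadcast global vectors.
    model.item_embeddings[:] = global_item_embeddings
    model.item_biases[:] = global_item_biases

    # One or two passes over this shard's interactions.
    model.fit_partial(shard_interactions, epochs=2, num_threads=4)

    # User vectors for this shard go straight to HDFS; the locally
    # updated item vectors are emitted for the reducer to average.
    return (model.user_embeddings, model.user_biases,
            model.item_embeddings, model.item_biases)


def reduce_step(item_embeddings_per_shard, item_biases_per_shard):
    """Average the item vectors coming back from all mappers to get the
    global item vectors for the next iteration."""
    return (np.mean(item_embeddings_per_shard, axis=0),
            np.mean(item_biases_per_shard, axis=0))
```

The user vectors never need to be averaged, since each user's interactions live in exactly one shard; only the item side is shared, which is why broadcasting the 7M item vectors to every mapper stays feasible.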
That sounds very sensible; it looks like the sparsity of your dataset makes it amenable to this form of distributed training.
I'm also looking into possible speed-ups that might involve distributing operations. I have a much smaller dataset with ~250 user features and item features (items here being other users). The input is some 8,000 users and items with about 10,000,000 interactions, and the runtime was around 13 hours on a fairly powerful VM. Should I be running into this kind of severe slowdown with only 150 epochs of training?
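To make the setup concrete, this is a simplified sketch of the kind of fit call I mean (random sparse matrices stand in for my real data, and the hyperparameters are illustrative, not my exact configuration):

```python
# Simplified sketch of the training call; random sparse matrices stand in
# for the real data: ~8,000 users/items, ~250 side features each,
# ~10M interactions.
import scipy.sparse as sp
from lightfm import LightFM

n_users = n_items = 8000
n_features = 250

interactions = sp.random(n_users, n_items,
                         density=10_000_000 / (n_users * n_items),
                         format="coo", random_state=0)
interactions.data[:] = 1.0  # binary implicit feedback

# Identity block plus side features, as LightFM's Dataset helper builds them.
user_features = sp.hstack([sp.identity(n_users),
                           sp.random(n_users, n_features, density=0.1,
                                     random_state=1)]).tocsr()
item_features = sp.hstack([sp.identity(n_items),
                           sp.random(n_items, n_features, density=0.1,
                                     random_state=2)]).tocsr()

model = LightFM(no_components=30, loss="warp")
model.fit(
    interactions,
    user_features=user_features,
    item_features=item_features,
    epochs=150,       # this is where the hours go: 150 passes over 10M interactions
    num_threads=8,    # LightFM can run its SGD across multiple threads
    verbose=True,
)
```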
Hello! I have a huge dataset which cannot fit on a single machine, and the data has many more users than items. I'm now thinking about training LightFM on a cluster. How can I do that? Can I train LightFM on each node using only the interactions for a subset of users, and then update the vector representations as the mean of the per-node vectors? Is there a better way to do it?