lyst / lightfm

A Python implementation of LightFM, a hybrid recommendation algorithm.
Apache License 2.0

How to use LightFM with sparse interaction data? #424

Open pugantsov opened 5 years ago

pugantsov commented 5 years ago

I am currently exploring recommender systems for a master's project at university. My dataset consists of tweets by users, along with brief user metadata (such as location) and item metadata (such as the tweet body, hashtags used, sector/industry information about a hashtag if present, timestamp, etc.).

My problem is that I'm finding it difficult to recommend items, because there is only ever a one-to-one interaction between a user and their authored post. The item features slightly increase scores on the training set, but I have no luck on the test set.

Should I maybe be looking at a different approach? I do have a rough profile for each user based on their post content, but how would I model that for recommendation? Does the sparsity of the interaction matrix matter? I'm quite new to all of this, so most of my issues come from how to frame my input data.

impaktor commented 5 years ago

There is a limit to how well any recommendation system can do, based on the data. If there are almost no interactions, and especially if each user has only interacted with a single item rather than several different ones, it will naturally be difficult to build a good recommender system.

I would try to add more features that allow the model to "link" users through what they have in common across user-item interactions, e.g. geography (city, country), interests/political affiliation, Twitter-verified status, etc. You could also embed the text of each user's profile description and use that as a feature.
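The suggestion above can be sketched with plain scipy/numpy: one-hot metadata columns concatenated with dense profile-text embedding columns into a single sparse matrix. The location values and embedding size here are illustrative assumptions, not part of the original data; LightFM accepts any scipy sparse matrix of shape (n_users, n_user_features) via `model.fit(..., user_features=...)`.

```python
import numpy as np
import scipy.sparse as sp

n_users = 4
locations = ["glasgow", "london", "glasgow", "newyork"]  # assumed example values
location_vocab = {"glasgow": 0, "london": 1, "newyork": 2}

# One-hot location block: one column per distinct location.
rows = np.arange(n_users)
cols = np.array([location_vocab[loc] for loc in locations])
one_hot = sp.csr_matrix(
    (np.ones(n_users), (rows, cols)), shape=(n_users, len(location_vocab))
)

# Dense profile-description embeddings (e.g. from word2vec), one row per user;
# random values stand in for a real text model here.
embedding_dim = 8
rng = np.random.default_rng(0)
profile_embeddings = sp.csr_matrix(rng.normal(size=(n_users, embedding_dim)))

# Final user_features matrix: shape (n_users, n_user_features).
user_features = sp.hstack([one_hot, profile_embeddings]).tocsr()
print(user_features.shape)  # (4, 11)
```

One design note: keeping metadata as sparse one-hot columns alongside dense embedding columns lets the model learn a separate latent vector per feature, so users sharing a location are linked even when they never interact with the same item.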

pugantsov commented 5 years ago

@impaktor Yeah, I think my project is more focused on alleviating cold-start issues and proving empirically that adding metadata, external knowledge, embeddings, etc. can increase various evaluation scores. Am I right in thinking that regardless of data availability, it's good to have some element of collaborative filtering? I wasn't sure if the limited interactions were somehow negatively impacting the results. I've been struggling to find quality content-based approaches.

impaktor commented 5 years ago

Adding more features / "metadata" should help if the metadata is meaningful, i.e. actually carries a pattern that helps your predictions (e.g. male/female, age, etc.). But adding more features gives you more parameters, so training will take longer, and depending on how much data you have to train on, you risk overfitting the training set and getting poor performance on the test set.

I see LightFM has regularization parameters, but these default to 0. They may be worth tuning if you're in the overfitting regime.

maciejkula commented 5 years ago

Do I understand correctly that you have pairs of (tweet author, tweet)?

If this is the case, then I think the best you can do is say things like "people from country X write tweets that contain words V, Y, and Z". Is this a relationship you are interested in?

pugantsov commented 5 years ago

@maciejkula My data is specifically in this format: StockTwits: Message JSON Format

With regard to your question about which relationship I'm interested in: I've been tasked with recommending other tweets based on similar content or a similar 'type' of stock, using stock 'cashtags' or maybe content embeddings. For now I'm playing around with what information could potentially be useful. I'm already using the

"symbols": [{
    "id": 686,
    "symbol": "ABC",
    "title": "Alpha Bravo Charlie, inc.",
    "is_following": false,
    "exchange": "NASDAQ",
    "sector": "Technology",
    "industry": "Personal Computers",
    "logo_url": "http://logos.xignite.com/NASDAQGS/00011843.gif",
    "trending": true,
    "trending_score": 16.4019,
    "watchlist_count": 12370
}],

to recommend items in say Technology, like above or even more specifically, Personal Computers.

For now I've used each user's location, identified row by row with spaCy's named-entity recognition and then normalized to a token such as 'newyork'. Item features are a tweet's cashtags (symbols) and the sector and industry they belong to, but I'm getting really bad results on my test sets.
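For what it's worth, one way to turn the symbols payload above into item-feature tokens is to namespace each value ("cashtag:", "sector:", "industry:") so that identically named values from different fields can never collide. The helper name below is hypothetical:

```python
def symbol_features(symbols):
    """Flatten a StockTwits-style 'symbols' list into namespaced feature tokens."""
    feats = []
    for s in symbols:
        feats.append("cashtag:" + s["symbol"])
        feats.append("sector:" + s["sector"])
        feats.append("industry:" + s["industry"])
    return feats

symbols = [{
    "symbol": "ABC",
    "sector": "Technology",
    "industry": "Personal Computers",
}]
print(symbol_features(symbols))
# ['cashtag:ABC', 'sector:Technology', 'industry:Personal Computers']
```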

2019-02-28 18:53:31,849 [MainThread  ] [INFO ]  The dataset has 176 users and 92363 items with 18473 interactions in the test and 73890 interactions in the training set.
2019-02-28 18:53:31,849 [MainThread  ] [INFO ]  Begin fitting collaborative filtering model...
2019-02-28 18:53:34,157 [MainThread  ] [INFO ]  Collaborative Filtering training set AUC: 0.9582749
2019-02-28 18:53:35,507 [MainThread  ] [INFO ]  Collaborative Filtering test set AUC: 0.28770456
2019-02-28 18:53:35,507 [MainThread  ] [INFO ]  There are 92 distinct user locations, 9 distinct sectors, 215 distinct industries and 3929 distinct cashtags.
2019-02-28 18:53:35,508 [MainThread  ] [INFO ]  Begin fitting hybrid model...
2019-02-28 18:53:38,968 [MainThread  ] [INFO ]  Hybrid training set AUC: 0.8867248
2019-02-28 18:53:40,068 [MainThread  ] [INFO ]  Hybrid test set AUC: 0.80986875
2019-02-28 18:53:43,067 [MainThread  ] [INFO ]  Hybrid training set Precision@10: 0.27272728
2019-02-28 18:53:44,150 [MainThread  ] [INFO ]  Hybrid test set Precision@10: 0.0011363636
2019-02-28 18:53:47,148 [MainThread  ] [INFO ]  Hybrid training set Recall@10: 0.009839114903936408
2019-02-28 18:53:48,183 [MainThread  ] [INFO ]  Hybrid test set Recall@10: 0.00015057677451455357
2019-02-28 18:53:48,184 [MainThread  ] [INFO ]  Hybrid training set F1 Score: 0.018993023190626002
2019-02-28 18:53:48,184 [MainThread  ] [INFO ]  Hybrid test set F1 Score: 0.0002659174721051054
2019-02-28 18:53:51,185 [MainThread  ] [INFO ]  Hybrid training set MRR: 0.33839628
2019-02-28 18:53:52,304 [MainThread  ] [INFO ]  Hybrid test set MRR: 0.004157745

I've got a feeling I may be overfitting on the 3929 cashtags, so I was thinking of using only the sector and industry tags and turning the tweet bodies into embeddings with something like word2vec/doc2vec. Maybe also using the 'trending_score' as a feature weight for each industry per tweet?

Any pointers would be greatly appreciated.

maciejkula commented 5 years ago

It does look like some of your features are too granular and help you overfit on the training set. Dropping rare features or adding regularization might help.
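The "drop rare features" suggestion can be sketched in a few lines: count how often each feature token occurs across all items and keep only those above a frequency threshold. The feature lists and threshold here are illustrative:

```python
from collections import Counter

item_feature_lists = [
    ["sector:Technology", "cashtag:ABC"],
    ["sector:Technology", "cashtag:XYZ"],
    ["sector:Energy", "cashtag:ABC"],
]

# Count occurrences of each feature across all items.
counts = Counter(f for feats in item_feature_lists for f in feats)

min_count = 2  # illustrative threshold; tune on validation data
filtered = [
    [f for f in feats if counts[f] >= min_count]
    for feats in item_feature_lists
]
print(filtered)
# [['sector:Technology', 'cashtag:ABC'], ['sector:Technology'], ['cashtag:ABC']]
```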

pugantsov commented 5 years ago

@maciejkula I've decided to make Doc2Vec representations of each tweet body and take out the cashtags. Is there a way to pass pre-trained embeddings to the model?

I saw that:

item_embeddings: np.float32 array of shape [n_item_features, n_components]
         Contains the estimated latent vectors for item features. The [i, j]-th
         entry gives the value of the j-th component for the i-th item feature.
         In the simplest case where the item feature matrix is an identity
         matrix, the i-th row will represent the i-th item latent vector.

But I was a little confused. Would I just train the model as normal without passing any item_features, then set the item_embeddings property to a NumPy matrix that I make myself?

What I wanted to know was, if I was just to use an embedding for each tweet body but retain industry + sector information as separate features to be passed in, would this be possible?
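One way to do what the question describes, without touching item_embeddings at all, is to feed the pre-trained Doc2Vec vectors in as dense feature *columns* of the item_features matrix, concatenated with one-hot sector/industry indicators. LightFM then learns one latent vector per column of that matrix, so the pre-trained vectors enter as feature values rather than by overwriting learned embeddings. A scipy sketch with assumed shapes:

```python
import numpy as np
import scipy.sparse as sp

n_items, doc_dim = 3, 4

# Stand-in for per-tweet Doc2Vec vectors, one row per item.
doc2vec = sp.csr_matrix(
    np.arange(n_items * doc_dim, dtype=float).reshape(n_items, doc_dim)
)

# One-hot sector indicators (assumed example values).
sectors = ["Technology", "Technology", "Energy"]
sector_vocab = {"Technology": 0, "Energy": 1}
sector_onehot = sp.csr_matrix(
    (np.ones(n_items), (np.arange(n_items), [sector_vocab[s] for s in sectors])),
    shape=(n_items, len(sector_vocab)),
)

# item_features of shape (n_items, n_item_features), usable as
# model.fit(interactions, item_features=item_features).
item_features = sp.hstack([doc2vec, sector_onehot]).tocsr()
print(item_features.shape)  # (3, 6)
```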

phiweger commented 4 years ago

@ajhepburn did you make any progress, like, can one just do ...

model = LightFM(loss='warp')
model.item_embeddings = ...
model.user_embeddings = ...

... to pass embeddings? Which order do they have to be in?
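On the ordering question: per the item_embeddings docstring quoted earlier, row i of item_embeddings corresponds to column i of the item_features matrix passed to fit()/predict() (and likewise for users); with no feature matrix, that is simply item i. A numpy sketch of how the representations compose (biases omitted, shapes illustrative):

```python
import numpy as np

n_users, n_items = 2, 3
n_user_features, n_item_features, n_components = 4, 5, 8

rng = np.random.default_rng(42)
user_features = rng.random((n_users, n_user_features))    # usually sparse
item_features = rng.random((n_items, n_item_features))
user_embeddings = rng.normal(size=(n_user_features, n_components))
item_embeddings = rng.normal(size=(n_item_features, n_components))

# A user/item representation is the feature-weighted sum of its feature
# embeddings; a score is the dot product of the two representations.
user_repr = user_features @ user_embeddings  # (n_users, n_components)
item_repr = item_features @ item_embeddings  # (n_items, n_components)
scores = user_repr @ item_repr.T             # (n_users, n_items)
print(scores.shape)  # (2, 3)
```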