lyst / lightfm

A Python implementation of LightFM, a hybrid recommendation algorithm.
Apache License 2.0

Item cold-start prediction #556

Open ZaruhiNavasardyan opened 4 years ago

ZaruhiNavasardyan commented 4 years ago

When predicting with a LightFM model fitted with item features, what should the item id be in the predict function for a new item? It works with 0; is that the right approach, and if so, what is the reasoning behind it? What if we want to predict for several new items?

Thanks in advance.

arsine1996 commented 4 years ago

Same for me: what is the right approach for making predictions for a new item using its features?

Thanks

freytheviking commented 4 years ago

I have had some issues with using LightFM for item cold-start prediction as well. Specifically, they were due to a dimension mismatch in the item embeddings.

My use case is that many new items will be entering our e-commerce platform, and they need to be recommended, where appropriate, to our existing users. Here is what I have worked out so far. If anyone has experience with this, please let me know whether I am on the right track, or offer any advice you have.

Anyway, here's what I have done so far...

1) First, I trained a model with item features. This hasn't been a problem and the model seems to be creating some very meaningful recommendations. I have also visualized and sanity-checked the item embeddings using t-SNE and the embeddings look great.

2) Each item in LightFM has an associated index that refers to that item. This can be found in lightfm.data.Dataset._item_id_mapping as well as lightfm.data.Dataset._item_feature_mapping. If a new item enters the system, there is obviously no index, so we will need to create it. I added a new index with:

lightfm.data.Dataset.fit_partial(items=['cold_start_item'])

Checked this and it seems to have worked.

3) We'll need to generate a new feature matrix if you used identity features. I did something like this:

existing_feature = [('item_1', ['tag_1', 'tag_2']),
                    ('item_2', ['tag_3', 'tag_4', 'tag_5']),
                    ('item_3', ['tag_1']),
                    ('item_4', ['tag_3', 'tag_5', 'tag_10'])]

cold_start_feature = [('cold_start_item', ['tag_1', 'tag_2'])]
features = existing_feature + cold_start_feature

new_feature_matrix = lightfm.data.Dataset.build_item_features(features)

I also sanity-checked this, and it seems to have been created correctly.
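
A minimal sketch of why the feature matrix gets wider in step 3, assuming identity features are enabled (one extra indicator column per item, on top of the tag columns). It uses plain NumPy rather than LightFM itself, and `build_matrix` is a hypothetical helper written for illustration, not a library function.

```python
import numpy as np

def build_matrix(features, tags, identity):
    """Hypothetical helper: dense item-by-feature indicator matrix."""
    items = [item for item, _ in features]
    # With identity features, every item gets its own indicator column.
    columns = (items if identity else []) + tags
    col_index = {name: j for j, name in enumerate(columns)}
    matrix = np.zeros((len(items), len(columns)), dtype=np.float32)
    for i, (item, item_tags) in enumerate(features):
        if identity:
            matrix[i, col_index[item]] = 1.0
        for tag in item_tags:
            matrix[i, col_index[tag]] = 1.0
    return matrix

tags = ["tag_1", "tag_2", "tag_3", "tag_4", "tag_5", "tag_10"]
existing = [("item_1", ["tag_1", "tag_2"]),
            ("item_2", ["tag_3", "tag_4", "tag_5"]),
            ("item_3", ["tag_1"]),
            ("item_4", ["tag_3", "tag_5", "tag_10"])]
cold = [("cold_start_item", ["tag_1", "tag_2"])]

trained = build_matrix(existing, tags, identity=True)           # what the model saw
extended = build_matrix(existing + cold, tags, identity=True)   # one column wider
tags_only = build_matrix(existing + cold, tags, identity=False) # width set by tags only

print(trained.shape, extended.shape, tags_only.shape)
```

With identity columns, every new item adds a feature the trained model has no embedding for. With tag-only features, the width depends only on the tag vocabulary, so a new item with known tags fits the trained embeddings. In `lightfm.data.Dataset` the corresponding switch appears to be the `item_identity_features` constructor argument; treat that mapping as an assumption to verify against your installed version.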

4) Predict step. Here is where I ran into issues. To predict, I did something like this:

# Predict for user with index = 0, the score for item with index 4, which is the 'cold_start_item'
lightfm.LightFM.predict(0, item_ids=np.array([4]), item_features=new_feature_matrix)

I got the following error msg:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

~/.local/share/virtualenvs/recommender-research-Cn0KHt1c/lib/python3.8/site-packages/lightfm/lightfm.py in predict(self, user_ids, item_ids, item_features, user_features, num_threads)
    714 
    715         (user_features,
--> 716          item_features) = self._construct_feature_matrices(n_users,
    717                                                            n_items,
    718                                                            user_features,

~/.local/share/virtualenvs/recommender-research-Cn0KHt1c/lib/python3.8/site-packages/lightfm/lightfm.py in _construct_feature_matrices(self, n_users, n_items, user_features, item_features)
    312         if self.item_embeddings is not None:
    313             if not self.item_embeddings.shape[0] >= item_features.shape[1]:
--> 314                 raise ValueError('The item feature matrix specifies more '
    315                                  'features than there are estimated '
    316                                  'feature embeddings: {} vs {}.'.format(

ValueError: The item feature matrix specifies more features than there are estimated feature embeddings: x vs y.

It seems to be telling me that I have more features than feature embeddings, which is obviously true because I added an additional item to lightfm.data.Dataset._item_id_mapping and lightfm.data.Dataset._item_feature_mapping. However, there is also no feature_embedding/identity embedding for that specific item because it's a new item. As a hack, I just concatenated another vector to lightfm.LightFM.item_embeddings. I called predict again and it worked but gave me a wildly off number (2.3x10^12...).

Is the above approach at least on the right track? Can anyone point out any issues and recommend what to do? Thanks in advance! Would really love to be able to do cold start prediction since that's the reason why LightFM was created in the first place!
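
A rough NumPy sketch of the scoring logic, under two assumptions: an item's representation is the sum of the embeddings of its features, and the item ids passed at prediction time index rows of the feature matrix supplied with them. The parameters are random stand-ins, and `score` is a hypothetical function, not LightFM's `predict`. If the second assumption holds, it would explain why item id 0 works when the matrix contains only the new item, and several new items would simply be rows 0, 1, 2, and so on.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, dim = 6, 8  # six tag features, as in the example above

# Stand-ins for learned parameters (random values, illustration only).
feature_embeddings = rng.normal(size=(n_features, dim))
feature_biases = rng.normal(size=n_features)
user_vector = rng.normal(size=dim)
user_bias = 0.1

def score(item_features, item_ids):
    """Hypothetical scorer: user . (sum of feature embeddings) + biases."""
    # Same kind of guard as the ValueError in the traceback above.
    if item_features.shape[1] > feature_embeddings.shape[0]:
        raise ValueError("more features than estimated feature embeddings")
    rows = item_features[item_ids]  # item ids select rows of the matrix
    n_cols = rows.shape[1]
    item_vectors = rows @ feature_embeddings[:n_cols]
    item_biases = rows @ feature_biases[:n_cols]
    return item_vectors @ user_vector + item_biases + user_bias

# Columns: tag_1, tag_2, tag_3, tag_4, tag_5, tag_10 (no identity columns).
new_items = np.array([[1, 1, 0, 0, 0, 0],   # cold_start_item
                      [0, 0, 1, 0, 1, 1]],  # a second, made-up new item
                     dtype=float)
scores = score(new_items, np.array([0, 1]))
print(scores)
```

Because the new items only use feature columns that already have embeddings, the shape check passes and no extra embedding vector has to be appended by hand.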

igorkf commented 3 years ago

I think every time you have a new item, you should use model.fit_partial() on the pre-trained model, to add this new item to the existing model.

wagnerjorge commented 1 year ago

You can use get_dummies (from pandas) when preparing your item and user data. This solves the problem.
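
One way to read this suggestion, sketched with pandas only: one-hot encode the tags once for the training items, then align any new item to the same columns so the feature width never changes. The comment does not show code, so this is an interpretation. It uses `Series.str.get_dummies` (a variant of `pd.get_dummies` that handles several tags per item), and the comma-separated tag strings are an assumed input format.

```python
import pandas as pd

# Training items and their tags, as comma-separated strings (assumed format).
train_tags = pd.Series({
    "item_1": "tag_1,tag_2",
    "item_2": "tag_3,tag_4,tag_5",
    "item_3": "tag_1",
    "item_4": "tag_3,tag_5,tag_10",
})
# One indicator column per tag seen during training.
train_features = train_tags.str.get_dummies(sep=",")

# Encode the new item, then force it onto the training columns:
# unseen tags are dropped and missing tags are filled with 0.
new_tags = pd.Series({"cold_start_item": "tag_1,tag_2"})
new_features = (
    new_tags.str.get_dummies(sep=",")
    .reindex(columns=train_features.columns, fill_value=0)
)

print(new_features)
```

The resulting frame has one row per new item and exactly the training columns, so it can be converted to a sparse matrix (for example with `scipy.sparse.csr_matrix(new_features.values)`) before being passed as item features.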