lyst / lightfm

A Python implementation of LightFM, a hybrid recommendation algorithm.
Apache License 2.0

Finding the precision and auc scores. #369

Open nishal999 opened 6 years ago

nishal999 commented 6 years ago

I am building a recommendation model for user-article dataset where each interaction is represented by 1.

model = LightFM(loss='warp', item_alpha=ITEM_ALPHA, user_alpha=USER_ALPHA, no_components=NUM_COMPONENTS, learning_rate=LEARNING_RATE, learning_schedule=LEARNING_SCHEDULE)

model = model.fit(train, item_features=itemf, user_features=uf, epochs=NUM_EPOCHS, num_threads=NUM_THREADS)

print("train shape: ", train.shape)
print("test shape: ", test.shape)

train shape: (25900, 790) test shape: (25900, 790)

My predict model looks like this:

predictions = model.predict( user_id, pid_array, user_features=uf, item_features=itemf, num_threads=4)

where pid_array contains the indices of the items to score.

train_precision = precision_at_k(model, train, k=10).mean()

I am trying to predict the precision and subsequently want auc score also. But I get this error.

Traceback (most recent call last):
  File "new_light_fm.py", line 366, in <module>
    train_precision = precision_at_k(model, train, k=10).mean()
  File "/home/nt/anaconda3/lib/python3.6/site-packages/lightfm/evaluation.py", line 69, in precision_at_k
    check_intersections=check_intersections,
  File "/home/nt/anaconda3/lib/python3.6/site-packages/lightfm/lightfm.py", line 807, in predict_rank
    raise ValueError('Incorrect number of features in item_features')
ValueError: Incorrect number of features in item_features

DoronGi commented 6 years ago

Your model is using user and item features. The evaluation functions need these arguments. Try:

train_precision = precision_at_k(model, train, k=10, user_features=uf, item_features=itemf).mean()
nishal999 commented 6 years ago

Ah yes, now I get it. It's solved now. Thank you so much!

nishal999 commented 6 years ago

I am trying to validate my model against train and test data. For example, I want to use only the first 11 months of interactions as my training set and test on user ids from the 12th month. The problem is that I cannot have different dimensionalities for train and test, as mentioned by @maciejkula here. I have lost direction on how to proceed.

DoronGi commented 6 years ago

There is no problem doing so. You build one Dataset to which you fit all your users, user features, items, and item features by calling fit and fit_partial. Similarly, you call build_user_features and build_item_features to build the feature matrices for all users and items. Then you call build_interactions twice: once with the interactions of the first user group (months 1-11) to get the train interaction matrix, and a second time with the interactions of the second user group (month 12) to get the test matrix.

maciejkula commented 6 years ago

Again, @DoronGi's answer is exactly correct.

nishal999 commented 6 years ago

Thank you @maciejkula and @DoronGi

ctivanovich commented 5 years ago

I have this problem, but without using features, e.g.

model.item_biases = 0
model.fit(X_train, 
          num_threads = 6)

train_precision = precision_at_k(model, X_train, k=10).mean()
test_precision = precision_at_k(model, X_test, k=10).mean()

I get the same error at the test_precision call about having an incorrect number of user features. The dimensions of my train and test matrices are as follows:


<4290x40744 sparse matrix of type '<class 'numpy.float32'>'
    with 73414 stored elements in Compressed Sparse Row format>
<1430x40744 sparse matrix of type '<class 'numpy.float32'>'
    with 26586 stored elements in Compressed Sparse Row format>
EralpB commented 5 years ago

@ctivanovich I think you should keep the matrix sizes the same and just fill in zero rows for the users you want to exclude; this way it's much less confusing. This library (or maybe it's the industry standard) assumes the row number is the user id.

The way I do that is to convert the interactions to a lil matrix, set the interactions I want to exclude to 0, then convert back to coo and train/test.
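The masking approach can be sketched with scipy alone (the tiny 3x3 matrix and the choice of which user to hold out are invented for illustration):

```python
# Sketch: keep train and test the same shape; instead of dropping rows,
# move a user's interactions into the test matrix and zero them in train.
import numpy as np
import scipy.sparse as sp

interactions = sp.coo_matrix(np.array([[1, 0, 1],
                                       [0, 1, 0],
                                       [1, 1, 0]], dtype=np.float32))

test_user_rows = [2]  # hypothetical: users whose interactions go to test

csr = interactions.tocsr()
train = interactions.tolil(copy=True)
test = sp.lil_matrix(interactions.shape, dtype=np.float32)
for u in test_user_rows:
    test[u, :] = csr[u].toarray()   # copy the row into the test matrix
    train[u, :] = 0                 # and zero it out of train

# Convert back to COO for LightFM; shapes still match.
train, test = train.tocoo(), test.tocoo()
```

Since every interaction lands in exactly one split, train + test reconstructs the original matrix, and both splits keep the full (n_users, n_items) shape.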

tracthuc commented 4 years ago

I have a question: why do we need k to compute precision, but auc_score does not require it? Thank you.

ctivanovich commented 4 years ago

@tracthuc This isn't really the place for a question like that; you should ask on e.g. StackExchange. But in a nutshell, AUC simply doesn't require k: it isn't a measure tied to a cutoff in the ranking of the recommendations. I highly recommend this video series: https://www.youtube.com/watch?v=4jRBRDbJemM.
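A tiny numpy illustration of the difference (the scores and relevance labels are made up): precision needs a cutoff k because it only looks at the top of the list, while AUC compares every positive/negative score pair, so no cutoff appears anywhere.

```python
# Toy example: precision@k depends on the cutoff k; AUC does not.
import numpy as np

scores = np.array([0.9, 0.8, 0.4, 0.35, 0.1])     # model scores for 5 items
relevant = np.array([1, 0, 1, 0, 0], dtype=bool)  # ground-truth relevance

# precision@k: fraction of the top-k items that are relevant.
k = 2
top_k = np.argsort(-scores)[:k]
prec_at_k = relevant[top_k].mean()                # 1 of top 2 -> 0.5

# AUC: fraction of (positive, negative) pairs ranked correctly,
# over ALL such pairs -- no k involved.
pos, neg = scores[relevant], scores[~relevant]
auc = (pos[:, None] > neg[None, :]).mean()        # 5 of 6 pairs -> 0.833...
```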

adsk2050 commented 3 years ago

I have a sparse matrix (train/test data) of shape (1407580, 235061), which means there are around 330 billion user_id/item_id combinations. This causes precision_at_k and the other metrics to take far too long to compute. I am thinking of computing precision at k for only a small sample of users by writing the code myself. Would this be good enough for model validation?

selalamiTF commented 3 years ago

Quoting @DoronGi:

> There is no problem doing so. You build one Dataset to which you fit all your users, user features, items, and item features by calling fit and fit_partial. Similarly, you call build_user_features and build_item_features to build the feature matrices for all users and items. Then you call build_interactions twice: once with the interactions of the first user group (months 1-11) to get the train interaction matrix, and a second time with the interactions of the second user group (month 12) to get the test matrix.

Wouldn't it be better to split the metadata from the beginning as well, as we normally do for other ML problems?