recommenders-team / recommenders

Best Practices on Recommendation Systems
https://recommenders-team.github.io/recommenders/intro.html
MIT License

[ASK] SLi Rec model doesn't improve on training. #1966

Closed manishvee closed 1 year ago

manishvee commented 1 year ago

Description

I tried running the quick start notebook provided here on my local machine. As far as I can tell, I have all the training files the model needs. However, at the end of 10 epochs, the model only reached an AUC of ~0.5 after starting from ~0.49, and I don't understand why. The only factor I can think of that might cause issues is that I'm using TensorFlow for M2 Macs, which I understand has some known problems. Other than that, I'm stuck. Any help would be appreciated.

Other Comments

Here's the output from the model training:

2023-08-14 10:47:23.473809: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
step 20 , total_loss: 1.5846, data_loss: 1.5846
step 40 , total_loss: 1.3459, data_loss: 1.3459
eval valid at epoch 1: auc:0.493,logloss:0.6977,mean_mrr:0.44,ndcg@2:0.3008,ndcg@4:0.4984,ndcg@6:0.577,group_auc:0.482
step 20 , total_loss: 0.5859, data_loss: 0.5859
step 40 , total_loss: 0.3387, data_loss: 0.3387
eval valid at epoch 2: auc:0.4896,logloss:0.5833,mean_mrr:0.446,ndcg@2:0.3092,ndcg@4:0.4981,ndcg@6:0.5814,group_auc:0.4836
step 20 , total_loss: 0.1450, data_loss: 0.1450
step 40 , total_loss: 0.0671, data_loss: 0.0671
eval valid at epoch 3: auc:0.4984,logloss:0.5172,mean_mrr:0.4526,ndcg@2:0.3202,ndcg@4:0.5088,ndcg@6:0.5865,group_auc:0.494
step 20 , total_loss: 0.0591, data_loss: 0.0591
step 40 , total_loss: 0.0373, data_loss: 0.0373
eval valid at epoch 4: auc:0.4955,logloss:0.512,mean_mrr:0.451,ndcg@2:0.3171,ndcg@4:0.5068,ndcg@6:0.5853,group_auc:0.4933
step 20 , total_loss: 0.0432, data_loss: 0.0432
step 40 , total_loss: 0.0255, data_loss: 0.0255
eval valid at epoch 5: auc:0.5046,logloss:0.7092,mean_mrr:0.4623,ndcg@2:0.3324,ndcg@4:0.5199,ndcg@6:0.594,group_auc:0.5067
step 20 , total_loss: 0.0106, data_loss: 0.0106
step 40 , total_loss: 0.0288, data_loss: 0.0288
eval valid at epoch 6: auc:0.4986,logloss:3.6601,mean_mrr:0.4512,ndcg@2:0.3187,ndcg@4:0.5119,ndcg@6:0.5857,group_auc:0.4982
step 20 , total_loss: 0.0159, data_loss: 0.0159
step 40 , total_loss: 0.0096, data_loss: 0.0096
eval valid at epoch 7: auc:0.4913,logloss:0.5893,mean_mrr:0.4509,ndcg@2:0.3208,ndcg@4:0.5024,ndcg@6:0.5852,group_auc:0.49
step 20 , total_loss: 0.0285, data_loss: 0.0285
step 40 , total_loss: 0.0048, data_loss: 0.0048
eval valid at epoch 8: auc:0.5065,logloss:0.5384,mean_mrr:0.461,ndcg@2:0.3337,ndcg@4:0.5198,ndcg@6:0.5931,group_auc:0.5088
step 20 , total_loss: 0.0056, data_loss: 0.0056
step 40 , total_loss: 0.0060, data_loss: 0.0060
eval valid at epoch 9: auc:0.5133,logloss:0.8202,mean_mrr:0.4725,ndcg@2:0.3499,ndcg@4:0.529,ndcg@6:0.6018,group_auc:0.5189
step 20 , total_loss: 0.0055, data_loss: 0.0055
step 40 , total_loss: 0.0129, data_loss: 0.0129
eval valid at epoch 10: auc:0.5015,logloss:0.5818,mean_mrr:0.4608,ndcg@2:0.3305,ndcg@4:0.5154,ndcg@6:0.5928,group_auc:0.5038
[(1, {'auc': 0.493, 'logloss': 0.6977, 'mean_mrr': 0.44, 'ndcg@2': 0.3008, 'ndcg@4': 0.4984, 'ndcg@6': 0.577, 'group_auc': 0.482}),
 (2, {'auc': 0.4896, 'logloss': 0.5833, 'mean_mrr': 0.446, 'ndcg@2': 0.3092, 'ndcg@4': 0.4981, 'ndcg@6': 0.5814, 'group_auc': 0.4836}),
 (3, {'auc': 0.4984, 'logloss': 0.5172, 'mean_mrr': 0.4526, 'ndcg@2': 0.3202, 'ndcg@4': 0.5088, 'ndcg@6': 0.5865, 'group_auc': 0.494}),
 (4, {'auc': 0.4955, 'logloss': 0.512, 'mean_mrr': 0.451, 'ndcg@2': 0.3171, 'ndcg@4': 0.5068, 'ndcg@6': 0.5853, 'group_auc': 0.4933}),
 (5, {'auc': 0.5046, 'logloss': 0.7092, 'mean_mrr': 0.4623, 'ndcg@2': 0.3324, 'ndcg@4': 0.5199, 'ndcg@6': 0.594, 'group_auc': 0.5067}),
 (6, {'auc': 0.4986, 'logloss': 3.6601, 'mean_mrr': 0.4512, 'ndcg@2': 0.3187, 'ndcg@4': 0.5119, 'ndcg@6': 0.5857, 'group_auc': 0.4982}),
 (7, {'auc': 0.4913, 'logloss': 0.5893, 'mean_mrr': 0.4509, 'ndcg@2': 0.3208, 'ndcg@4': 0.5024, 'ndcg@6': 0.5852, 'group_auc': 0.49}),
 (8, {'auc': 0.5065, 'logloss': 0.5384, 'mean_mrr': 0.461, 'ndcg@2': 0.3337, 'ndcg@4': 0.5198, 'ndcg@6': 0.5931, 'group_auc': 0.5088}),
 (9, {'auc': 0.5133, 'logloss': 0.8202, 'mean_mrr': 0.4725, 'ndcg@2': 0.3499, 'ndcg@4': 0.529, 'ndcg@6': 0.6018, 'group_auc': 0.5189}),
 (10, {'auc': 0.5015, 'logloss': 0.5818, 'mean_mrr': 0.4608, 'ndcg@2': 0.3305, 'ndcg@4': 0.5154, 'ndcg@6': 0.5928, 'group_auc': 0.5038})]
best epoch: 9
Time cost for training is 52.33 mins
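As a quick sanity check, the per-epoch validation AUC from the run can be tabulated to confirm it never moves meaningfully away from chance level (a minimal sketch; the values are copied from the log above):

```python
# Validation AUC per epoch, copied from the training log above.
auc_history = {
    1: 0.493, 2: 0.4896, 3: 0.4984, 4: 0.4955, 5: 0.5046,
    6: 0.4986, 7: 0.4913, 8: 0.5065, 9: 0.5133, 10: 0.5015,
}

best_epoch = max(auc_history, key=auc_history.get)
spread = max(auc_history.values()) - min(auc_history.values())

print(f"best epoch: {best_epoch}, AUC: {auc_history[best_epoch]}")
print(f"AUC spread across 10 epochs: {spread:.4f}")  # ~0.02: never leaves chance level
```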

miguelgfierro commented 1 year ago

It seems the model is not learning; you should see the logloss going down consistently, as in the notebook: https://github.com/recommenders-team/recommenders/blob/main/examples/00_quick_start/sequential_recsys_amazondataset.ipynb

A lot of the time, this happens because either the data is not set up correctly or the data has no signal. I would recommend starting from the current example in the repo and gradually substituting your own data until you see the loss decrease.
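For intuition on why an AUC stuck at ~0.5 points to "no signal": AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative, so scores unrelated to the labels land at 0.5 no matter how small the training loss gets. A minimal, self-contained illustration (not the recommenders evaluation code):

```python
import random

def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

random.seed(0)
labels = [random.randint(0, 1) for _ in range(2000)]

noise = [random.random() for _ in labels]          # scores unrelated to the labels
signal = [y + random.gauss(0, 1) for y in labels]  # scores correlated with the labels

print(f"AUC, pure noise:  {auc(labels, noise):.3f}")   # stays near 0.5, like the run above
print(f"AUC, real signal: {auc(labels, signal):.3f}")  # well above 0.5
```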