recommenders-team / recommenders

Best Practices on Recommendation Systems
https://recommenders-team.github.io/recommenders/intro.html
MIT License

[BUG] NRMS Logloss Failure #1803

Status: Open · Bhammin opened this issue 2 years ago

Bhammin commented 2 years ago

Description

When using the notebook nrms_MIND.ipynb, if I try to include logloss as a metric (by updating nrms.yaml), training the model fails. Please see the image below for the exact failure when calling model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file).

[Screenshot (2022-07-25): traceback raised inside model.fit when logloss is included in the metrics]

The failure happens in cal_metric(labels, preds, metrics). I think the issue is that preds is a list of numpy arrays with different lengths, for example: [[.1, .2], [.1, .1, .2], [.8]].
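For context, a minimal sketch (toy values invented for illustration, not taken from the notebook) of why sklearn's log_loss chokes on that structure; the exact error message depends on the numpy/sklearn versions:

    import numpy as np
    from sklearn.metrics import log_loss

    # Toy ragged inputs mimicking what cal_metric receives: each
    # impression has a different number of candidate news items.
    preds = [np.array([0.1, 0.2]), np.array([0.1, 0.1, 0.2]), np.array([0.8])]
    labels = [np.array([0, 1]), np.array([0, 0, 1]), np.array([1])]

    # log_loss expects flat, equal-length arrays, so the ragged
    # nested structure fails during input validation.
    log_loss(labels, preds)  # raises ValueError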

I was able to calculate logloss by making the following change to cal_metric(labels, preds, metrics):

    elif metric == "logloss":
        # preds and labels are lists of numpy arrays with different lengths;
        # the lengths match pairwise between the inner lists of labels and preds.

        # log_loss cannot handle the ragged arrays, so flatten the
        # list of arrays into a single flat list
        predSingleList = [i for p in preds for i in p]
        labelSingleList = [i for lab in labels for i in lab]

        # clip predictions to avoid logloss returning nan (log of zero)
        predSingleList = [max(min(p, 1.0 - 10e-12), 10e-12) for p in predSingleList]

        logloss = log_loss(labelSingleList, predSingleList)
        res["logloss"] = round(logloss, 4)

Since I am only playing with the NRMS model at the moment, I'm not sure whether this fix would work for all models.

In which platform does it happen?

Jupyter Lab running on Ubuntu 20.04.4 LTS (Focal Fossa), with Python 3.8.10 and TensorFlow 2.8.0.

How do we replicate the issue?

Using https://github.com/microsoft/recommenders/blob/main/examples/00_quick_start/nrms_MIND.ipynb, update the metrics hyperparameter to include logloss. hparams should look like this:

    HParams object with values {
        'support_quick_scoring': True, 'dropout': 0.2, 'attention_hidden_dim': 200,
        'head_num': 20, 'head_dim': 20, 'filter_num': 200, 'window_size': 3,
        'vert_emb_dim': 100, 'subvert_emb_dim': 100, 'gru_unit': 400, 'type': 'ini',
        'user_emb_dim': 50, 'learning_rate': 0.0001, 'optimizer': 'adam', 'epochs': 3,
        'batch_size': 32, 'show_step': 10, 'title_size': 30, 'his_size': 50,
        'data_format': 'news', 'npratio': 4,
        'metrics': ['logloss', 'group_auc', 'mean_mrr', 'ndcg@5;10'],
        'word_emb_dim': 300, 'model_type': 'nrms', 'loss': 'cross_entropy_loss',
        'wordEmb_file': './data/nrms/mind-small/utils/embedding.npy',
        'wordDict_file': './data/nrms/mind-small/utils/word_dict.pkl',
        'userDict_file': './data/nrms/mind-small/utils/uid2index.pkl'
    }
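One way to get there without hand-editing nrms.yaml is to pass the override through prepare_hparams, which the notebook already uses for values like epochs and batch_size. A minimal sketch, assuming the notebook's yaml_file, wordEmb_file, wordDict_file, and userDict_file variables are already set by its MIND download step:

    from recommenders.models.newsrec.newsrec_utils import prepare_hparams

    # keyword arguments override the corresponding values in the YAML file
    hparams = prepare_hparams(
        yaml_file,
        wordEmb_file=wordEmb_file,
        wordDict_file=wordDict_file,
        userDict_file=userDict_file,
        batch_size=32,
        epochs=3,
        metrics=["logloss", "group_auc", "mean_mrr", "ndcg@5;10"],  # adds logloss
    )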

Expected behavior (i.e. solution)

The validation logloss should be calculated.

Other Comments

Thank you for providing quick start guides for many models!

canonrock16 commented 1 year ago

The same failure happens in the NAML model. I resolved it as below:

    elif metric == "logloss":
        # clip predictions to avoid logloss returning nan (log of zero)
        # preds = [max(min(p, 1.0 - 10e-12), 10e-12) for p in preds]
        preds = [np.clip(p, 10e-12, 1.0 - 10e-12) for p in preds]

        # flatten the ragged list of arrays into single 1-D arrays
        preds = np.concatenate(preds)
        labels = np.concatenate(labels)

        logloss = log_loss(labels, preds)
        res["logloss"] = round(logloss, 4)
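As a quick sanity check of the clip-and-concatenate approach (toy values invented for illustration, matching the ragged example from the original report):

    import numpy as np
    from sklearn.metrics import log_loss

    preds = [np.array([0.1, 0.2]), np.array([0.1, 0.1, 0.2]), np.array([0.8])]
    labels = [np.array([0, 1]), np.array([0, 0, 1]), np.array([1])]

    # clip as above to guard against log(0) on extreme predictions
    preds = [np.clip(p, 10e-12, 1.0 - 10e-12) for p in preds]

    logloss = log_loss(np.concatenate(labels), np.concatenate(preds))
    print(round(logloss, 4))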