massquantity / LibRecommender

Versatile End-to-End Recommender System
https://librecommender.readthedocs.io/

How to ensure the model is trained on all the users and items? #423

Open jpzhangvincent opened 9 months ago

jpzhangvincent commented 9 months ago

Is there a data split function to make sure the model is trained on all the users and items? Instead of a random split, we want all the embeddings to be learned and available for every user and item during training, while we can still evaluate on the validation/test set.

From the documentation:

- split_by_ratio: for each user, assign a certain ratio of items to the test data.
- split_by_num: for each user, assign a certain number of items to the test data.
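
For reference, these are called roughly like this (a sketch assuming the libreco.data import path and the test_size argument shown in the docs; exact signatures may differ by version):

import pandas as pd
from libreco.data import split_by_ratio

# hypothetical file with "user", "item", "label" columns
data = pd.read_csv("sample_data.csv")

# for each user, hold out 20% of their items for evaluation
train_df, eval_df = split_by_ratio(data, test_size=0.2)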

I'm not sure that would serve my purpose, but correct me if I'm wrong!

Another question: if we can't cover all the users and items in training, do we still need to re-train the model with the same parameters on all the combined data (i.e. train + validation + test) before the latest model is used for inference in production? I'm a bit confused about when to retrain to make sure all the items and users are covered.

massquantity commented 9 months ago

After splitting, you can manually move all the OOV (out-of-vocabulary) users/items from eval_data to train_data:

import numpy as np
import pandas as pd

def move_oov_to_train(train_data: pd.DataFrame, eval_data: pd.DataFrame):
    # users/items the model will actually see during training
    unique_users = np.unique(train_data["user"])
    unique_items = np.unique(train_data["item"])
    # mark eval rows whose user or item never appears in train_data
    eval_user_oov = np.isin(eval_data["user"].to_numpy(), unique_users, invert=True)
    eval_item_oov = np.isin(eval_data["item"].to_numpy(), unique_items, invert=True)
    oov_mask = eval_user_oov | eval_item_oov
    # move those rows to train_data so every user/item gets a learned embedding
    train_data = pd.concat([train_data, eval_data[oov_mask]], axis=0)
    eval_data = eval_data[~oov_mask]
    return train_data, eval_data

train_data, eval_data = split_by_ratio(...)
train_data, eval_data = move_oov_to_train(train_data, eval_data)

It is better to retrain using all the combined data after initial training.
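
A minimal sketch of that final step, assuming DatasetPure from libreco.data and plain user/item/label columns (use DatasetFeat instead if you have extra features):

import pandas as pd
from libreco.data import DatasetPure

# combine the splits back into one DataFrame
full_df = pd.concat([train_df, eval_df], ignore_index=True)

# rebuild the trainset and DataInfo from the full data
train_data, data_info = DatasetPure.build_trainset(full_df)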

jpzhangvincent commented 9 months ago

It seems like it could be a nice extension to implement in the split_ function.

I have a few more questions about the re-training process.

For the retraining using the combined data, we don't need to split the data anymore, so we can just set eval_data=None in the model.fit() function, correct?

Do we have to save the initially trained model and DataInfo (from the train-test split) first and call the DatasetFeat.merge_ and model.rebuild functions, as in this example (https://github.com/massquantity/LibRecommender/blob/master/examples/model_retrain_example.py)? It seems a bit unnecessary to me. How can we do this in a streamlined workflow without saving the intermediate artifacts, since we only want to save the final retrained model?

massquantity commented 9 months ago

OK, I'll consider adding this extension.

"retrain" in this library means incremental training, which is used when one gets some new data, potentially several days later. So the previously trained model has to be saved.

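The flow in that case looks roughly like this (a sketch from the linked model_retrain_example.py as I recall it; treat the exact names and signatures, e.g. merge_trainset and rebuild_model, as assumptions and check the example for your version):

# after the initial training: persist the model and its DataInfo
data_info.save(path="model_dir", model_name="my_model")
model.save(path="model_dir", model_name="my_model", manual=True)

# days later, with new data: load the DataInfo, merge the new data into it,
# build a model with the same hyper-parameters, then restore the old weights
from libreco.data import DataInfo, DatasetFeat

data_info = DataInfo.load("model_dir", model_name="my_model")
train_data, data_info = DatasetFeat.merge_trainset(new_train_df, data_info)
new_model = ...  # same algorithm and hyper-parameters, data_info=data_info
new_model.rebuild_model(path="model_dir", model_name="my_model", full_assign=True)
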
In your case, the "retrain" feature is not needed. What I meant in the previous reply is using all the data (train + validation + test) and training again with the same hyper-parameters after the initial training.
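
In code, the streamlined version is just a second from-scratch run (assuming the neg_sampling argument of recent versions; eval_data already defaults to None in fit):

# same hyper-parameters, data_info built from the combined data
model.fit(train_data, neg_sampling=True, verbose=2)
model.save(path="model_dir", model_name="my_model")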

jpzhangvincent commented 8 months ago

@massquantity Another question: for the ranking task with implicit data, which only contains positive examples, how can we still get learned embeddings for items that are not included in the data (because they were never transacted)? Similarly, what about user embeddings for users that don't appear in the transaction data? Should we include negative samples with label 0 for those items if we want to learn their embeddings? And how does the library handle the id feature (especially for new items/users)?

massquantity commented 7 months ago

The embeddings of users/items with no transactions can't be learned, since there is no data to train them on. The library falls back to the mean embedding, so all these users will have the same embedding:


user_embeds = model.get_user_embedding()  # shape: (n_users, embed_size)
no_transaction_user_embed = np.mean(user_embeds, axis=0)
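
Presumably the same fallback applies on the item side, assuming get_item_embedding mirrors get_user_embedding (treat this call as an assumption):

item_embeds = model.get_item_embedding()  # shape: (n_items, embed_size)
no_transaction_item_embed = np.mean(item_embeds, axis=0)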

jpzhangvincent commented 7 months ago

Hmm, I thought the model could still infer embeddings for new or long-tail items if we pass features that generalize across the item/user set, since a two-tower model can learn the representations/interactions of the features as a content-based approach. Correct me if I'm wrong? Also, are you saying we can't use explicit data (with both positive and negative samples) for the ranking task?

massquantity commented 7 months ago

OK, I get it. You didn't mention that you could use other features in your data. This is supported by using Dynamic Embedding Generation. However, this approach only supports new users. Maybe item embedding generation will be added in the future.
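
For a new user, this looks roughly like the following (a sketch assuming the user_feats parameter of recommend_user from the docs; the feature names are hypothetical):

# generate an embedding for an unseen user from their features
rec = model.recommend_user(
    user="brand_new_user",
    n_rec=10,
    cold_start="average",  # fall back to the mean embedding if needed
    user_feats={"age": 30, "occupation": 2},  # hypothetical feature columns
)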