Open jpzhangvincent opened 9 months ago
After splitting, you can manually move all the OOV (out-of-vocabulary) users/items from `eval_data` to `train_data`:

```python
import numpy as np
import pandas as pd

def move_oov_to_train(train_data: pd.DataFrame, eval_data: pd.DataFrame):
    unique_users = np.unique(train_data["user"])
    unique_items = np.unique(train_data["item"])
    # A row is OOV if its user or item never appears in the training data.
    eval_user_oov = np.isin(eval_data["user"].to_numpy(), unique_users, invert=True)
    eval_item_oov = np.isin(eval_data["item"].to_numpy(), unique_items, invert=True)
    oov_mask = eval_user_oov | eval_item_oov
    train_data = pd.concat([train_data, eval_data[oov_mask]], axis=0)
    eval_data = eval_data[~oov_mask]
    return train_data, eval_data

train_data, eval_data = split_by_ratio(...)
train_data, eval_data = move_oov_to_train(train_data, eval_data)
```
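To make the behavior concrete, here is the same masking logic run on a tiny hand-made interaction table (the `user`/`item` column names follow the snippet above; the data itself is made up for illustration):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"user": [1, 1, 2], "item": [10, 11, 10]})
evals = pd.DataFrame({"user": [2, 3], "item": [11, 10]})  # user 3 never appears in train

# A row is OOV if its user or item is absent from the training data.
user_oov = np.isin(evals["user"].to_numpy(), np.unique(train["user"]), invert=True)
item_oov = np.isin(evals["item"].to_numpy(), np.unique(train["item"]), invert=True)
oov_mask = user_oov | item_oov

train = pd.concat([train, evals[oov_mask]], axis=0)
evals = evals[~oov_mask]
print(len(train), len(evals))  # 4 1
```

The row for user 3 is moved into the training data, while the (user 2, item 11) row stays in the evaluation set because both its user and item are known.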
It is better to retrain using all the combined data after initial training.
It seems like it could be a nice extension to implement in the `split_` function.
I have a few more questions about the re-training process.

When retraining on the combined data, we don't need to split the data anymore, so we can just pass `eval_data=None` to `model.fit()`, correct?

Do we have to save the initially trained model and dataInfo (from the train-test split) first, and then call the `DatasetFeat.merge_` and `model.rebuild` functions as in this example: https://github.com/massquantity/LibRecommender/blob/master/examples/model_retrain_example.py? That seems a bit unnecessary to me. Is there a streamlined workflow that avoids saving the intermediate artifacts, since we only want to save the final retrained model?
OK, I'll consider adding this extension.

"Retrain" in this library means incremental training, which is used when one gets new data, potentially several days later. That's why the previously trained model has to be saved.

In your case, the "retrain" feature is not needed. What I meant in the previous reply is using all the data (train + validation + test) and training again with the same hyper-parameters after the initial training.
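In code, "training again" on the full data is just concatenating the splits and fitting a fresh model (the toy DataFrames and the commented-out `fit` call below are schematic, not the library's exact API):

```python
import pandas as pd

# Hypothetical splits produced earlier; column names follow the thread's snippets.
train = pd.DataFrame({"user": [1, 2], "item": [10, 11], "label": [1, 1]})
evals = pd.DataFrame({"user": [1], "item": [11], "label": [1]})
test = pd.DataFrame({"user": [2], "item": [10], "label": [1]})

# Combine every split and train a fresh model with the tuned hyper-parameters.
full_data = pd.concat([train, evals, test], ignore_index=True)
# model.fit(full_data, ...)  # schematic; no held-out eval_data is needed here
print(len(full_data))  # 4
```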
@massquantity Another question: for the ranking task with implicit data, which only contains positive examples, how can we still get learned embeddings for items that are not included in the data (because they were never transacted)? Similarly for the user embeddings of users that don't appear in the transaction data. Should we include negative samples with label 0 for those items if we want to learn their embeddings? And how does the library handle the id feature (especially for new items/users)?
The embeddings of users/items with no transactions can't be learned, since there is no data to train them on. The library applies a mean embedding, so all these users will have the same embedding.

```python
user_embeds = model.get_user_embedding()  # shape: (n_users, embed_size)
no_transaction_user_embed = np.mean(user_embeds, axis=0)
```
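A self-contained illustration of that fallback, with random vectors standing in for trained embeddings and a hypothetical `lookup` helper (not a library function):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, embed_size = 5, 4
user_embeds = rng.normal(size=(n_users, embed_size))  # stand-in for model.get_user_embedding()

# Users unseen in training all share the mean of the learned embeddings.
fallback = user_embeds.mean(axis=0)

def lookup(user_id: int) -> np.ndarray:
    # Hypothetical helper: known ids index the table, unknown ids get the mean.
    return user_embeds[user_id] if 0 <= user_id < n_users else fallback

print(lookup(2).shape, np.allclose(lookup(99), fallback))  # (4,) True
```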
Hmm, I thought the model could still infer embeddings for new or long-tail items if we pass features that generalize across the item/user set, since a two-tower model can learn the representations/interactions of the features as a content-based approach. Correct me if I'm wrong? Also, are you saying we can't use explicit data (with both positive and negative samples) for the ranking task?
OK, I get it. You didn't mention that you have other features in your data. This is supported by using Dynamic Embedding Generation. However, this approach currently only supports new users; maybe item embedding generation will be added in the future.
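Conceptually, generating an embedding for a cold-start user from its features can be as simple as pooling the learned feature embeddings. A toy numpy sketch of the idea (the feature names, table sizes, and averaging scheme are all made up, not the library's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
embed_size = 4
# Stand-ins for learned embedding tables of two categorical features.
age_group_embeds = rng.normal(size=(3, embed_size))  # e.g. young / mid / senior
country_embeds = rng.normal(size=(2, embed_size))

def cold_start_user_embed(age_group: int, country: int) -> np.ndarray:
    # Pool (here: average) the feature embeddings to stand in for the id embedding.
    return (age_group_embeds[age_group] + country_embeds[country]) / 2

new_user = cold_start_user_embed(age_group=0, country=1)
print(new_user.shape)  # (4,)
```

This is why a feature-rich two-tower model can score users it has never seen: the features carry the signal even when the id embedding was never trained.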
Is there a data split function that makes sure the model is trained on all the users and items? Instead of a random split, we want all the embeddings to be learned and available for every user and item during training, while still being able to evaluate on the validation/test set.
From the documentation:

- `split_by_ratio`: For each user, assign a certain ratio of items to the test data.
- `split_by_num`: For each user, assign a certain number of items to the test data.
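For intuition, here is a rough pandas sketch of what a per-user split like `split_by_num` does, assuming each user's interactions are already time-ordered (the helper name and toy data are made up; the library's real implementation also handles edge cases such as users with too few interactions):

```python
import pandas as pd

def split_last_n_per_user(data: pd.DataFrame, n: int = 1):
    # For each user, send the last n interactions to the test set, so every
    # user keeps at least one row in training (assuming they have > n rows).
    test_idx = data.groupby("user").tail(n).index
    return data.drop(test_idx), data.loc[test_idx]

data = pd.DataFrame({"user": [1, 1, 1, 2, 2], "item": [10, 11, 12, 10, 13]})
train, test = split_last_n_per_user(data)
print(sorted(train["user"].unique().tolist()), len(test))  # [1, 2] 2
```

Every user appears in the training set by construction; items, however, can still end up test-only (here item 13 occurs only in the test split), which is exactly the gap the `move_oov_to_train` trick above addresses.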
I'm not sure that would serve my purpose. But correct me if I'm wrong!
Another question: if we can't cover all the users and items in training, do we still need to retrain the model with the same parameters on all the combined data (i.e. train + validation + test) before using the latest model for inference in production? I'm a bit confused about when to retrain to make sure all the items and users are covered.