RS_multi_modal_user_interactions

This repository contains the data and source code for Dataset and Models for Item Recommendation Using Multi-Modal User Interactions.

Requirements

Python
NumPy
Pandas
TensorFlow
Scikit-learn
Pickle
Matplotlib

Dataset

We publish a real-world dataset from the insurance domain with multi-modal user interactions that can be used in recommendation models. The dataset is anonymized.
Download the files: data_users.csv, data_conversations_keyword.csv, data_sessions.csv, data_purchase_events.csv, data_post_filter.csv
and the folder: data_conversations_embedding

Dataset Format

There are 6 different datasets.

data_users.csv

This data contains the users. Each user has had one or more purchase events with conversations and/or web sessions prior to that purchase. The data contains 5 columns:

user_id. The ID of a user.
purchase_event_id. The ID of a purchase event.
conversation_id. The ID of a conversation.
session_id. The ID of a web session.
event_number. A number specifying the order of conversations/web sessions.

data_conversations_keyword.csv

This data contains the conversations that the user had prior to the user's purchase event. Each conversation consists of multiple sentences represented with keywords. The data contains 4 columns:

conversation_id. The ID of a conversation.
sentence_number. A number specifying the order of sentences.
sentence_speaker. The speaker of the sentence (user or agent).
keywords. List with the IDs of the keywords in the sentence.

data_conversationsembedding(1-107).csv

This data is split into multiple files due to file size limitations. The data contains the conversations that the user had prior to the user's purchase event. Each conversation consists of multiple sentences represented with text embeddings. The data contains 771 columns:

conversation_id. The ID of a conversation.
sentence_number. A number specifying the order of sentences.
sentence_speaker. The speaker of the sentence (user or agent).
embedding_1 - embedding_768. Text embeddings computed with a pre-trained language-specific BERT model.

data_sessions.csv

This data contains the web sessions that the user made prior to the user's purchase event. Each web session consists of multiple actions. The data contains 3 columns:

session_id. The ID of a web session.
action_number. A number specifying the order of actions.
action_tags. List with the IDs of the section, object and type of an action.

data_purchase_events.csv

This data contains the purchase events. Each event consists of one or more item purchases made by the same user. The data contains 2 columns:

purchase_event_id. The ID of a purchase event.
item_id. The ID of an item.

data_post_filter.csv

This data contains the items that were possible for the user to buy at the time of the user's purchase event. The data contains 2 columns:

purchase_event_id. The ID of a purchase event.
item_id. The ID of an item.

Usage

Train and validate the models using
model_popular.py
model_conversation.py
model_session.py
model_knowledge_distillation.py
model_generative_imputation_step_1.py
model_generative_imputation_step_2.py
model_generative_imputation_step_3.py
model_neutral_imputation.py
model_keyword.py
model_latent_feature.py
model_relative_representation_step_1.py
model_relative_representation_step_2.py
model_relative_representation_step_3.py
Evaluate the models over the test set using
evaluation_popular.py
evaluation_conversation.py
evaluation_session.py
evaluation_late_fusion.py
evaluation_knowledge_distillation.py
evaluation_generative_imputation.py
evaluation_neutral_imputation.py
evaluation_keyword.py
evaluation_latent_feature.py
evaluation_relative_representation.py

simonebbruun / RS_multi_modal_user_interactions

readme