My solution for the Instacart Market Basket Analysis competition hosted on Kaggle.
The dataset is open-sourced by Instacart (source).
This anonymized dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the week and hour of day the order was placed, and a relative measure of time between orders.
Below is the full data schema (source):

orders (3.4m rows, 206k users):
- order_id: order identifier
- user_id: customer identifier
- eval_set: which evaluation set this order belongs in (see SET described below)
- order_number: the order sequence number for this user (1 = first, n = nth)
- order_dow: the day of the week the order was placed on
- order_hour_of_day: the hour of the day the order was placed on
- days_since_prior: days since the last order, capped at 30 (with NAs for order_number = 1)

products (50k rows):
- product_id: product identifier
- product_name: name of the product
- aisle_id: foreign key
- department_id: foreign key

aisles (134 rows):
- aisle_id: aisle identifier
- aisle: the name of the aisle

departments (21 rows):
- department_id: department identifier
- department: the name of the department

order_products__SET (30m+ rows):
- order_id: foreign key
- product_id: foreign key
- add_to_cart_order: order in which each product was added to cart
- reordered: 1 if this product has been ordered by this user in the past, 0 otherwise

where SET is one of the following evaluation sets (eval_set in orders):
- "prior": orders prior to that user's most recent order (~3.2m orders)
- "train": training data supplied to participants (~131k orders)
- "test": test data reserved for machine learning competitions (~75k orders)

The task is to predict which products a user will reorder in their next order. The evaluation metric is the F1-score between the set of predicted products and the set of true products.
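As a concrete reading of the metric, here is a minimal sketch of the F1-score between a predicted product set and the true product set for a single order. The function name and the product ids in the example are illustrative assumptions, not taken from the competition code.

```python
def f1_score(predicted, actual):
    """F1 between a predicted product set and the true product set.

    In this sketch, any case with no overlap (including two empty sets)
    scores 0.0.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    # float() keeps the division exact under Python 2.7 as well
    precision = true_positives / float(len(predicted))
    recall = true_positives / float(len(actual))
    return 2 * precision * recall / (precision + recall)


# Hypothetical product ids: two of the three predictions are correct.
example = f1_score([196, 14084, 12427], [14084, 12427, 26088])
```

With two correct products out of three predicted and three true, precision and recall are both 2/3, so the score is 2/3.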
The task was reformulated as a binary prediction task: given a user, a product, and the user's prior purchase history, predict whether or not the given product will be reordered in the user's next order. In short, the approach was to fit a variety of generative models to the prior data and use the internal representations from these models as features for second-level models.
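The binary reformulation can be pictured as one labeled row per (user, product) pair. The sketch below is only an illustration under assumed inputs: the function name, the in-memory data layout, and the toy product names are hypothetical, and the actual pipeline may build its examples differently.

```python
def build_training_rows(prior_purchases, next_order):
    """Turn purchase history into (user_id, product_id, label) rows.

    prior_purchases: iterable of (user_id, product_id) pairs taken from
        the user's prior orders (hypothetical layout).
    next_order: dict mapping user_id to the set of product_ids in that
        user's next order.
    """
    # Only products the user has bought before can be reordered,
    # so they form the candidate set; duplicates are collapsed.
    candidates = set(prior_purchases)
    rows = []
    for user_id, product_id in sorted(candidates):
        label = int(product_id in next_order.get(user_id, set()))
        rows.append((user_id, product_id, label))
    return rows


prior = [(1, "milk"), (1, "eggs"), (1, "milk"), (2, "bread")]
nxt = {1: {"milk", "tea"}, 2: set()}
rows = build_training_rows(prior, nxt)
```

Note that "tea" produces no row: user 1 never bought it before, so it cannot count as a reorder.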
The first-level models vary in their inputs, architectures, and objectives, resulting in a diverse set of representations.
The second-level models use the internal representations from the first-level models as features.
The final reorder probabilities are a weighted average of the outputs from the second-level models. The final basket is chosen by using these probabilities and choosing the product subset with maximum expected F1-score.
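The blending and basket-selection steps can be sketched as follows. This is not the repository's implementation: it assumes reorder events are independent across products, only considers baskets made of the k most probable products, and ignores the empty-basket case. All function names and the example probabilities are hypothetical.

```python
def blend(model_probs, weights):
    """Weighted average of per-product probabilities from several models."""
    total = float(sum(weights))
    return {
        product: sum(w * probs[product] for probs, w in zip(model_probs, weights)) / total
        for product in model_probs[0]
    }


def count_pmf(probs):
    """Distribution of the number of successes among independent events."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for count, mass in enumerate(pmf):
            nxt[count] += mass * (1.0 - p)
            nxt[count + 1] += mass * p
        pmf = nxt
    return pmf


def expected_f1_top_k(sorted_probs, k):
    """Expected F1 of predicting the first k products (independence assumed)."""
    hits = count_pmf(sorted_probs[:k])    # reordered products we predicted
    misses = count_pmf(sorted_probs[k:])  # reordered products we left out
    total = 0.0
    for a, pa in enumerate(hits):
        if a == 0:
            continue  # no true positives means F1 = 0
        for b, pb in enumerate(misses):
            # F1 = 2*tp / (|predicted| + |true|) = 2a / (k + a + b)
            total += pa * pb * 2.0 * a / (k + a + b)
    return total


def best_basket(probs):
    """Pick the top-k basket with the highest expected F1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    values = [p for _, p in ranked]
    best_k, best_score = 0, 0.0
    for k in range(1, len(ranked) + 1):
        score = expected_f1_top_k(values, k)
        if score > best_score:
            best_k, best_score = k, score
    return [name for name, _ in ranked[:best_k]], best_score


model_a = {"a": 0.95, "b": 0.7, "c": 0.1}
model_b = {"a": 0.85, "b": 0.9, "c": 0.1}
blended = blend([model_a, model_b], [1.0, 1.0])
basket, score = best_basket(blended)
```

In this toy example the blended probabilities are roughly 0.9, 0.8, and 0.1, and the two-product basket wins: adding the unlikely third product lowers expected precision more than it raises expected recall.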
64 GB RAM and 12 GB GPU (recommended), Python 2.7
Python packages: