maciejkula / spotlight

Deep recommender models using PyTorch.
MIT License

Evaluation of Implicit sequential model throws ValueError #161

Open impaktor opened 5 years ago

impaktor commented 5 years ago

Hi!

I'm trying to train an implicit sequential model on click-stream data, but as soon as I try to evaluate the trained model (e.g. with MRR, or precision and recall), it throws an error:

mrr = spotlight.evaluation.mrr_score(implicit_sequence_model, test, train)

ValueErrorTraceback (most recent call last)
<ipython-input-78-349343a26e9b> in <module>
----> 1 mrr = spotlight.evaluation.mrr_score(implicit_sequence_model, test, train)

~/.local/lib/python3.7/site-packages/spotlight/evaluation.py in mrr_score(model, test, train)
     45             continue
     46
---> 47         predictions = -model.predict(user_id)
     48
     49         if train is not None:

~/.local/lib/python3.7/site-packages/spotlight/sequence/implicit.py in predict(self, sequences, item_ids)
    316
    317         self._check_input(item_ids)
--> 318         self._check_input(sequences)
    319
    320         sequences = torch.from_numpy(sequences.astype(np.int64).reshape(1, -1))

~/.local/lib/python3.7/site-packages/spotlight/sequence/implicit.py in _check_input(self, item_ids)
    188
    189         if item_id_max >= self._num_items:
--> 190             raise ValueError('Maximum item id greater '
    191                              'than number of items in model.')
    192

ValueError: Maximum item id greater than number of items in model.

Perhaps the error is obvious, but I can't pinpoint what I'm doing wrong, so below I'll describe, as concisely as possible, what I'm doing.

Comparison of experimental with synthetic data

I tried generating synthetic data and using that instead of my experimental data, and then it works. This led me to compare the structure of the synthetic data with that of my experimental data:

Table 1: Synthetic data with N=100 unique users, M=1k unique items, and Q=10k interactions
user_id item_id timestamp
0 958 1
0 657 2
0 172 3
1 129 4
1 . 5
1 . 6
. . .
. . .
. . .
. . .
N . Q-2
N . Q-1
N 459 Q
Table 2: Experimental data, N=2.5M users, M=20k items, Q=14.8M interactions
user_id item_id timestamp
725397 3992 0
2108444 10093 1
2108444 10093 2
1840496 15616 3
1792861 16551 4
1960701 16537 5
1140742 6791 6
2074022 4263 .
2368959 19258 .
2368959 17218 .
. . .
. . Q-1
. . Q
  1. Both data sets have users indexed from [0..N-1], but my experimental data is not sorted on user_id, as the synthetic data is.

  2. Both data sets have item_ids indexed from [1..M], yet it only throws the "ValueError: Maximum item id greater than number of items in model." for my experimental data.

  3. I've re-shaped my timestamps to be just the data frame index after sorting on time, so this is also as in the synthetic data set. (Previously my timestamps were the event times in seconds since 1970, and some events were simultaneous, i.e. their order was arbitrary/degenerate.)

Code for processing the experimental data:

# pandas dataframe with unique string identifier for users ('session_id'), 
# and 'Article number' for item_id, and 'timestamp' for event
df = df.sort_values(by=['timestamp']).reset_index(drop=True)

# encode string identifiers for users and items to integer values:
from sklearn import preprocessing
le_usr = preprocessing.LabelEncoder() # user encoder
le_itm = preprocessing.LabelEncoder() # item encoder

# shift item_ids by +1 (but not user_ids):
item_ids = (le_itm.fit_transform(df['Article number']) + 1).astype('int32')
user_ids = (le_usr.fit_transform(df['session_id'])     + 0).astype('int32')

from spotlight.interactions import Interactions
implicit_interactions = Interactions(user_ids, item_ids, timestamps=df.index.values)

from spotlight.cross_validation import user_based_train_test_split, random_train_test_split
train, test = random_train_test_split(implicit_interactions, 0.2)
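
(A quick sanity check of the sizes the Interactions object infers — assuming it exposes num_users and num_items attributes, which I believe it does:)

# compare the inferred counts with the raw id ranges
print(implicit_interactions.num_users, user_ids.max() + 1)
print(implicit_interactions.num_items, item_ids.max() + 1)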

Code for training the model:

from spotlight.sequence.implicit import ImplicitSequenceModel
sequential_interaction = train.to_sequence()
implicit_sequence_model = ImplicitSequenceModel(use_cuda=True, n_iter=10, loss='pointwise', representation='pooling')
implicit_sequence_model.fit(sequential_interaction, verbose=True)

import spotlight.evaluation
mrr = spotlight.evaluation.mrr_score(implicit_sequence_model, test, train)

Questions on input format:

Here are some questions I thought might pinpoint the error, i.e. where my data might differ from the synthetic data set:

  1. Is there any purpose, or even harm, in including users with only a single interaction?

  2. Does the model allow a user to have multiple events with the same timestamp value?

  3. As long as the (user_id, item_id, timestamp) triplets pair up, does row ordering matter?

maciejkula commented 5 years ago

There is nothing obviously wrong with the code you posted; thanks for doing the analysis.

Before you start the evaluation routine on your real data, can you compare the number of items in your train and test data? They should be the same.
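
(Something along these lines should confirm it — a minimal check, assuming both splits are Interactions objects exposing num_items and item_ids:)

# the two splits should report identical item counts, and every item id
# that actually occurs should be strictly below that count
print(train.num_items, test.num_items)
print(train.item_ids.max(), test.item_ids.max())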

  1. I don't think it'll help much.
  2. Sure.
  3. No.

impaktor commented 5 years ago

Thanks for the fast reply!

Before you start the evaluation routine on your real data, can you compare the number of items in your train and test data? They should be the same.

They're the same as far as I can tell; this is the output after I've run random_train_test_split:

In [6]: test
Out[6]: <Interactions dataset (2517443 users x 20861 items x 2968924 interactions)>

In [7]: train
Out[7]: <Interactions dataset (2517443 users x 20861 items x 11875692 interactions)>

I've also tried both user_based_train_test_split() and random_train_test_split(), but the result always ends with the ValueError being thrown. I've tried the 'pointwise' and 'adaptive_hinge' losses, just to see if that would change anything, but naturally it made no difference; model training seems to work fine either way.

But indeed, for some reason the actual number of items is one less (20860, see below) than the interactions dataset reports (20861, see above):

In [8]: print(len(np.unique(item_ids)), min(item_ids), max(item_ids))
20860 1 20860

In [15]: len(item_ids) - (2968924 + 11875692)
Out[15]: 0

Is this somehow related to my adding +1 to all item_ids in the code of my original post? (repeated below)

# shift item_ids with +1 (but not user_ids):
item_ids = (le_itm.fit_transform(df['Article number']) + 1).astype('int32')

If I don't do this, I get a zero-indexed item vector, and that triggers an assert/error check, if I remember correctly.
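
For what it's worth, my guess is that the dataset infers num_items as item_ids.max() + 1 when it is not given explicitly (so that ids run from 0 up to the maximum), which would explain the 20861 vs 20860 discrepancy, since id 0 is never used after the +1 shift. A minimal illustration of that assumption:

import numpy as np

item_ids = np.array([1, 2, 3])               # three unique items, shifted to start at 1
num_items_inferred = item_ids.max() + 1      # 4: ids 0..3, but id 0 never occurs
print(len(np.unique(item_ids)), num_items_inferred)  # -> 3 4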

maciejkula commented 5 years ago

One explanation for why this would happen is if I didn't propagate the total number of items correctly across the train/test splits and the sequential-interaction conversion (the total number of items in the model must be higher than the maximum item id in both train and test). However, I don't see anything wrong with the code.

The invariant that needs to be upheld is train.num_items == test.num_items == model._num_items (and item_ids.max() < model._num_items).
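
(In code, roughly — a debugging sketch only, since _num_items is a private attribute:)

# all three counts must agree, and every id fed to predict() must be strictly below them
assert train.num_items == test.num_items == model._num_items
assert test.item_ids.max() < model._num_items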

I think that unless you can provide a snippet I can run that reproduces the same problem, I won't be able to help further.

(By the way, random train/test split doesn't make any sense for sequential models: use the user-based split.)
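
(For reference, that would be something like the following — assuming user_based_train_test_split takes the same test_percentage argument as random_train_test_split:)

from spotlight.cross_validation import user_based_train_test_split

# split by user, so each test user's entire history is held out together
train, test = user_based_train_test_split(implicit_interactions, test_percentage=0.2)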

impaktor commented 4 years ago

Hi @maciejkula

After 6 months, I've now revisited this, and I believe I know exactly how to trigger this bug.

(Quick recap of the above: evaluating my ImplicitSequenceModel worked with synthetic data, but not with my "real" data, where I got ValueError: Maximum item id greater than number of items in model., yet I've checked this on both train and test, and all indices look correct.)

Below is code that transforms the synthetic data to resemble my use case and reliably triggers the bug:

from spotlight.cross_validation import user_based_train_test_split
from spotlight.datasets.synthetic import generate_sequential
from spotlight.evaluation import sequence_mrr_score
from spotlight.evaluation import mrr_score
from spotlight.sequence.implicit import ImplicitSequenceModel

trigger_crash = True
if trigger_crash:
    n_items = 100
else:
    n_items = 1000

dataset = generate_sequential(num_users=1000,
                              num_items=n_items,
                              num_interactions=10000,
                              concentration_parameter=0.01,
                              order=3)

train, test = user_based_train_test_split(dataset)

train_seq = train.to_sequence()

model = ImplicitSequenceModel(n_iter=3,
                              representation='cnn',
                              loss='bpr')
model.fit(train_seq, verbose=True)

# this always works
test_seq = test.to_sequence()
mrr_seq = sequence_mrr_score(model, test_seq)
print(mrr_seq)

# using mrr_score (or precision_recall) with num_items < num_users
# triggers crash:
mrr = mrr_score(model, test)
print(mrr)

I.e. if num_items < num_users, neither mrr_score nor precision_recall_score works; however, sequence_mrr_score and sequence_precision_recall_score work fine.
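
If I read the original traceback correctly, the reason would be that mrr_score loops over user ids and passes each one to model.predict(), which for an ImplicitSequenceModel is interpreted as a sequence of item ids and checked against the model's item count, so any user id >= num_items raises the ValueError. Roughly (a sketch of my reading, using the model and test from the snippet above, not the actual library code):

# sketch: what the traceback suggests mrr_score effectively does per user
for user_id in range(test.num_users):
    # ImplicitSequenceModel.predict() treats its first argument as a sequence
    # of *item* ids, so user_id is validated against model._num_items here
    predictions = -model.predict(user_id)  # raises ValueError once user_id >= num_items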

My questions are:

  1. Am I wrong in trying to use the non-sequence_* versions of these evaluation metrics for an implicit sequence model?

  2. If so, is it just luck that they work when num_items > num_users?