hidasib / GRU4Rec

GRU4Rec is the original Theano implementation of the algorithm from the paper "Session-based Recommendations with Recurrent Neural Networks", published at ICLR 2016, and its follow-up, "Recurrent Neural Networks with Top-k Gains for Session-based Recommendations". The code is optimized for execution on the GPU.

Get predictions with predict_next_batch #9

Closed loretoparisi closed 6 years ago

loretoparisi commented 6 years ago

I'm trying to get predictions from the evaluated GRU model. My first attempt looks like this:

batch_size=100
iters = np.arange(batch_size).astype(np.int32)
in_idx = np.zeros(batch_size, dtype=np.int32)
predict_for_item_ids = None # no sampling
preds = gru.predict_next_batch(iters, in_idx, predict_for_item_ids, batch_size)
preds.fillna(0, inplace=True)

This closely follows the evaluation code for the case where it does not sample items:

out_idx = test_data.ItemId.values[start_valid+i+1]
if sampled_items:
    uniq_out = np.unique(np.array(out_idx, dtype=np.int32))
    preds = pr.predict_next_batch(iters, in_idx, np.hstack([items, uniq_out[~np.in1d(uniq_out,items)]]), batch_size)
else:
    preds = pr.predict_next_batch(iters, in_idx, None, batch_size) #TODO: Handling sampling?
preds.fillna(0, inplace=True)

I'm not sure, though, since the predict_next_batch function has this signature:

    def predict_next_batch(self, session_ids, input_item_ids, predict_for_item_ids=None, batch=100):

So I need to take session_ids and input_item_ids from the input dataset, right?
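
A minimal sketch of how the two arguments could be assembled from a session log, assuming (as the evaluation excerpt above suggests) that session_ids says which session each batch slot belongs to and input_item_ids holds the current item of each of those sessions. The Time column and the helper name last_items_per_session are hypothetical; only SessionId and ItemId appear in this thread.

```python
import numpy as np
import pandas as pd

def last_items_per_session(events, batch_size):
    """Return (session_ids, input_item_ids) for the first `batch_size` sessions.

    `events` needs SessionId, ItemId and Time columns (Time is an assumed name).
    """
    ordered = events.sort_values(["SessionId", "Time"])
    # The most recent event of each session is the item the next prediction follows.
    last = ordered.groupby("SessionId", sort=True)["ItemId"].last()
    last = last.iloc[:batch_size]
    return last.index.to_numpy(), last.to_numpy()

# Toy data, made up for illustration.
events = pd.DataFrame({
    "SessionId": [0, 0, 1, 2, 2, 2],
    "ItemId": [11, 12, 13, 11, 14, 15],
    "Time": [1, 2, 1, 3, 1, 2],
})
session_ids, input_item_ids = last_items_per_session(events, batch_size=3)
print(session_ids)
print(input_item_ids)
```

Because the network is recurrent, a faithful prediction would probably feed each session's events in order, one step per call, as the evaluation loop does. The sketch only illustrates the shape and meaning of the two arrays.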

loretoparisi commented 6 years ago

[UPDATE] With the code above I get an error:

Epoch0  loss: 0.979422
Measuring Recall@19 and MRR@19
Recall@20: 0.007334788364521092
MRR@20: 0.0011454641877874099
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/theano/compile/function_module.py", line 884, in __call__
    self.fn() if output_subset is None else\
IndexError: Index out of bounds.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_rsc15.py", line 54, in <module>
    preds = gru.predict_next_batch(iters, in_idx, predict_for_item_ids, batch_size)
  File "../../gru4rec.py", line 600, in predict_next_batch
    preds = np.asarray(self.predict(in_idxs)).T
  File "/usr/local/lib/python3.5/dist-packages/theano/compile/function_module.py", line 898, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/usr/local/lib/python3.5/dist-packages/theano/gof/link.py", line 325, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.5/dist-packages/theano/compile/function_module.py", line 884, in __call__
    self.fn() if output_subset is None else\
IndexError: Index out of bounds.
Apply node that caused the error: GpuAdvancedSubtensor1(<GpuArrayType<None>(float64, (False, False))>, GpuContiguous.0)
Toposort index: 9
Inputs types: [GpuArrayType<None>(float64, (False, False)), GpuArrayType<None>(int64, (False,))]
Inputs shapes: [(37483, 3), (50,)]
Inputs strides: [(24, 8), (8,)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[InplaceGpuDimShuffle{1,0}(GpuAdvancedSubtensor1.0)]]

That is, an IndexError: Index out of bounds.
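
One plausible cause, not confirmed in this thread: in_idx in the first attempt is filled with zeros, while item IDs in this dataset are large integers such as 214696432, so 0 is unlikely to be a real ItemId. If the model translates raw item IDs into row positions before the lookup, an unknown ID could end up as an invalid index. A minimal sketch of a pre-check, where find_unknown_items and known_item_ids are hypothetical names and the known IDs would come from the training data:

```python
import numpy as np

def find_unknown_items(input_item_ids, known_item_ids):
    """Return the input IDs that are absent from the known (training) item IDs."""
    input_item_ids = np.asarray(input_item_ids)
    mask = ~np.isin(input_item_ids, known_item_ids)
    return np.unique(input_item_ids[mask])

# Three item IDs taken from the printed predictions in this thread.
known_item_ids = np.array([214536500, 214536502, 214536506])
in_idx = np.zeros(4, dtype=np.int32)  # the placeholder input from the first attempt
print(find_unknown_items(in_idx, known_item_ids))
```

An empty result would mean every input ID is known, in which case the cause lies elsewhere.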

loretoparisi commented 6 years ago

I made some progress on selecting session_ids and input_item_ids:

    batch_size=10
    session_ids = valid.SessionId.values[0:batch_size]
    input_item_ids = valid.ItemId.values[0:batch_size]
    predict_for_item_ids = None

    print('session_ids: {}'.format(session_ids))
    print('input_item_ids: {}'.format(input_item_ids))
    print('uniq_out: {}'.format(uniq_out))
    print('predict_for_item_ids: {}'.format(predict_for_item_ids))

    preds = gru.predict_next_batch(session_ids, input_item_ids, predict_for_item_ids, batch_size)
    preds.fillna(0, inplace=True)
    print('Preds: {}'.format(preds))

So far, building predict_for_item_ids as follows gives me a dimension error:

out_idx = valid.ItemId.values[0:batch_size]
uniq_out = np.unique(np.array(out_idx, dtype=np.int32))
predict_for_item_ids = np.hstack([data, uniq_out[~np.in1d(uniq_out,data)]])    

so I'm just using predict_for_item_ids=None.

Given this I get

session_ids: [0 1 2 3 4 5 6 7 8 9]
input_item_ids: [214696432 214857030 214858854 214858854 214836819 214696434 214857570
 214858847 214859094 214690730]
uniq_out: [214690730 214696432 214696434 214836819 214857030 214857570 214858847
 214858854 214859094]
predict_for_item_ids: None

and a prediction DataFrame of [37483 rows x 10 columns], which looks like this:

Preds:                   0         1         2         3         4         5  \
214536502  0.016009  0.016079  0.005016  0.005016  0.029079  0.022866   
214536500  0.010071  0.010108  0.004129  0.004129  0.017137  0.013778   
214536506  0.008890  0.008934  0.001951  0.001951  0.017141  0.013218 

I'm not sure how to interpret the dimensions of this output, actually.
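
On the dimension error from np.hstack earlier in this comment: one possible explanation, assuming data is the whole training table rather than a 1-D array of item IDs, is that np.hstack cannot join a 2-D table with the 1-D uniq_out. A sketch that builds the candidate list from 1-D arrays only; candidate_items is a hypothetical helper and train_item_ids stands in for something like the unique ItemId values of the training set:

```python
import numpy as np

def candidate_items(train_item_ids, out_idx):
    """Training items followed by any target items not already among them."""
    train_item_ids = np.asarray(train_item_ids).ravel()
    uniq_out = np.unique(np.asarray(out_idx))
    extra = uniq_out[~np.isin(uniq_out, train_item_ids)]
    return np.hstack([train_item_ids, extra])

# Toy inputs reusing item IDs that appear in this thread.
train_item_ids = np.array([214536502, 214536500, 214536506])
out_idx = np.array([214536500, 214696432, 214696432])
predict_for_item_ids = candidate_items(train_item_ids, out_idx)
print(predict_for_item_ids)
```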

jellchou commented 6 years ago

@loretoparisi

Preds:                   0         1         2         3         4         5  \
214536502  0.016009  0.016079  0.005016  0.005016  0.029079  0.022866
214536500  0.010071  0.010108  0.004129  0.004129  0.017137  0.013778
214536506  0.008890  0.008934  0.001951  0.001951  0.017141  0.013218

Each column name is that session's position within the batch. You need to rename the columns to the corresponding session_id values.
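
The renaming could look like the following sketch. The scores are the first three columns of the printed output above; the session IDs 10, 11 and 12 are made up so that the relabeling is visible, and top_k_per_session is a hypothetical helper, not part of GRU4Rec.

```python
import numpy as np
import pandas as pd

def top_k_per_session(preds, session_ids, k):
    """Label each column with its session ID and list the k best-scoring items."""
    labeled = preds.copy()
    labeled.columns = session_ids  # batch position -> session ID
    return {sid: labeled[sid].nlargest(k).index.tolist() for sid in labeled.columns}

preds = pd.DataFrame(
    [[0.016009, 0.016079, 0.005016],
     [0.010071, 0.010108, 0.004129],
     [0.008890, 0.008934, 0.001951]],
    index=[214536502, 214536500, 214536506],  # rows are item IDs
)
session_ids = np.array([10, 11, 12])  # made-up IDs, one per batch column
result = top_k_per_session(preds, session_ids, k=2)
print(result)
```

Rows stay indexed by item ID, so each column can be sorted on its own to get a ranked item list for that session.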

loretoparisi commented 6 years ago

@jellchou thank you!