
fastFM: A Library for Factorization Machines
http://ibayer.github.io/fastFM

Expected behavior of bpr.FMRecommender on a trivial case #71

Open abirkmanis opened 8 years ago

abirkmanis commented 8 years ago

I really hope I am not missing something obvious here. I am using fastFM (0.2.6) on macOS.

import numpy as np
import scipy as sp
import scipy.sparse   # make sure sp.sparse is available
from fastFM import bpr

# learn matching two binary features

# one-hot encoding
x=sp.sparse.csc_matrix(np.array([
    [ 0.,  1.,  0.,  1.],
    [ 1.,  0.,  1.,  0.],
    [ 0.,  1.,  1.,  0.],
    [ 1.,  0.,  0.,  1.]
    ]))

# first two observations are better than the last two
compares=np.array([
    [  0.,   2.],
    [  0.,   3.],
    [  1.,   2.],
    [  1.,   3.]
    ])

# fit
fm=bpr.FMRecommender(n_iter=1000,
   init_stdev=0.01, l2_reg_w=.5, l2_reg_V=.5, rank=2,
   step_size=.001, random_state=11)
fm.fit(x,compares)

# predict
p=fm.predict(x)

# expecting 0 and 1 before 2 and 3
print(np.argsort(-p))
print(p)

What I get instead is:

[3 0 1 2]
[-0.0020049  -0.00487495 -0.00704542 -0.00022524]
abirkmanis commented 8 years ago

I've just realized that n_iter=1000 is not sufficient for step_size=.001 with l2_reg_V=.5. With n_iter=1000, l2_reg_V=.1, and step_size=.01 it works as expected. Is it possible to give some recommendations on parameters? Could the model somehow report if it is far from convergence?

Also, I would expect it to learn w close to zero and V positive/negative for odd features and negative/positive for even features (or vice versa), but no matter how I fiddle with parameters w does not go to 0, and V is often NOT the checkerboard pattern I would expect.

ibayer commented 8 years ago

Is it possible to have some recommendations on parameters?

Sorry, the parameters depend very much on the data set.

Could the model somehow report if it's far from convergence?

The problem is non-convex and I'm not aware of another property that could serve as a convergence criterion. Your best bet is to monitor the loss on the training data.
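
A crude way to do that with the snippet above (refitting from scratch at a few values of n_iter and checking how many training pairs are already ordered correctly; pairwise_accuracy is just a helper defined here, not part of fastFM):

import numpy as np
from fastFM import bpr

# helper (not a fastFM function): fraction of training pairs ordered correctly
def pairwise_accuracy(model, X, pairs):
    scores = model.predict(X)
    better = pairs[:, 0].astype(int)
    worse = pairs[:, 1].astype(int)
    return np.mean(scores[better] > scores[worse])

# refit with increasing n_iter and watch the training ranking accuracy;
# x and compares are from the example at the top of the thread
for n_iter in (100, 500, 1000, 5000):
    fm = bpr.FMRecommender(n_iter=n_iter, init_stdev=0.01,
                           l2_reg_w=.5, l2_reg_V=.5, rank=2,
                           step_size=.001, random_state=11)
    fm.fit(x, compares)
    print(n_iter, pairwise_accuracy(fm, x, compares))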

Also, I would expect it to learn w close to zero and V positive/negative for odd features and negative/positive for even features (or vice versa), but no matter how I fiddle with parameters w does not go to 0, and V is often NOT the checkerboard pattern I would expect.

Can you expand this a bit more?

abirkmanis commented 8 years ago

Can you expand this a bit more?

Sure. I meant that in the training data the target is maximized iff feature 0 has the same value as feature 2 and feature 1 has the same value as feature 3 (features 0 and 1 are a one-hot encoding of one binary choice, as are 2 and 3, and the goal is to coordinate the two choices).

One natural value of V is to have [+a, -a] for both even features and [-a, +a] for both odd features, so that both even-by-even and odd-by-odd products are high while odd-by-even products are low (obviously I am not looking for an exact match of values, just for this pattern of signs). This kind of V looks like a checkerboard, but I get it very infrequently; maybe the basin of this optimum is too narrow and easy to miss, no idea. But even when I do hit this kind of V, w still does not go to 0, even when its L2 regularization is much higher than that of V_.

Here is about the best I could get:

fm=bpr.FMRecommender(n_iter=100,
   init_stdev=0.01, l2_reg_w=1, l2_reg_V=0, rank=2,
   step_size=.1, random_state=11)
fm.fit(x,compares)

# predict
p=fm.predict(x)

# expecting 0 and 1 before 2 and 3
print(np.argsort(-p))
print(p)
print(fm.V_)
print(fm.w_)
[1 0 2 3]
[ 0.62021172  0.70173661 -0.66102439 -0.66245436]
[[ 0.55268345 -0.54032328  0.54973946 -0.5434598 ]
 [-0.60897218  0.59541005 -0.61684187  0.58755395]]
[ 0.00898714 -0.00999153  0.01327802 -0.01327625]

V has the checkerboard pattern, but w is not quite at 0, even though l2_reg_w=1 and l2_reg_V=0. Increasing n_iter helps only so much. And again, this is the simplest use case I can imagine; I would hate to need n_iter over 1000 for realistic cases.

ibayer commented 7 years ago

Okay, let's translate your numeric example into plain words.

We have a dataset with pairs of context and item. The context is described by a color (blue/red), which we one-hot encode (the first two features in the design matrix). The items are described by their size (small/big), also one-hot encoded (feature columns 3 and 4).

This is matrix x from above.

[(red  | big),
 (blue | small),
 (red  | small),
 (blue | big)]

We further have the comparisons from the compares matrix above.

(red | big) > (red  | small)
(red | big) > (blue | big)
(blue | small) > (red | small)
(blue | small) > (blue | big)

Let us use V(blue) to denote the latent representation of the feature blue. Since we are learning a comparison function, we therefore expect higher values for the dot products <V(red), V(big)> and <V(blue), V(small)> than for <V(red), V(small)> and <V(blue), V(big)>.

Solution 1: signed checker-board

for rank = 2

V(blue)  = [+ -]
V(red)   = [- +]
V(small) = [+ -]
V(big)   = [- +]

this seems to work

<V(red), V(big)>    = (- * -) + (+ * +) = two positive terms
<V(red), V(small)>  = (- * +) + (+ * -) = two negative terms
<V(blue), V(big)>   = (+ * -) + (- * +) = two negative terms
<V(blue), V(small)> = (+ * +) + (- * -) = two positive terms

Solution 2: orthogonal checker-board

for rank = 2

V(blue)  = [0 1]
V(red)   = [1 0]
V(small) = [0 1]
V(big)   = [1 0]

this seems to work too

<V(red), V(big)>    = 1 + 0
<V(red), V(small)>  = 0 + 0
<V(blue), V(big)>   = 0 + 0
<V(blue), V(small)> = 0 + 1
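
A quick numpy check of both candidate solutions (just the dot-product arithmetic above, not a fitted model; ignoring w and bias terms, the FM score of a row with two active one-hot features reduces to the dot product of their latent vectors):

import numpy as np

# latent vectors in the order blue, red, small, big
V_signed = np.array([[ 1., -1.],   # blue
                     [-1.,  1.],   # red
                     [ 1., -1.],   # small
                     [-1.,  1.]])  # big
V_orthogonal = np.array([[0., 1.],   # blue
                         [1., 0.],   # red
                         [0., 1.],   # small
                         [1., 0.]])  # big

for V in (V_signed, V_orthogonal):
    blue, red, small, big = V
    # preferred rows (red|big) and (blue|small) should score higher
    # than the dispreferred rows (red|small) and (blue|big)
    print(red @ big, blue @ small, red @ small, blue @ big)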

I would say we have two attraction points. That could explain why you don't see the signed checker-board often, especially considering that the orthogonal checker-board might be favoured by the L2 norm regularization. I'm ignoring w to simplify the discussion (which corresponds to choosing a very high value for l2_reg_w).

I'm not sure this small example can tell us much about the expected convergence rate on a real dataset.

ibayer commented 7 years ago

@abirkmanis Can we change the title to:

Expected behavior of bpr.FMRecommender on a trivial case

rheras commented 7 years ago

Hi all, this issue/thread is very useful and clarifying; there really are not many examples around using the fastFM library.

My question, after reviewing the examples above, is how to build the 'compares' matrix. I mean, in one of the examples the compares matrix has been defined as:

compares=np.array([ [ 0., 2.], [ 0., 3.], [ 1., 2.], [ 1., 3.] ])

and later it is passed to the fit method:

fm.fit(x,compares)

I have read some papers/websites that talk about the elements of the matrix and how to compare one item with the next, but I assume there is an algorithm, a scipy method, or something similar for building such a matrix, isn't there? I would be grateful if somebody could help me by posting a reference or example about it,

thanks in advance, regards.
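
For reference, one simple way to build such a compares matrix from per-row relevance scores (build_compares is a hypothetical helper, not part of fastFM; the pair format [i, j] means row i of the design matrix should be ranked above row j, as in the example at the top of the thread):

import numpy as np

def build_compares(scores):
    # emit a pair [i, j] for every i that should rank above j
    scores = np.asarray(scores)
    pairs = [[i, j]
             for i in range(len(scores))
             for j in range(len(scores))
             if scores[i] > scores[j]]
    return np.array(pairs, dtype=np.float64)

# toy example from the top of the thread: rows 0 and 1 are better than rows 2 and 3
print(build_compares([1, 1, 0, 0]))
# [[0. 2.]
#  [0. 3.]
#  [1. 2.]
#  [1. 3.]]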

yaro-slawa commented 6 years ago

Hello everyone,

I'd like to join the comment above and ask how to generate the compares matrix. I thought I could apply fastFM's ranking module to an implicit-feedback recommender task, where I have only positive examples and no generated negative examples. But I am confused by this compares matrix. Thank you for any help.
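
For the positives-only case, the usual approach in the BPR literature is to sample negatives: each observed (positive) row is paired with rows that were not observed, meaning the positive row should be ranked above the sampled one. A minimal sketch, assuming pos_rows holds the indices of the positive rows in the design matrix (sample_compares is a hypothetical helper, not part of fastFM):

import numpy as np

def sample_compares(n_rows, pos_rows, n_neg_per_pos=5, seed=0):
    # pair every positive row with a few sampled non-positive rows;
    # each pair [p, n] means "row p is preferred over row n"
    rng = np.random.RandomState(seed)
    pos_set = set(pos_rows)
    neg_rows = np.array([r for r in range(n_rows) if r not in pos_set])
    pairs = []
    for p in pos_rows:
        for n in rng.choice(neg_rows, size=n_neg_per_pos):
            pairs.append([p, n])
    return np.array(pairs, dtype=np.float64)

# e.g. 100 rows in the design matrix, rows 0-9 observed as positive feedback
compares = sample_compares(100, range(10))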