lyst / lightfm

A Python implementation of LightFM, a hybrid recommendation algorithm.
Apache License 2.0

Different latent vectors for same (test)users #543

Open momassimo opened 4 years ago

momassimo commented 4 years ago

Hi,

I created a model for a retailer with 38k customers and 36k articles and a sparsity of 0.55%. One part of the analysis is to find similar customers; to get the "neighbours", I took the users' latent features and computed the dot product between them (similar to the LightFM example). In order to get a better understanding of my model, I created some test customers and calculated their similarity. Two of those test customers are (apart from the name) exactly the same: both bought the same (one) article once and are in the same industry (a user feature). In my understanding they should have the same vectors and therefore a dot product of 1. But unfortunately the vectors are different and the dot product is -0.057281606.

Does somebody have an explanation for how this can happen?

Thanks in advance!

Best, Moritz

These are the normalized vectors:

Test user 1:

[ 0.13190027  0.26906827 -0.0822762   0.21407925  0.13211617  0.2566751
  0.17467268  0.02340734  0.09154253  0.2269812  -0.26795995  0.06671422
  0.08172801  0.1463228   0.21353354  0.12667963 -0.02653628 -0.0790253
  0.03541145  0.09163333  0.05831769  0.4006284   0.14730851  0.28267866
 -0.05757256  0.1948472  -0.08183019  0.28852767  0.09479482  0.30822176]

Test user 2:

[-0.15500401  0.2269293   0.21974853 -0.02541173 -0.16325705 -0.13497874
 -0.17643186  0.09408431  0.04687239  0.20745914  0.27600515  0.02616096
 -0.24575633 -0.27663186 -0.10878677  0.27803454  0.08667072  0.06445353
  0.20262392  0.1274841   0.30217353 -0.04354052  0.29860505  0.30625728
  0.0359767  -0.15467772 -0.09467538 -0.12735379 -0.20820434  0.06034918]
import numpy as np

customer1 = "T0000001"
customer2 = "T0000004"

# Only the first num_user rows of user_embeddings are per-user (identity)
# embeddings; the remaining rows belong to the extra user features.
num_user = dataset.interactions_shape()[0]

# Map customer numbers to internal row indices.
user_x1 = mappings.kundennummer2row[customer1]
user_x2 = mappings.kundennummer2row[customer2]

# L2-normalize each embedding so the dot product is cosine similarity.
user_embeddings_norm = (model.user_embeddings[:num_user].T
                        / np.linalg.norm(model.user_embeddings[:num_user], axis=1)).T

similarity = np.dot(user_embeddings_norm[user_x2], user_embeddings_norm[user_x1])

That's how I built the dataset:

from lightfm.data import Dataset

dataset = Dataset()

dataset.fit(items=artikel_meta["Artikelnummer"],
            users=kunden_meta["Hauptkundennummer"],
            item_features=artikel_meta["Warengruppe"].unique(),
            user_features=kunden_meta["Branchenschlüssel"].unique())

# One interaction per (customer, article) pair, weighted by revenue.
(interactions, weights) = dataset.build_interactions(
    [(x["Hauptkundennummer"], x["Artikelnummer"], x["Kernumsatz"])
     for index, x in sales_data_2019_grouped.iterrows()])

def prepare_features_format(data, id_column, feature_columns):
    # Build the (id, [feature, ...]) pairs expected by build_*_features().
    features = []
    for row in range(data.shape[0]):
        features.append((data[id_column].iloc[row],
                         [str(data[feature].iloc[row]) for feature in feature_columns]))
    return features

item_features = dataset.build_item_features(
    prepare_features_format(artikel_meta, "Artikelnummer", ["Warengruppe"]))
user_features = dataset.build_user_features(
    prepare_features_format(kunden_meta, "Hauptkundennummer", ["Branchenschlüssel"]))

Some translations: "Artikelnummer" = "article number", "Hauptkundennummer" = "(main) customer number", "Warengruppe" = "product group", "Branchenschlüssel" = "industry code", "Kernumsatz" = "(core) revenue".

EthanRosenthal commented 4 years ago

User and item latent vectors are initialized with random numbers, so there is no guarantee that two users who interacted with the same items will have the same latent vectors. Even if they were initialized with the same latent vectors, lightfm performs stochastic gradient descent with a batch size of 1. Various parameters can change between batches, like the effective learning rate and the item latent vectors, so the two users' latent vectors will get updated differently during SGD.
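Here is a minimal sketch with toy data (two identical users, two items; the parameters are illustrative, not taken from this thread) showing the effect:

import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM

# Two users with identical interactions: both interacted with item 0 only.
interactions = coo_matrix(np.array([[1, 0],
                                    [1, 0]], dtype=np.float32))

model = LightFM(no_components=30, random_state=42)
model.fit(interactions, epochs=10)

# Even though the users are indistinguishable, their embeddings start
# from different random vectors and therefore receive different updates.
u0 = model.user_embeddings[0] / np.linalg.norm(model.user_embeddings[0])
u1 = model.user_embeddings[1] / np.linalg.norm(model.user_embeddings[1])
print(np.dot(u0, u1))  # typically far from 1.0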

momassimo commented 4 years ago

Hi Ethan,

thanks so much for the fast and helpful reply; that answers my question perfectly.

One issue I still have: is there any way to interpret the embeddings? I have, for example, a case where the model says two customers are very "near" each other, but when I take a closer look at those customers, they have just one purchased product in common and their user features are not the same either. An interpretation would be really helpful for giving more insight into the question "why are these customers considered similar?"

thanks in advance!

Moritz

EthanRosenthal commented 4 years ago

Unfortunately, interpreting the embeddings is pretty difficult, and I can't offer much advice beyond what you've already looked at (i.e. products and user features in common).

One thing to make sure of when using user features is that you're comparing two users' similarity by their user representation, and not their user embedding. In case you don't know, the LightFM user representation is the sum of all of the user's user feature embeddings (including the user's user identity feature, if you are using that).

If you're using the user representation and still finding that two users who are nearest neighbors have few features and products in common, then maybe the model is poorly tuned?
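As a rough sketch, assuming the model and the user_features matrix built above (variable names reused from the earlier snippets), the comparison could look like this:

import numpy as np

# Full user representations: each row is the (weighted) sum of that
# user's feature embeddings, including the identity feature.
_, user_repr = model.get_user_representations(features=user_features)

# Normalize rows so that the dot product is cosine similarity.
user_repr_norm = user_repr / np.linalg.norm(user_repr, axis=1, keepdims=True)

similarity = np.dot(user_repr_norm[user_x1], user_repr_norm[user_x2])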

momassimo commented 4 years ago

Thanks! As far as I understood the library, with get_user_representations you also get the user_embeddings (and the user_biases); aren't those the same as model.user_embeddings? Or did I misunderstand you?

EthanRosenthal commented 4 years ago

Ah, I left something out of my explanation. You must provide the user feature matrix as an argument to get_user_representations(). While you can also use model.user_embeddings, you have to make sure that you add up all of the user's feature embeddings prior to calculating similarity with other users. I'm also not sure whether you missed this or not, so I'll walk through an example below just in case it's helpful!

Imagine you use the Dataset class to build both your interactions matrix and your user and item features. You set user_identity_features=True, and you have two other user features: device_is_ios and device_is_android, and each of these features can be 1 or 0.

If you build your user feature matrix, it will have shape (num_users, num_user_features), where num_user_features = num_users + 2. This is because you are building a unique user feature for each user, as well as the 2 extra features. This also means that your user_embeddings matrix will have shape (num_user_features, num_components). That is, you get an embedding for each unique user feature and for the extra device_is_* features.

So, when you want to calculate a user's "representation" in order to calculate similarity, you need to add up both the user's unique embedding and their device_is_* embedding together.
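A sketch of that bookkeeping, under the assumptions of this example (user_features being the matrix returned by Dataset.build_user_features):

import numpy as np

num_users, num_user_features = user_features.shape
# With identity features plus the two device features:
# num_user_features == num_users + 2
assert model.user_embeddings.shape[0] == num_user_features

# Summing each user's feature embeddings, weighted by the user feature
# matrix, reproduces what get_user_representations() returns.
manual_repr = user_features @ model.user_embeddings
_, repr_from_model = model.get_user_representations(features=user_features)
assert np.allclose(manual_repr, repr_from_model)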

maciejkula commented 4 years ago

Thanks for all the explanations, Ethan!

EthanRosenthal commented 4 years ago

:raised_hands:

MPADAB commented 4 years ago

@EthanRosenthal Thank you, Ethan! When I am using item_features, the model has lower precision compared to pure CF. In your comment you mention device_is_*: does it have to be in one-hot format? For example, my items data:

      article_id section_primary      writer_name
0      1.9134852         culture  אפרת רובינשטיין
1      1.9141164         culture       אורון שמיר
2      1.9179619         culture      דייב איצקוף

So I am building the features as:

item_features = dataset.build_item_features(
    [(i.article_id, [i.section_primary, i.writer_name])
     for i in items.itertuples()])

Thanks!