Open HenrikRehbinder opened 3 years ago
LightFM uses categorical features, so you should convert numerical features to categorical ones and use them as "tags" for items or users. Here is a simple toy example of a data preparation/training pipeline:
from lightfm.data import Dataset
# your user ids
user_ids = [0,1]
# all available features
user_feature_list = [f"age_{age}" for age in range(30, 40)]
user_feature_list += ["male", "female"]
# your mappings user ids on features
user_features = [
    (0, {"age_35", "male"}),
    (1, {"age_31", "female"}),
]
# your item ids
item_ids = [1,2,3,4]
# your features
item_features_list = [f"feature_{f}" for f in range(10)]
item_features = [
    (1, {"feature_0", "feature_1"}),
    (2, {"feature_1", "feature_2"}),
    (3, {"feature_0", "feature_3"}),
    (4, {"feature_9", "feature_1"}),
]
# create the Dataset object
# (item_identity_features=False: items are described only by their features;
#  user identity features are left at their default, True)
dataset = Dataset(item_identity_features=False)
# add mappings
dataset.fit_partial(users=user_ids)
dataset.fit_partial(items=item_ids)
dataset.fit_partial(user_features=user_feature_list)
dataset.fit_partial(item_features=item_features_list)
# build the sparse feature matrices
dataset_user_features = dataset.build_user_features(user_features)
dataset_item_features = dataset.build_item_features(item_features)
# build interaction matrices (pairs of (user_id, item_id))
x_train = dataset.build_interactions([
(0, 1),
(1, 3),
(1, 4)
])
# now you can retrieve the mappings
[
dataset_user_id_mapping,
dataset_user_feature_mapping,
dataset_item_id_mapping,
dataset_item_feature_mapping
] = dataset.mapping()
"""
(outputs pasted from a Jupyter notebook)
# mapping of your user ids to the internal representation
dataset_user_id_mapping = {0: 0, 1: 1}
# mapping of user features to the internal representation (the keys 0 and 1 appear here because we left the user_identity_features param of the Dataset class at its default, True)
dataset_user_feature_mapping = {0: 0,
1: 1,
'age_30': 2,
'age_31': 3,
'age_32': 4,
'age_33': 5,
'age_34': 6,
'age_35': 7,
'age_36': 8,
'age_37': 9,
'age_38': 10,
'age_39': 11,
'male': 12,
'female': 13}
dataset_item_id_mapping = {1: 0, 2: 1, 3: 2, 4: 3}
dataset_item_feature_mapping = {'feature_0': 0,
'feature_1': 1,
'feature_2': 2,
'feature_3': 3,
'feature_4': 4,
'feature_5': 5,
'feature_6': 6,
'feature_7': 7,
'feature_8': 8,
'feature_9': 9}
dataset_user_features (sparse matrix with 2 rows - users - and columns - user features; the columns correspond to dataset_user_feature_mapping)
<2x14 sparse matrix of type '<class 'numpy.float32'>'
with 6 stored elements in Compressed Sparse Row format>
dataset_item_features (same structure as dataset_user_features; since item_identity_features=False there are no identity columns)
<4x10 sparse matrix of type '<class 'numpy.float32'>'
	with 8 stored elements in Compressed Sparse Row format>
x_train (a tuple of two sparse matrices: the first - interactions, the second - sample weights)
(<2x4 sparse matrix of type '<class 'numpy.int32'>'
with 3 stored elements in COOrdinate format>,
<2x4 sparse matrix of type '<class 'numpy.float32'>'
with 3 stored elements in COOrdinate format>)
"""
# then we create the model and train it on the toy data
from lightfm import LightFM
model = LightFM(no_components=32)
model.fit_partial(
x_train[0],
item_features=dataset_item_features,
user_features=dataset_user_features,
)
"""
as a result we have trained:
model.item_embeddings - a matrix of latent vectors with shape (10, 32), i.e. one latent vector per item feature
model.user_embeddings - a matrix of latent vectors with shape (14, 32)
"""
When we predict scores for users and items, the model takes the dot product of the user representation and the item representation. When an item has multiple features, its representation is the weighted sum of the latent vectors of those features (see the method model.get_item_representations()), where the feature weights are dataset_item_features from my toy example and item_embeddings holds the latent vector of each feature.
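For example, something like this should roughly reproduce what the library computes, reusing the matrices from the toy example above (a sketch, not the exact library code):
# get_*_representations() returns (biases, feature-weighted latent vectors)
item_biases, item_repr = model.get_item_representations(features=dataset_item_features)
user_biases, user_repr = model.get_user_representations(features=dataset_user_features)
# score of internal user 0 for internal item 0: dot product plus biases
score = user_repr[0] @ item_repr[0] + user_biases[0] + item_biases[0]
# this should match:
# model.predict(0, [0],
#               user_features=dataset_user_features,
#               item_features=dataset_item_features)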
By the way, each feature is represented by a latent vector of no_components dimensions. If you have your own latent vectors, I think you can "hack" things by assigning your doc2vec vectors to the item_embeddings attribute of the LightFM model (you should be careful with the mappings in your dataset, and remember that the embeddings will change again after training). I don't know whether it will work, but you can try.
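A completely untested sketch of that hack (my_doc2vec_vectors is a hypothetical dict {feature_name: vector of length no_components}):
# overwrite the latent vectors of some item features with your own
# pre-trained vectors; each vector must have length no_components (32 here)
for feature_name, vector in my_doc2vec_vectors.items():
    idx = dataset_item_feature_mapping[feature_name]
    model.item_embeddings[idx] = vector
# note: any further fit/fit_partial call will update these vectors again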
I think there is nothing more to be done about numerical features here, as they are not supported.
Thanks, I understand that I can create categorical features. My thinking was that, similar to weighting a categorical feature with 0 or 1, I could theoretically weight a non-categorical feature (say "age") with the actual age. In such a case the float weight should probably be normalized to roughly 0 to 1 or -1 to 1.
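Something like this is what I had in mind (just a sketch reusing the Dataset from the example above; I don't know whether this is how it is meant to be used):
# declare a single numeric "age" feature instead of age buckets
dataset.fit_partial(user_features=["age"])
# pass a real-valued weight per (user, feature) instead of a 0/1 indicator;
# normalize=False keeps the weights as given
dataset_user_features = dataset.build_user_features(
    [
        (0, {"age": 35 / 100.0}),  # age scaled to roughly 0..1
        (1, {"age": 31 / 100.0}),
    ],
    normalize=False,
)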
Hi,
By looking at the source code, it looks like the user/item features can take any real number as their value, and that value is multiplied by the estimated embedding vector for that feature (see https://github.com/lyst/lightfm/blob/d05289982928f81b957c1f0ff63dcaee0e915f3b/lightfm/_lightfm_fast.pyx.template#L315).
Whether or not you actually want real-valued features in the model is up to you (you can try what works best).
With that said, it would be better if the developers confirmed this.
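In other words, roughly (a toy numpy illustration, not the library's actual code):
import numpy as np

# a representation is the weight-scaled sum of the feature embeddings,
# so a real-valued feature simply scales its embedding
no_components = 4
emb_age = np.random.rand(no_components)   # embedding learned for an "age" feature
emb_male = np.random.rand(no_components)  # embedding learned for a "male" feature
user_representation = 0.35 * emb_age + 1.0 * emb_male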
Thanks a lot for the reply! It is appreciated.
Henrik
I'd like to use, for example, age as a user feature, or the components of e.g. a Doc2Vec vector as item features. Take Doc2Vec as an example. Say I have 5 components, docvec = [1.2, -1.0, 3.4, -4.9, 0.0] for the first item, [2.2, ...] for the second, and so on. I can create columns so that my item feature sparse matrix is item = [[0, 0, 1.2], [0, 1, -1.0], [0, 2, 3.4], [0, 3, -4.9], [0, 4, 0.0], [1, 0, 2.2], [1, 1, ...]]. The matrix is really not sparse, but it is represented as such anyway. This works in the sense that fit() gives a result, but it takes some time. My question is: is this supposed to work? The article doesn't say explicitly yes or no. Any help would be appreciated.
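In code, what I mean is roughly this (docvecs is a hypothetical dict of item id to Doc2Vec vector, and the d2v_* feature names are just placeholders):
n_dims = 5
# one feature column per Doc2Vec component
dataset.fit_partial(item_features=[f"d2v_{i}" for i in range(n_dims)])
dataset_item_features = dataset.build_item_features(
    (
        (item_id, {f"d2v_{i}": value for i, value in enumerate(vector)})
        for item_id, vector in docvecs.items()
    ),
    normalize=False,  # keep the raw component values as weights
)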