rixwew / pytorch-fm

Factorization Machine models in PyTorch
MIT License
1.04k stars 225 forks

Why doesn't the FM model consider the data value? #35

Open Amoshen opened 2 years ago

Amoshen commented 2 years ago

In FeaturesLinear

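import numpy as np
import torch


class FeaturesLinear(torch.nn.Module):

    def __init__(self, field_dims, output_dim=1):
        super().__init__()
        # One weight row per feature across all fields
        self.fc = torch.nn.Embedding(sum(field_dims), output_dim)
        self.bias = torch.nn.Parameter(torch.zeros((output_dim,)))
        # np.int64 instead of np.long, which is not available in every NumPy release
        self.offsets = np.array((0, *np.cumsum(field_dims)[:-1]), dtype=np.int64)

    def forward(self, x):
        """
        :param x: Long tensor of size ``(batch_size, num_fields)``
        """
        # Shift each per-field index into the shared embedding table
        x = x + x.new_tensor(self.offsets).unsqueeze(0)
        return torch.sum(self.fc(x), dim=1) + self.bias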

The model only assigns a weight to each feature through the embedding, but it does not take the data value into account. For example, given the sample [[1,0]], the model computes embedding("1") + embedding("0"), whereas the linear term should be embedding("1")*1 + embedding("0")*0.

Why does the model drop the effect of the data value?

tommasocarraro commented 1 year ago

The implementation is correct. Rendle's formulation uses one-hot vectors, so in the sum you mention exactly one user entry and one item entry are 1 and all the others are 0. This implementation uses indexes instead of one-hot vectors. The 0 in your example is not a feature value: it is the index of the first item, which corresponds to the one-hot vector [1,0,0, ..., 0]. So the implementation sums the weight of user 1 and the weight of item 0, which is the same as multiplying the corresponding one-hot vectors by the weight vector w in Rendle's formulation.
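The equivalence can be checked numerically. The snippet below is a minimal sketch with assumed toy sizes (3 users, 4 items). It mirrors the lookup in FeaturesLinear.forward without the bias and compares it against an explicit one-hot product.

```python
import itertools

import torch

torch.manual_seed(0)

field_dims = [3, 4]  # assumed toy sizes: 3 users, 4 items
fc = torch.nn.Embedding(sum(field_dims), 1)
# Start position of each field in the shared table: [0, 3]
offsets = torch.tensor([0, *itertools.accumulate(field_dims)][:-1])

x = torch.tensor([[1, 0]])  # user index 1, item index 0

# Index-based lookup, as in FeaturesLinear.forward (bias omitted)
idx = x + offsets.unsqueeze(0)        # global indexes [[1, 3]]
by_index = torch.sum(fc(idx), dim=1)  # shape (1, 1)

# The same sample written as one concatenated one-hot vector
one_hot = torch.zeros(1, sum(field_dims))
one_hot[0, idx[0]] = 1.0
by_one_hot = one_hot @ fc.weight      # shape (1, 1)

assert torch.allclose(by_index, by_one_hot)
```

Both paths pick out the same two rows of the weight table, so the index form is just a compact way of multiplying by values that are all 1.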

A more interesting question could be: "How does this implementation work with continuous features and multi-hot vectors?". Everything in this implementation is used as an index, on the assumption that each field is encoded as a one-hot vector. The original formulation, by contrast, accepts any real-valued feature vector as input.

In particular, I am interested in the multi-hot question. I read the implementation, which assumes the dataset comes as user-item indexes followed by the ground truth. However, if I want to add movie genres as multi-hot features, how can I do that with this implementation? If we assume each movie has just one genre, it is straightforward since it is enough to add a column after the item index column. If, instead, we have multiple genres for each movie, how is it possible to model that with the current implementation?

To put it more concretely, a sample in the dataset could be encoded as follows, using Rendle's notation:

[1 0 0 0 0 0 0] [0 0 1 0 0 0 0 0 0 0 0 0 0 0] [1 0 1 0 0 0 1 0 0 1] [4]

The vectors are, in order: user | item | item genres/categories | rating

Clearly, the item genres are encoded as multi-hot vectors.

I want to model this data with factorization machines using the implementation provided in this repository.
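One possible direction, sketched below, is to pass a value alongside each index so that the linear term becomes the sum of w_i * x_i over the active features. This is not part of pytorch-fm: `ValueWeightedLinear`, the 1-based global ids and the padding convention are all assumptions made for illustration. The first sample is the encoded example above (7 users, 14 items, 10 genres); the second is a made-up sample with a single genre, padded with index 0.

```python
import torch


class ValueWeightedLinear(torch.nn.Module):
    """Hypothetical linear FM term: sum of w_i * x_i over the active features."""

    def __init__(self, num_features, output_dim=1):
        super().__init__()
        # Index 0 is reserved for padding; its weight row stays at zero.
        self.fc = torch.nn.Embedding(num_features + 1, output_dim, padding_idx=0)
        self.bias = torch.nn.Parameter(torch.zeros((output_dim,)))

    def forward(self, idx, val):
        """
        :param idx: Long tensor ``(batch_size, max_active)`` of global ids (1-based, 0 = padding)
        :param val: Float tensor of the same shape holding the feature values
        """
        # Scale each looked-up weight by its feature value before summing
        return torch.sum(self.fc(idx) * val.unsqueeze(-1), dim=1) + self.bias


torch.manual_seed(0)
num_features = 7 + 14 + 10  # users + items + genres
model = ValueWeightedLinear(num_features)

# Sample 1: user 0, item 2, genres 0, 2, 6, 9 (ids shifted by field offset + 1)
# Sample 2: made-up sample with one genre, padded to the same length
idx = torch.tensor([[1, 10, 22, 24, 28, 31],
                    [2, 11, 23, 0, 0, 0]])
val = torch.tensor([[1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
                    [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]])
out = model(idx, val)  # shape (2, 1)
```

Continuous features fit the same layout by putting the real value in `val` instead of 1.0. The pairwise interaction term would need the same value weighting, since the FM interaction between two features is scaled by the product of their values.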

Thank you very much for the time and effort you put into building such a useful repository.