maciejkula / spotlight

Deep recommender models using PyTorch.
MIT License
2.97k stars, 421 forks

add intra_distance_score evaluation #103

Open mokarakaya opened 6 years ago

mokarakaya commented 6 years ago

Issue #90 (Diversification metrics for evaluation)

Intra-distance diversity is probably the most widely used diversity metric, so I'd like to add it first.

maciejkula commented 6 years ago

Thank you for this!

I think to move this forward, we'll want to do two things.

  1. Can we add some references to the metric in the docstring? Maybe some papers that use it?
  2. We'll need to make this fast. From a cursory glance at the code, I suspect it's incredibly slow.

On making this fast: as far as I understand, we want to compute the cosine distance between the columns i and j of the (sparse) interaction matrix. There are a couple of things that we can do to make this faster than the current implementation:

  1. We convert the interactions object to a sparse matrix, transpose it, and make it CSR. This way, rows represent items. Call this mat.

  2. We get the lengths of the item vectors by calling lengths = mat.getnnz(axis=1) (I think; the axis argument always throws me).

  3. We can get the length of the intersection of i and j by doing

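    # Assuming binary interactions, the intersection size is the dot product of the two item vectors.
    numerator = np.in1d(mat[i].indices, mat[j].indices, assume_unique=True).sum()
    # A binary item vector has norm sqrt(nnz), so the product of the norms is sqrt(len_i * len_j).
    denominator = np.sqrt(lengths[i] * lengths[j])
    distance = 1 - numerator / denominator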

    This way we don't need the cache either.

  4. Let's return this in a (num_users, k * (k-1) / 2) array (it's a list of lists right now).
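The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: interactions are treated as binary, the input is anything `scipy.sparse.csr_matrix` accepts with shape (num_users, num_items), and the function name `pairwise_item_distances` and its `recommendations` argument are hypothetical.

```python
import itertools

import numpy as np
import scipy.sparse as sp


def pairwise_item_distances(interactions, recommendations):
    """Cosine distances between every pair of recommended items, per user.

    interactions: (num_users, num_items) matrix, treated as binary (assumption).
    recommendations: integer array of shape (num_users, k) with item ids.
    Returns a float array of shape (num_users, k * (k - 1) // 2).
    """
    # Step 1: transpose and convert to CSR so that each row is an item.
    mat = sp.csr_matrix(interactions).T.tocsr()
    # Step 2: number of users who interacted with each item.
    lengths = mat.getnnz(axis=1).astype(float)

    recommendations = np.asarray(recommendations)
    num_users, k = recommendations.shape
    pairs = list(itertools.combinations(range(k), 2))
    distances = np.zeros((num_users, len(pairs)))

    for u in range(num_users):
        for col, (a, b) in enumerate(pairs):
            i, j = recommendations[u, a], recommendations[u, b]
            # Step 3: user ids stored for items i and j, read straight from CSR.
            users_i = mat.indices[mat.indptr[i]:mat.indptr[i + 1]]
            users_j = mat.indices[mat.indptr[j]:mat.indptr[j + 1]]
            overlap = np.intersect1d(users_i, users_j, assume_unique=True).size
            norm = np.sqrt(lengths[i] * lengths[j])
            # Items with no interactions are given the maximal distance of 1.
            similarity = overlap / norm if norm > 0 else 0.0
            distances[u, col] = 1.0 - similarity
    # Step 4: one row per user, one column per item pair.
    return distances
```

The Python-level double loop is still slow for large `k`; the point here is only that no cache is needed once the item rows are available in CSR form.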

We can make it even faster by using the fact that indices are sorted and using numba, but this is probably a good first step.
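To illustrate the sorted-indices idea: when both index arrays are sorted and free of duplicates, the intersection size can be counted in a single merge-style pass. The helper name below is hypothetical, and this is a plain-Python sketch. A loop of this shape is the kind of code numba's `njit` decorator is meant to compile, but the decorator is left out here to avoid the extra dependency. Note that CSR indices are only guaranteed to be sorted after calling `sort_indices()` on the matrix.

```python
def sorted_intersection_size(a, b):
    """Count the values common to two sorted, duplicate-free sequences."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same user id in both item rows: part of the intersection.
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Advance whichever pointer currently holds the smaller value.
            i += 1
        else:
            j += 1
    return count
```

This visits each element at most once, so the cost is linear in the combined length of the two arrays.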

mokarakaya commented 6 years ago

Thank you very much for the comments. I agree. I'll check and update accordingly.

In addition to these comments, I'm still looking for a way to make the distance function an input parameter. The current function would stay as the default since it's really fast (I'll post exact timings separately).
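One possible shape for a pluggable distance function is sketched below. All names here (`cosine_distance`, `jaccard_distance`, `intra_distances`, `item_users`) are hypothetical and are not taken from the PR; the sketch assumes each item is represented by the set of user ids that interacted with it.

```python
import math
from itertools import combinations


def cosine_distance(users_a, users_b):
    """Cosine distance between two items, each given as a set of user ids."""
    if not users_a or not users_b:
        return 1.0
    overlap = len(users_a & users_b)
    return 1.0 - overlap / math.sqrt(len(users_a) * len(users_b))


def jaccard_distance(users_a, users_b):
    """Jaccard distance, shown only as an example of an alternative metric."""
    union = len(users_a | users_b)
    if union == 0:
        return 1.0
    return 1.0 - len(users_a & users_b) / union


def intra_distances(item_users, recommended, distance=cosine_distance):
    """Distances between all pairs of recommended items.

    item_users: dict mapping item id -> set of user ids (assumed layout).
    distance: any callable taking two sets and returning a float.
    """
    return [
        distance(item_users[i], item_users[j])
        for i, j in combinations(recommended, 2)
    ]
```

A caller would then pass `distance=jaccard_distance` (or any other callable with the same signature) to swap the metric while keeping the default behaviour unchanged.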

mokarakaya commented 5 years ago

I've fixed the review comments except for the sequence-based models solution.

Calling the intra_distance_score function in the tests takes 6 seconds (for the function alone) when I run it locally.

I still need to work out how to run this on sequence-based models: we need to convert the sequences into per-item arrays of user IDs in order to compute the distance between items.

So we need a matrix like this to calculate the distance: [[userId1, userId2, userId3], [userId2, userId3]]

where each row represents an item.
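A minimal sketch of that conversion, assuming each sequence is a list of item ids belonging to one user, that a parallel list of user ids is available, and that 0 is used as the padding value. The function name `items_to_users` and its arguments are hypothetical, and how the sequence object exposes these arrays is not covered here.

```python
from collections import defaultdict


def items_to_users(sequences, user_ids, padding=0):
    """Map each item id to the sorted list of users whose sequences contain it.

    sequences: iterable of item-id sequences, one per entry in user_ids.
    user_ids: iterable of user ids, aligned with sequences.
    padding: item id used to pad short sequences (assumed to be 0).
    """
    item_users = defaultdict(set)
    for user_id, sequence in zip(user_ids, sequences):
        for item_id in sequence:
            if item_id != padding:
                # A set keeps each user at most once per item.
                item_users[item_id].add(user_id)
    return {item: sorted(users) for item, users in item_users.items()}
```

For example, `items_to_users([[0, 5, 6], [5, 6, 7]], [1, 2])` gives one sorted user list per item, which is the row layout described above.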

Any advice or guidance would be greatly appreciated.