Open mokarakaya opened 6 years ago
Thank you for this!
I think to move this forward, we'll want to do two things.
On making this fast: as far as I understand, we want to compute the cosine distance between the columns i and j of the (sparse) interaction matrix. There are a couple of things that we can do to make this faster than the current implementation:
We convert the interactions object to a sparse matrix, transpose it, and make it CSR. This way, rows represent items. Call this mat.
We get the lengths of the item vectors (the number of non-zero entries per item) by calling lengths = mat.getnnz(axis=1) (I think; the axis argument always throws me).
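We can get the length of the intersection of i and j, and from it the distance, by doing
numerator = np.in1d(mat[i].indices, mat[j].indices, assume_unique=True).sum()
denominator = np.sqrt(lengths[i] * lengths[j])
distance = 1 - numerator / denominator
(For binary interactions the norm of an item vector is the square root of its non-zero count, and numerator / denominator is the cosine similarity.)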
This way we don't need the cache either.
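Putting the steps above together, a minimal sketch might look like the following. It assumes binary (implicit) interactions, so the intersection size equals the dot product and the norm of an item vector is the square root of its non-zero count. The helper name item_cosine_distance and the toy matrix are made up for illustration; np.isin is used as the current spelling of np.in1d.

```python
import numpy as np
import scipy.sparse as sp


def item_cosine_distance(mat, lengths, i, j):
    """Cosine distance between binary item rows i and j of a CSR matrix."""
    # Column indices stored for a row are the users who interacted with that item.
    users_i = mat.indices[mat.indptr[i]:mat.indptr[i + 1]]
    users_j = mat.indices[mat.indptr[j]:mat.indptr[j + 1]]
    # For binary vectors the dot product is the size of the intersection.
    numerator = np.isin(users_i, users_j, assume_unique=True).sum()
    # The norm of a binary vector is sqrt(number of non-zeros).
    denominator = np.sqrt(float(lengths[i]) * float(lengths[j]))
    if denominator == 0:
        # Assumption: treat items with no interactions as maximally distant.
        return 1.0
    return 1.0 - numerator / denominator


# Toy interaction matrix: 4 users (rows) x 3 items (columns).
interactions = sp.coo_matrix(np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
]))
mat = interactions.T.tocsr()   # rows now represent items
lengths = mat.getnnz(axis=1)   # number of interactions per item

d01 = item_cosine_distance(mat, lengths, 0, 1)
d02 = item_cosine_distance(mat, lengths, 0, 2)
```

Slicing mat.indices with mat.indptr avoids building a new sparse row object for every mat[i] lookup, which tends to matter when this is called for many pairs.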
Let's return this in a (num_users, k * (k-1) / 2) array (it's a list of lists right now).
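One way the (num_users, k * (k-1) / 2) layout could be filled is sketched below. The function name pairwise_distance_array and its distance callable are hypothetical, not part of the library; the callable stands in for whatever item-to-item distance ends up being used, and the toy distance here is only for demonstration.

```python
import itertools

import numpy as np


def pairwise_distance_array(recommendations, distance):
    """Distances between every pair of a user's k recommended items.

    recommendations: (num_users, k) integer array of item ids.
    distance: callable taking two item ids and returning a float.
    Returns a (num_users, k * (k - 1) / 2) float array.
    """
    num_users, k = recommendations.shape
    # All unordered position pairs (a, b) with a < b; there are k * (k - 1) / 2.
    pairs = list(itertools.combinations(range(k), 2))
    out = np.empty((num_users, len(pairs)), dtype=np.float64)
    for u in range(num_users):
        for col, (a, b) in enumerate(pairs):
            out[u, col] = distance(recommendations[u, a], recommendations[u, b])
    return out


# Toy example: 2 users, k = 3, with a stand-in distance function.
recs = np.array([[0, 1, 2],
                 [2, 0, 1]])
out = pairwise_distance_array(recs, lambda i, j: abs(int(i) - int(j)))
```

Taking the distance as an argument also keeps the door open for swapping in other distance functions later without touching the loop.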
We can make it even faster by using the fact that indices are sorted and using numba, but this is probably a good first step.
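To illustrate the sorted-indices idea: if both index arrays are sorted and free of duplicates, the intersection size can be counted in a single merge-style pass. The sketch below is plain Python; the function name is made up, and the expectation that a loop of this shape would compile under numba's njit is an assumption that has not been tested here. SciPy CSR matrices expose sort_indices() to make sure the per-row indices are in order first.

```python
import numpy as np


def sorted_intersection_size(a, b):
    """Count common elements of two sorted arrays of unique integers."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same user in both item vectors: count it and advance both.
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count


common = sorted_intersection_size(np.array([0, 3, 5, 9]),
                                  np.array([1, 3, 4, 9, 12]))
```

The pass costs O(len(a) + len(b)) and allocates nothing, whereas the membership-test version builds a temporary boolean array for each pair.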
Thank you very much for the comments. I agree. I'll check and update accordingly.
In addition to these comments, I'm still looking for a way to make the distance function an input parameter. The current function will be the default one since it's really fast (I'll post exact times separately).
I've fixed the review comments except for the sequence-based models solution.
Calling the intra_distance_score function in the tests takes 6 seconds (only the function) when I run it locally.
I still need to work out how to run this on sequence-based models: we need to convert the sequences into an array of items with user_ids in order to compute the distance between items.
So we need a matrix like this to calculate the distance: [[userId1, userId2, userId3], [userId2, userId3]], where each row represents an item.
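A possible way to build that item-by-user structure from sequences is sketched below. It assumes the sequences come as a 2-D integer array with one user id per row and that item id 0 is used as padding; the function name, argument names and toy data are all made up for illustration.

```python
import numpy as np
import scipy.sparse as sp


def sequences_to_item_user_matrix(sequences, user_ids, num_items, num_users,
                                  padding=0):
    """Build a binary CSR matrix whose rows are items and columns are users."""
    sequences = np.asarray(sequences)
    user_ids = np.asarray(user_ids)
    # Repeat each sequence's user id once per position in that sequence.
    users = np.repeat(user_ids, sequences.shape[1])
    items = sequences.ravel()
    # Drop padding positions (assumed to be item id 0).
    keep = items != padding
    data = np.ones(int(keep.sum()), dtype=np.float32)
    mat = sp.coo_matrix((data, (items[keep], users[keep])),
                        shape=(num_items, num_users)).tocsr()
    # Merge repeated (item, user) pairs, then reset values to keep it binary.
    mat.sum_duplicates()
    mat.data[:] = 1
    return mat


# Toy data: three sequences of length 3 belonging to users 0, 1 and 1.
sequences = [[0, 1, 2],
             [1, 2, 2],
             [0, 0, 3]]
user_ids = [0, 1, 1]
item_user = sequences_to_item_user_matrix(sequences, user_ids,
                                          num_items=4, num_users=2)
item_lengths = item_user.getnnz(axis=1)
```

Each row's stored indices are then the user ids for that item, which is the shape the intersection-based distance needs.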
Any advice or guidance would be greatly appreciated.
Issue #90 (Diversification metrics for evaluation)
Intra_distance diversity is probably the most widely considered metric of diversity. Therefore, I'd like to add it first.