glouppe closed 9 years ago
There are still a few things to fix, but this is ready for reviews. @MSusik @etzemis @natsheh
So we have a bottleneck on the coauthors features. Because of the number of authors in physics papers, computing the TF-IDF vectors for the coauthor field is really taking some time.
Investigating further, the true culprit is that the current implementation (in `affinity`) recomputes the vector representation of a signature for every pair in which it appears. We are therefore building O(N^2) such vectors, instead of only O(N) (where N is the number of signatures). I'll try to come up with something to avoid this, while keeping the current API and workflow.
I fixed the bottleneck by transforming only unique elements in `PairTransformer`.
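For anyone following along, the idea can be sketched roughly as below. This is only an illustration of "transform each unique element once, then look results up per pair", not the actual `PairTransformer` code: the function name `transform_pairs_unique` and the `transform` callable are hypothetical, and elements are assumed to be hashable.

```python
def transform_pairs_unique(pairs, transform):
    """Transform each distinct element once, then rebuild results per pair.

    `pairs` is a list of (left, right) tuples of hashable elements.
    `transform` is a hypothetical callable mapping a list of elements
    to a list of vectors of the same length (e.g. a fitted vectorizer).
    """
    index = {}   # element -> position in `unique`
    unique = []  # distinct elements, in first-seen order
    for pair in pairs:
        for element in pair:
            if element not in index:
                index[element] = len(unique)
                unique.append(element)

    # One call over O(N) distinct elements instead of one per pair member.
    vectors = transform(unique)

    # Look the precomputed vectors up for every pair.
    return [(vectors[index[a]], vectors[index[b]]) for a, b in pairs]
```

With N signatures there can be on the order of N^2 pairs, but `transform` only ever sees the N distinct elements; the per-pair work is reduced to two lookups.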
Ready for reviews.
I would add some comments to the `_flatten` function. It is not clear what the function does without taking a deeper look at it.