inspirehep / beard

Bibliographic Entity Automatic Recognition and Disambiguation

Advanced author disambiguation #22

Closed glouppe closed 9 years ago

glouppe commented 9 years ago

There are still a few things to fix, but this is ready for reviews. @MSusik @etzemis @natsheh

glouppe commented 9 years ago

So we have a bottleneck on the coauthor features. Because physics papers can have so many authors, computing the TF-IDF vectors for the coauthor field takes a long time.

After investigating some more, I found the true culprit: the current implementation (in affinity) recomputes the vector representation of a signature for every pair in which it appears. We are therefore building O(N^2) such vectors instead of only O(N), where N is the number of signatures. I'll try to come up with something to avoid this while keeping the current API and workflow.
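To make the O(N^2) versus O(N) gap concrete, here is a rough counting sketch. It is not code from this PR: `count_vectorizations` is a hypothetical helper, and it assumes every pair of signatures is compared, whereas in practice blocking would likely reduce the number of pairs.

```python
from itertools import combinations


def count_vectorizations(n_signatures):
    """Return (per_pair, per_signature) counts of vectorization calls.

    per_pair:      both members of every pair are vectorized again.
    per_signature: every signature is vectorized exactly once.
    Assumption: all N * (N - 1) / 2 pairs are considered.
    """
    per_pair = 0
    for _left, _right in combinations(range(n_signatures), 2):
        per_pair += 2  # one vectorization for each side of the pair
    per_signature = n_signatures
    return per_pair, per_signature


for n in (10, 100, 1000):
    print(n, count_vectorizations(n))
```

Under that assumption, the per-pair count grows as N * (N - 1), so the gap widens quickly as the number of signatures increases.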

glouppe commented 9 years ago

I fixed the bottleneck by transforming only unique elements in PairTransformer.
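For readers following along, the idea of transforming only unique elements can be sketched roughly as below. This is an illustration under simplifying assumptions, not the actual `PairTransformer` code: `CountingTransformer` and `transform_pairs_unique` are hypothetical names, and signatures are reduced to plain strings so that `numpy.unique` can deduplicate them.

```python
import numpy as np


class CountingTransformer:
    """Hypothetical stand-in for an expensive per-signature transformer."""

    def __init__(self):
        self.n_rows_seen = 0

    def transform(self, X):
        # Track how many elements were actually vectorized.
        self.n_rows_seen += len(X)
        # Toy "vector": string length and vowel count.
        return np.array([[len(s), sum(c in "aeiou" for c in s)] for s in X])


def transform_pairs_unique(pairs, transformer):
    """Transform each distinct element once, then rebuild per-pair rows."""
    pairs = np.asarray(pairs)  # shape (n_pairs, 2)
    # Deduplicate the flattened elements; `inverse` maps each flattened
    # position back to its index in `unique`.
    unique, inverse = np.unique(pairs.ravel(), return_inverse=True)
    vectors = transformer.transform(unique)  # one row per unique element
    inverse = inverse.reshape(-1, 2)
    # Concatenate the left and right vectors of every pair.
    return np.hstack([vectors[inverse[:, 0]], vectors[inverse[:, 1]]])


pairs = [("smith", "doe"), ("smith", "lee"), ("doe", "lee")]
t = CountingTransformer()
X = transform_pairs_unique(pairs, t)
print(X.shape, t.n_rows_seen)
```

In this toy case the transformer sees 3 elements rather than the 6 it would see if both sides of each pair were transformed independently.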

Ready for reviews.

MSusik commented 9 years ago

I would add some comments to the _flatten function. It is not clear what the function does without taking a deeper look at it.