inspirehep / beard

Bibliographic Entity Automatic Recognition and Disambiguation

block_clusterer.fit(X) string indices must be integers #102

Closed Gerwi closed 5 years ago

Gerwi commented 5 years ago

Thanks a lot for the effort put into developing this library. Unfortunately, I am encountering an issue when running the block_clusterer.fit(X) line in one of the examples (it occurs in both Python 2 and 3). The only further detail Python provides is that string indices must be integers.

It is probably a very newbie issue, which might be related to one of the other libraries involved. Below is my code:

from __future__ import print_function

import numpy as np

from beard.clustering import BlockClustering
from beard.clustering import block_last_name_first_initial
from beard.clustering import ScipyHierarchicalClustering
from beard.metrics import paired_f_score
from beard.utils import normalize_name
from beard.utils import name_initials

import scipy
import sklearn


def affinity(X):
    """Compute pairwise distances between (author, affiliation) tuples.

    Note that this function is a heuristic. It should ideally be replaced
    by a more robust distance function, e.g. using a model learned over
    pairs of tuples.
    """
    distances = np.zeros((len(X), len(X)), dtype=np.float)

    for i, j in zip(*np.triu_indices(len(X), k=1)):
        name_i = normalize_name(X[i, 0])
        aff_i = X[i, 1]
        initials_i = name_initials(name_i)
        name_j = normalize_name(X[j, 0])
        aff_j = X[j, 1]
        initials_j = name_initials(name_j)

        # Names and affiliations match
        if (name_i == name_j and aff_i == aff_j):
            distances[i, j] = 0.0

        # Compatible initials and affiliations match
        elif (len(initials_i | initials_j) == max(len(initials_i),
                                                  len(initials_j)) and
              aff_i == aff_j and aff_i != ""):
            distances[i, j] = 0.0

        # Initials are not compatible
        elif (len(initials_i | initials_j) != max(len(initials_i),
                                                  len(initials_j))):
            distances[i, j] = 1.0

        # We don't know
        else:
            distances[i, j] = 0.5

    distances += distances.T
    return distances

if __name__ == "__main__":

    # Load data
    data = np.load("author-disambiguation.npz", encoding='latin1')
    X = data["X"]
    truth = data["y"]

    # Block clustering with fixed threshold
    block_clusterer = BlockClustering(
        blocking=block_last_name_first_initial,
        base_estimator=ScipyHierarchicalClustering(
            threshold=0.5,
            affinity=affinity,
            method="complete"),
        verbose=3,
        n_jobs=-1)
    block_clusterer.fit(X)
    labels = block_clusterer.labels_

    # Print clusters
    for cluster in np.unique(labels):
        entries = set()

        for name, affiliation in X[labels == cluster]:
            entries.add((name, affiliation))

        print("Cluster #%d = %s" % (cluster, entries))
    print()

    # Statistics
    print("Number of blocks =", len(block_clusterer.clusterers_))
    print("True number of clusters", len(np.unique(truth)))
    print("Number of computed clusters", len(np.unique(labels)))

    print("Paired F-score =", paired_f_score(truth, labels))

YashSharma commented 5 years ago

Hey @Gerwi, just replace signature["author_name"] with signature in blocking_funcs.py before running "python setup.py install".
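
For illustration, here is a minimal sketch of the kind of one-line change being suggested. The surrounding function and its name (_block_key) are hypothetical and do not reproduce the actual contents of blocking_funcs.py; the sketch only shows the difference between indexing each signature as a dict and using it directly as a string, which is what the tutorial data provides.

def _block_key(signature):
    # Hypothetical helper, for illustration only.

    # Before: assumes each signature is a dict, which raises
    # "string indices must be integers" when the signature is a plain string.
    # full_name = signature["author_name"]

    # After (the suggested fix): use the signature value directly.
    full_name = signature

    last_name = full_name.split(",")[0].strip().lower()
    first_initial = full_name.split(",")[-1].strip()[:1].lower()
    return "%s %s" % (last_name, first_initial)

With "Lastname, Firstname" strings this yields keys like "doe j", which matches the block-by-last-name-and-first-initial idea used in the example.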

Gerwi commented 5 years ago

Thanks @YashSharma, the error disappeared after making this adjustment. However, I am still wondering whether the library is working properly: executing this example takes forever on my i7 processor.

YashSharma commented 5 years ago

@Gerwi Yes, it takes forever to run with n_jobs = -1; change it to 1.
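
For anyone else hitting this, the only change relative to the example above is the n_jobs argument; a minimal sketch of the serial configuration:

# Same configuration as in the example above, run serially to avoid the
# multiprocessing overhead discussed later in this thread.
block_clusterer = BlockClustering(
    blocking=block_last_name_first_initial,
    base_estimator=ScipyHierarchicalClustering(
        threshold=0.5,
        affinity=affinity,
        method="complete"),
    verbose=3,
    n_jobs=1)  # was n_jobs=-1
block_clusterer.fit(X)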

Gerwi commented 5 years ago

Thanks, that resolved the issue

MSusik commented 5 years ago

@YashSharma

Yes it takes forever to run with n_jobs = -1

That's actually interesting; it would mean there is way too much copying between processes. I will have to check it later.
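
One quick way to check this is a timing comparison. The sketch below reuses the affinity function and the X array from the example above (it does not profile beard internals); it just measures fit time with and without parallel jobs.

import time

from beard.clustering import BlockClustering
from beard.clustering import block_last_name_first_initial
from beard.clustering import ScipyHierarchicalClustering

for n_jobs in (1, -1):
    clusterer = BlockClustering(
        blocking=block_last_name_first_initial,
        base_estimator=ScipyHierarchicalClustering(
            threshold=0.5,
            affinity=affinity,   # affinity function defined in the example above
            method="complete"),
        n_jobs=n_jobs)
    start = time.time()
    clusterer.fit(X)             # X loaded as in the example above
    print("n_jobs=%d: %.1f s, %d blocks"
          % (n_jobs, time.time() - start, len(clusterer.clusterers_)))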