pnnl / HyperNetX

Python package for hypergraph analysis and visualization.
https://hypernetx.readthedocs.io

Comparison between weighted and unweighted hypergraphs in clustering #99

Closed · young917 closed 1 year ago

young917 commented 1 year ago

Hello, thank you for sharing this excellent library.

I want to compare the performance of clustering on weighted and unweighted hypergraphs as described in K. Hayashi, S. Aksoy, C. Park, H. Park, "Hypergraph random walks, Laplacians, and clustering". I tried to do this based on Tutorial 11, but I think some modifications are needed. The line

h = hnx.Hypergraph(hnx.StaticEntitySet(data=data, weights=w))

does not appear to create a weighted hypergraph. Instead, I think it should be revised as below:

h = hnx.Hypergraph(hnx.StaticEntitySet(data=data, weights=w), weights=w)
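
To illustrate, here is a minimal check (the toy data, the weight values, and the weights=True flag on incidence_matrix are only my assumptions about how to inspect the cells): build a small (edge, node) array, construct the hypergraph as above, and look at the nonzero entries of the incidence matrix.

import numpy as np
import hypernetx as hnx

# toy (edge, node) pairs with one made-up weight per pair
data = np.array([[0, 0], [0, 1], [1, 1], [1, 2]])
w = np.array([0.5, 2.0, 1.0, 3.0])

h = hnx.Hypergraph(hnx.StaticEntitySet(data=data, weights=w))
# if the weights were dropped, every nonzero entry below is 1.0
# instead of the values in w
print(h.incidence_matrix(weights=True).toarray())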

Additionally, I do not get the expected result that the weighted hypergraph performs better than the unweighted one. So I would like to ask whether the code below is a correct way to perform clustering on an unweighted hypergraph and to evaluate the clustering results with NMI scores.

import numpy as np
from scipy.sparse import coo_matrix
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
import hypernetx as hnx

# get data: two newsgroup categories
# (all_categories is the array of category names from Tutorial 11)
categories = all_categories[[1,15]]
twenty_train = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
doc_types = dict()
for i, x in enumerate(twenty_train.filenames):
    doc_types[i] = x.split('/')[-2]  # ground-truth category name from the file path
tfidf_vect = TfidfVectorizer()
X_tfidf = tfidf_vect.fit_transform(twenty_train.data)

# construct hypergraph: tf-idf columns (terms) become edges,
# rows (documents) become nodes, and tf-idf values become cell weights
mat = coo_matrix(X_tfidf)
edges = mat.col
nodes = mat.row
data = np.array([edges, nodes]).T
weights = mat.data

num_clus = 2  # number of clusters (the two categories selected above)

# clustering on the weighted hypergraph
h = hnx.Hypergraph(hnx.StaticEntitySet(data=data, weights=weights), weights=weights)
clusters = hnx.spec_clus(h, num_clus, weights=True)

# clustering on the unweighted hypergraph
uwh = hnx.Hypergraph(hnx.StaticEntitySet(data=data, weights=None))
uw_clusters = hnx.spec_clus(uwh, num_clus, weights=False)

# construct clustering label lists
categoryindexing = {}
_labels = {}  # ground-truth category index for each document
for i in range(X_tfidf.shape[0]):
    if doc_types[i] not in categoryindexing:
        categoryindexing[doc_types[i]] = len(categoryindexing)
    ci = categoryindexing[doc_types[i]]
    _labels[i] = ci
labels = [_labels[i] for i in range(X_tfidf.shape[0])]

_w_pred = {}  # predicted cluster label per document, weighted hypergraph
for i in clusters:
    for v in clusters[i]:
        _w_pred[v] = i
w_pred = [_w_pred[i] for i in range(X_tfidf.shape[0])]

_uw_pred = {}  # predicted cluster label per document, unweighted hypergraph
for i in uw_clusters:
    for v in uw_clusters[i]:
        _uw_pred[v] = i
uw_pred = [_uw_pred[i] for i in range(X_tfidf.shape[0])]

# evaluate the clusterings with NMI scores
from sklearn.metrics.cluster import normalized_mutual_info_score as NMI_SCORE
score = NMI_SCORE(labels, w_pred)
print(score) # 0.73
score = NMI_SCORE(labels, uw_pred)
print(score) # 0.78
brendapraggastis commented 1 year ago

We have contacted the authors and will look with them at your code. Will post when we have more information.

thosvarley commented 1 year ago

I have run into the same issue. If I run

hnx.Hypergraph(edges, weights=[w1, w2, w3...])

the resulting hypergraph has all weights of 1.0.

If I try the line that young917 suggested above, it doesn't work either.

brendapraggastis commented 1 year ago

This is a bug. We are working on it.

brendapraggastis commented 1 year ago

@thosvarley and @young917 HNX 2.0 will be released on Saturday May 13. You will add cell weights to the incidence matrix using the cell_weights keyword in the hypergraph constructor. Please read the documentation for formatting and let us know if anything is unclear.
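
Roughly, the usage will look like the sketch below; the toy (edge, node) array and weight values are only illustrative, and the accepted setsystem formats are spelled out in the documentation.

import numpy as np
import hypernetx as hnx

# toy (edge, node) pairs and one cell weight per pair, mirroring the
# data/weights arrays built from the tf-idf matrix earlier in this thread
data = np.array([[0, 0], [0, 1], [1, 1], [1, 2]])
weights = [0.5, 2.0, 1.0, 3.0]

# HNX 2.0: cell weights are passed to the Hypergraph constructor directly
h = hnx.Hypergraph(data, cell_weights=weights)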