hamedR96 / ANTM

Aligned Neural Topic Model (ANTM) for Exploring Evolving Topics: a dynamic neural topic model that uses document embeddings (data2vec) to compute clusters of semantically similar documents at different periods, and aligns document clusters to represent topic evolution.
MIT License
33 stars, 7 forks

A few functions and modifications added in main.py, minor modification in aligned_clustering_layer. #1

Closed Allaa-boutaleb closed 1 year ago

Allaa-boutaleb commented 1 year ago

The modifications are as follows :

1 - code/main.py :

1.1 - Added fit_without_embedding() : Useful for grid searching. It accepts an already-embedded dataset and, optionally, precomputed Aligned-UMAP results, so that the dimensionality reduction output can be reused when it is the same across consecutive grid-search iterations.

1.2 - Added get_cluster_info() : This method describes the clustering results returned by HDBSCAN. It returns :

num_clusters : The number of clusters per slice.
number_of_outliers : The number of outlier documents per slice.
number_of_ones : The number of documents assigned to their respective clusters with a maximum membership score ( = 1).
average_probabilities : The average membership score of each cluster in each slice.
period_cluster_sizes : The size (number of documents) of each cluster in each slice.
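
As a rough illustration of what these five statistics mean, here is a minimal sketch that computes them from per-slice HDBSCAN labels and membership scores. The function name cluster_info and its list-of-lists inputs are assumptions made for this example, not the signature used in the PR; the only fact relied on is that HDBSCAN labels outliers as -1.

```python
from statistics import mean

OUTLIER = -1  # HDBSCAN marks noise/outlier points with the label -1


def cluster_info(labels_per_slice, probs_per_slice):
    """Hypothetical helper: summarise clustering results slice by slice."""
    info = {
        "num_clusters": [],
        "number_of_outliers": [],
        "number_of_ones": [],
        "average_probabilities": [],
        "period_cluster_sizes": [],
    }
    for labels, probs in zip(labels_per_slice, probs_per_slice):
        clusters = sorted(set(labels) - {OUTLIER})
        info["num_clusters"].append(len(clusters))
        info["number_of_outliers"].append(sum(1 for l in labels if l == OUTLIER))
        # Documents whose membership score is exactly the maximum (1.0).
        info["number_of_ones"].append(sum(1 for p in probs if p == 1.0))
        # Mean membership score per cluster, outliers excluded.
        info["average_probabilities"].append(
            {c: mean(p for l, p in zip(labels, probs) if l == c) for c in clusters}
        )
        info["period_cluster_sizes"].append(
            {c: sum(1 for l in labels if l == c) for c in clusters}
        )
    return info


# Two toy slices with four documents each.
example = cluster_info(
    labels_per_slice=[[0, 0, 1, -1], [0, 1, 1, 1]],
    probs_per_slice=[[1.0, 0.5, 1.0, 0.0], [1.0, 1.0, 0.75, 0.25]],
)
```

In the toy input, the first slice has two clusters and one outlier, and the second slice has two clusters and no outliers.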

1.3 - Added pretty_print_cluster_info() : This method prints the results returned by get_cluster_info() in a more readable format.

1.4 - Added an ANTM attribute : self.cluster_proba, which is essentially the membership score array returned by HDBSCAN.

1.5 - During the Jaccard distance calculation in diversity_metrics.py, a division by zero (count = 0) sometimes occurred, which suggested that the list of ANTM topics could contain an empty list. Inspecting self.output showed that slice_num does not necessarily start from 1: with a 2K dataset, the values of slice_num that were present were 2, 3 and 4. As a result, an empty list was appended to self.topics during fitting. The code was modified as follows :

In main.py, specifically in fit() and fit_without_embedding() :

self.slice_num = set(self.output["slice_num"])
self.topics = [self.output[self.output["slice_num"] == i].topic_representation.to_list() for i in self.slice_num]

The same modifications were applied in load().

It seems to have fixed the issue.
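
The effect can be reproduced with a small sketch. It uses plain Python records as a stand-in for the self.output DataFrame, and it assumes that the earlier code iterated over slice numbers starting from 1; that assumption is inferred from the description above, not taken from the original source. The sorted() call is an addition for deterministic ordering in this example only.

```python
# Each record stands in for one row of self.output (a pandas DataFrame in ANTM).
rows = [
    {"slice_num": 2, "topic_representation": ["model", "data"]},
    {"slice_num": 3, "topic_representation": ["topic", "text"]},
    {"slice_num": 4, "topic_representation": ["graph", "node"]},
    {"slice_num": 4, "topic_representation": ["time", "series"]},
]


def topics_for(slice_ids):
    """Collect the topic representations of every slice id, in order."""
    return [
        [r["topic_representation"] for r in rows if r["slice_num"] == i]
        for i in slice_ids
    ]


# Assumed earlier behaviour: slice ids taken as 1..max, whether present or not.
max_slice = max(r["slice_num"] for r in rows)
topics_before = topics_for(range(1, max_slice + 1))

# Behaviour after the change: only slice ids that actually occur in the data.
slice_ids = sorted({r["slice_num"] for r in rows})
topics_after = topics_for(slice_ids)

# Positions of empty topic lists; a metric that divides by a pair count
# derived from such a list would hit the division by zero described above.
empty_before = [i for i, t in enumerate(topics_before) if not t]
empty_after = [i for i, t in enumerate(topics_after) if not t]
```

With slice 1 absent from the data, the first variant yields an empty list at position 0, while the second yields one non-empty list per slice that is present.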

2 - aligned_clustering_layer.py

2.1 - Modified hdbscan_cluster() to return membership scores alongside labels.