A few functions and modifications added in main.py, and a minor modification in aligned_clustering_layer.py.

The modifications are as follows :

1 - code/main.py :

1.1 - Added fit_without_embedding() : Useful for grid searching. It offers the option to pass an already embedded dataset, as well as the Aligned-UMAP results when the dimensionality reduction output stays the same across consecutive grid-search iterations.
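The idea behind reusing the embedding can be sketched as follows. This is only an illustration of the caching pattern, not the ANTM API : grid_search_with_cached_embedding() and the embed, cluster and score callables are hypothetical names introduced for the example.

```python
from itertools import product


def grid_search_with_cached_embedding(documents, embed, cluster, score, param_grid):
    """Run a clustering grid search while embedding the documents only once.

    All arguments are hypothetical placeholders: `embed` maps documents to
    vectors, `cluster` maps vectors plus parameters to labels, and `score`
    maps labels to a number where higher is better.
    """
    embedding = embed(documents)  # expensive step, done a single time
    names = sorted(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        labels = cluster(embedding, **params)  # only the clustering is repeated
        results.append((params, score(labels)))
    best_params, best_score = max(results, key=lambda item: item[1])
    return best_params, best_score, results
```

The same pattern applies to the dimensionality reduction step whenever its parameters are not part of the grid.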

1.2 - Added get_cluster_info() : This method describes the clustering results returned by HDBSCAN. It returns :

num_clusters : The number of clusters per slice.

number_of_outliers : The number of outlier documents per slice.

number_of_ones : The number of documents assigned to their respective clusters with a maximum membership score ( = 1).

average_probabilities : The average membership score of each cluster in each slice.

period_cluster_sizes : The size (number of documents) of each cluster in each slice.
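A minimal sketch of how such statistics could be derived from per-slice HDBSCAN labels and membership scores is shown below. It is not the code added in this PR : the function name cluster_info_sketch() and the choice of dictionaries keyed by cluster id are assumptions made for the example. It relies only on the HDBSCAN convention that outliers carry the label -1.

```python
import numpy as np


def cluster_info_sketch(labels_per_slice, probs_per_slice):
    """Summarise clustering results slice by slice (illustrative only)."""
    info = {
        "num_clusters": [],
        "number_of_outliers": [],
        "number_of_ones": [],
        "average_probabilities": [],
        "period_cluster_sizes": [],
    }
    for labels, probs in zip(labels_per_slice, probs_per_slice):
        labels = np.asarray(labels)
        probs = np.asarray(probs, dtype=float)
        cluster_ids = np.unique(labels[labels >= 0])  # -1 marks outliers
        info["num_clusters"].append(len(cluster_ids))
        info["number_of_outliers"].append(int(np.sum(labels == -1)))
        info["number_of_ones"].append(int(np.sum(probs == 1.0)))
        info["average_probabilities"].append(
            {int(c): float(probs[labels == c].mean()) for c in cluster_ids}
        )
        info["period_cluster_sizes"].append(
            {int(c): int(np.sum(labels == c)) for c in cluster_ids}
        )
    return info
```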

1.3 - Added pretty_print_cluster_info() : This method prints the results returned by get_cluster_info() in a more readable format.
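As an illustration of what a readable layout might look like, here is a small formatter. The function name format_cluster_info(), the output layout and the assumed input shape (one entry per slice, with per-cluster dictionaries) are all hypothetical and may differ from the actual method.

```python
def format_cluster_info(info):
    """Build a readable, slice-by-slice report from a cluster-info dict."""
    lines = []
    for i, n_clusters in enumerate(info["num_clusters"]):
        lines.append(
            f"Slice index {i}: {n_clusters} clusters, "
            f"{info['number_of_outliers'][i]} outliers, "
            f"{info['number_of_ones'][i]} documents with score 1"
        )
        sizes = info["period_cluster_sizes"][i]
        averages = info["average_probabilities"][i]
        for cluster_id in sorted(sizes):
            lines.append(
                f"  cluster {cluster_id}: size={sizes[cluster_id]}, "
                f"avg score={averages[cluster_id]:.2f}"
            )
    return "\n".join(lines)
```

Returning a string instead of printing directly keeps the formatter easy to test; a caller can simply pass the result to print().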

1.4 - Added an ANTM attribute : self.cluster_proba, which is essentially the membership score array returned by HDBSCAN.

1.5 - During the Jaccard distance calculation in diversity_metrics.py, a division by zero (count = 0) sometimes occurred, which indicated that the list of topics of ANTM could contain an empty list. Checking self.output showed that "slice_num" doesn't necessarily start from 1 : with a 2K dataset, the values present in "slice_num" were 2, 3 and 4. As a result, an empty list was appended to self.topics during the fitting process. The code was modified as follows in main.py, specifically in fit() and fit_without_embedding() :

    self.slice_num = set(self.output["slice_num"])
    self.topics = [self.output[self.output["slice_num"] == i].topic_representation.to_list() for i in self.slice_num]

The same modification was made in load().

This seems to have fixed the issue.
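The effect can be reproduced on a toy DataFrame. In the sketch below, topics_assuming_slices_start_at_one() is only a guess at the shape of the earlier behaviour (the original code is not quoted in this PR), and the data is made up. The second function mirrors the modification above, with sorted() added so that the slice order is explicit.

```python
import pandas as pd

# Toy stand-in for self.output: slice numbers start at 2, not 1.
output = pd.DataFrame({
    "slice_num": [2, 2, 3, 4],
    "topic_representation": [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]],
})


def topics_assuming_slices_start_at_one(output):
    # Hypothetical "before": iterate 1..n regardless of which slices exist.
    n = output["slice_num"].nunique()
    return [
        output[output["slice_num"] == i].topic_representation.to_list()
        for i in range(1, n + 1)
    ]


def topics_from_present_slices(output):
    # "After": iterate only over slice numbers that actually occur.
    slice_nums = sorted(set(output["slice_num"]))
    return [
        output[output["slice_num"] == i].topic_representation.to_list()
        for i in slice_nums
    ]


before = topics_assuming_slices_start_at_one(output)
after = topics_from_present_slices(output)
```

Here before starts with an empty list (no rows for slice 1) and never reaches slice 4, while after holds one non-empty list per slice that is present.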

2 - aligned_clustering_layer.py

2.1 - Modified hdbscan_cluster() to return membership scores alongside labels.
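A minimal sketch of returning both arrays, assuming a clusterer that exposes labels_ and probabilities_ after fit(), as hdbscan.HDBSCAN does. The function name hdbscan_cluster_sketch() and its signature are hypothetical and do not reproduce the actual hdbscan_cluster().

```python
import numpy as np


def hdbscan_cluster_sketch(embeddings, clusterer):
    """Fit the clusterer and return (labels, membership scores).

    `clusterer` is expected to behave like hdbscan.HDBSCAN: after fit(),
    it exposes `labels_` (-1 for outliers) and `probabilities_` (in [0, 1]).
    """
    clusterer.fit(embeddings)
    labels = np.asarray(clusterer.labels_)
    probabilities = np.asarray(clusterer.probabilities_)
    return labels, probabilities


# Possible usage, assuming the hdbscan package is installed:
#   import hdbscan
#   labels, probabilities = hdbscan_cluster_sketch(
#       embeddings, hdbscan.HDBSCAN(min_cluster_size=10)
#   )
```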
