Clustering utilities - Githubissues

This PR add utility functions which will enable DOM-level clustering as described in #752. This PR does not include a NodesASDOM's definition but enables the creation of one. The new utilities completely contain the functionality of cluster_summarize_with_percentiles and as such a warning of this function being deprecated and a "TODO" of the removal of said function has been added so it can be removed some time in the future.

The PercentileClusters node defintion has been changed to use the new utility class/function, and this node definition is tested in the graphnet/tests/models/test_node_definition.py test_percentile_cluster() test.

I plan to soon follow up with a new node definition but thought it would be better to not grow the PR too large.

I created some test to ensure the performance of the new utilities and gathered clustering times as seen below

    ts = time()
    test = cluster_summarize_with_percentiles(tensors, [3,4], [0,1,2], percentiles, add_counts=True)
    old_times.append(time()-ts)

    ts = time()
    cluster_class = cluster_and_pad(tensors,[0,1,2])
    cluster_class.percentile_summarize([3,4],percentiles)
    cluster_class.add_counts(cluster_class.clustered_x.shape[1])
    new_times.append(time()-ts)

    ts = time()
    cluster_class_charge = cluster_and_pad(tensors,[0,1,2])
    cluster_class_charge.add_charge_threshold_summary([3,4],percentiles,4)
    cluster_class_charge.add_counts(cluster_class_charge.clustered_x.shape[1])
    charge_cluster_times.append(time()-ts)

With the tensor being some pseudo pulse data. The plots below show that the new utilities are about a factor 2 faster than the old implementation. cluster_time_ratio cluster_time_scatter cluster_time_hist

Idea for improving Handling of Summarization Features in the Clustering nodes

I’ve been thinking about how we can make the clustering (especially the upcoming new NodeDefiniton class lets call it Cluster) more flexible and easier to maintain, and I have a suggestion. Instead of having a function within the cluster_and_pad class to create new cluster features, we could create a new class called ClusterFeature that handles summarizing each feature. The class might look something like this:

class ClusterFeature(Model):
    """Base class for summarizing feature series per cluster."""

    @abstractmethod
    def summarisation_feature_label(self):
        """Returns the label for the summarization feature."""

    @abstractmethod
    def add_feature(self):
        """Calculates the summarization feature."""

For every summarization feature we could then create a new subclass of this. E.g. for Totals of input features, Time of first pulse, Cumulative value of input feature after specific times, ....

Then, the new NodeDefinition (which we’ll call Cluster) can just take a list of these ClusterFeature instances. For example:

features = [
    Percentiles(percentiles=[10, 30, 50, 70]),
    Totals(),
    CumulativeAfterT(times=[10, 50, 40]),
    TimeOfPercentiles(percentiles=[10, 30, 50, 70]),
    Std(),
    TimeOfFirstPulse(),
    AddCounts(),
]

Why This Might Be Useful:

Cleaner Constructor: Instead of passing a ton of individual arguments to __init__(), we’d just pass a list of ClusterFeature instances. This will make the Cluster class constructor much cleaner and easier to work with.
Modular and Flexible: Each summarization feature could be its own class, which makes it super easy to add new features down the line without cluttering the Cluster class.
Easier Subsetting: You could easily calculate features on just a subset of the input data (like just the charge), instead of always calculating for everything at once.
Consistency: By looping through the ClusterFeature list, we ensure that the feature labels and data always line up correctly, so there’s less risk of mistakes with mismatched ordering in the _get_indices_and_feature_names and _construct_nodes.

Potential Concerns:

Speed: There might be a performance hit because of the extra abstraction but I don't have a feeling for that
Configuration: We’ll want to make sure this change doesn’t mess with any of the config file stuff or other initialization logic. (I am not too familiar of how the yaml files are being used)

Let me know what you think! I think this could make things more modular and easier to extend, but I’d love to hear your thoughts. I am happy to help out in any way.

I am aware that this would not add much functionality and is more a style choice I guess. I'm already very happy with this PR and I can build on this as well to implement some more summarization features once its through.

graphnet-team / graphnet

Clustering utilities #770

Idea for improving Handling of Summarization Features in the Clustering nodes

Why This Might Be Useful:

Potential Concerns: