Open Aske-Rosted opened 1 week ago
I’ve been thinking about how we can make the clustering (especially the upcoming new NodeDefinition class, let’s call it Cluster) more flexible and easier to maintain, and I have a suggestion. Instead of having a function within the cluster_and_pad class to create new cluster features, we could create a new class called ClusterFeature that handles summarizing each feature. The class might look something like this:
class ClusterFeature(Model):
    """Base class for summarizing feature series per cluster."""

    @abstractmethod
    def summarisation_feature_label(self):
        """Returns the label for the summarization feature."""

    @abstractmethod
    def add_feature(self):
        """Calculates the summarization feature."""
For every summarization feature we could then create a new subclass of this, e.g. for totals of input features, time of first pulse, cumulative value of an input feature after specific times, and so on.
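To make the subclassing idea concrete, here is a minimal sketch of what two such subclasses could look like. It makes several assumptions that are not part of the proposal: the base class is a plain `ABC` stand-in rather than graphnet's `Model`, `add_feature` takes the pulses of one cluster as a dict of plain Python lists instead of tensors, and the constructor arguments and label names (`feature`, `time_label`, `charge_total`, ...) are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence

# Hypothetical container: pulses of ONE cluster, keyed by input feature name.
Pulses = Dict[str, Sequence[float]]


class ClusterFeature(ABC):
    """Stand-in for the proposed base class (the proposal derives from Model)."""

    @abstractmethod
    def summarisation_feature_label(self) -> List[str]:
        """Return the labels of the columns this feature produces."""

    @abstractmethod
    def add_feature(self, pulses: Pulses) -> List[float]:
        """Compute the summary values for the pulses of one cluster."""


class Totals(ClusterFeature):
    """Sum of a single input feature over the cluster."""

    def __init__(self, feature: str = "charge") -> None:
        self._feature = feature

    def summarisation_feature_label(self) -> List[str]:
        return [f"{self._feature}_total"]

    def add_feature(self, pulses: Pulses) -> List[float]:
        return [float(sum(pulses[self._feature]))]


class TimeOfFirstPulse(ClusterFeature):
    """Earliest pulse time in the cluster."""

    def __init__(self, time_label: str = "t") -> None:
        self._time_label = time_label

    def summarisation_feature_label(self) -> List[str]:
        return [f"{self._time_label}_first_pulse"]

    def add_feature(self, pulses: Pulses) -> List[float]:
        return [float(min(pulses[self._time_label]))]


# Toy cluster with three pulses.
pulses = {"t": [12.0, 10.5, 30.0], "charge": [0.5, 1.5, 1.0]}
print(Totals().add_feature(pulses))            # [3.0]
print(TimeOfFirstPulse().add_feature(pulses))  # [10.5]
```

Each subclass owns both its label and its computation, so adding a new summary only means adding a new class.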
Then, the new NodeDefinition (which we’ll call Cluster) can just take a list of these ClusterFeature instances. For example:
features = [
    Percentiles(percentiles=[10, 30, 50, 70]),
    Totals(),
    CumulativeAfterT(times=[10, 50, 40]),
    TimeOfPercentiles(percentiles=[10, 30, 50, 70]),
    Std(),
    TimeOfFirstPulse(),
    AddCounts(),
]
Cleaner Constructor: Instead of passing a ton of individual arguments to __init__(), we’d just pass a list of ClusterFeature instances. This will make the Cluster class constructor much cleaner and easier to work with.
Modular and Flexible: Each summarization feature could be its own class, which makes it super easy to add new features down the line without cluttering the Cluster class.
Easier Subsetting: You could easily calculate features on just a subset of the input data (like just the charge), instead of always calculating for everything at once.
Consistency: By looping through the ClusterFeature list, we ensure that the feature labels and data always line up correctly, so there’s less risk of mistakes from mismatched ordering between _get_indices_and_feature_names and _construct_nodes.
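The consistency and subsetting points can be illustrated with a small sketch of the loop a Cluster-like class might run. It is only a sketch under assumptions: `summarise_cluster` is a hypothetical helper, the base class is a plain `ABC` stand-in rather than graphnet's `Model`, data is held in plain Python lists instead of tensors, and the `features` argument of `Totals` is an invented way to express "only the charge".

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence, Tuple

# Hypothetical container: pulses of ONE cluster, keyed by input feature name.
Pulses = Dict[str, Sequence[float]]


class ClusterFeature(ABC):
    """Stand-in for the proposed base class (the proposal derives from Model)."""

    @abstractmethod
    def summarisation_feature_label(self) -> List[str]:
        """Return the labels of the columns this feature produces."""

    @abstractmethod
    def add_feature(self, pulses: Pulses) -> List[float]:
        """Compute the summary values for the pulses of one cluster."""


class Totals(ClusterFeature):
    """Sums only the selected input features (subsetting)."""

    def __init__(self, features: Sequence[str]) -> None:
        self._features = list(features)

    def summarisation_feature_label(self) -> List[str]:
        return [f"{name}_total" for name in self._features]

    def add_feature(self, pulses: Pulses) -> List[float]:
        return [float(sum(pulses[name])) for name in self._features]


class AddCounts(ClusterFeature):
    """Number of pulses in the cluster."""

    def summarisation_feature_label(self) -> List[str]:
        return ["counts"]

    def add_feature(self, pulses: Pulses) -> List[float]:
        # All columns have the same length, so any of them gives the count.
        return [float(len(next(iter(pulses.values()))))]


def summarise_cluster(
    pulses: Pulses, features: Sequence[ClusterFeature]
) -> Tuple[List[str], List[float]]:
    """Build labels and values in a single loop so their order cannot drift."""
    labels: List[str] = []
    values: List[float] = []
    for feature in features:
        new_labels = feature.summarisation_feature_label()
        new_values = feature.add_feature(pulses)
        if len(new_labels) != len(new_values):
            raise ValueError(f"{type(feature).__name__}: label/value mismatch")
        labels.extend(new_labels)
        values.extend(new_values)
    return labels, values


pulses = {"t": [12.0, 10.5, 30.0], "charge": [0.5, 1.5, 1.0]}
labels, values = summarise_cluster(pulses, [Totals(["charge"]), AddCounts()])
print(dict(zip(labels, values)))  # {'charge_total': 3.0, 'counts': 3.0}
```

Because labels and values come from the same object in the same iteration, a mismatch is caught where it originates instead of surfacing later as misaligned columns.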
Let me know what you think! I think this could make things more modular and easier to extend, but I’d love to hear your thoughts. I am happy to help out in any way.
I am aware that this would not add much functionality and is more of a style choice, I guess. I'm already very happy with this PR, and I can build on it to implement some more summarization features once it's through.
This PR adds utility functions which will enable DOM-level clustering as described in #752. This PR does not include a NodesASDOM definition but enables the creation of one. The new utilities completely cover the functionality of cluster_summarize_with_percentiles, so a deprecation warning and a "TODO" for removing that function have been added, and it can be removed some time in the future.
The PercentileClusters node definition has been changed to use the new utility class/function, and this node definition is tested by the test_percentile_cluster() test in graphnet/tests/models/test_node_definition.py.
I plan to follow up soon with a new node definition but thought it would be better not to grow the PR too large.
I created some tests to check the performance of the new utilities and gathered clustering times as seen below.
Here the tensor is some pseudo pulse data. The plots below show that the new utilities are about a factor of 2 faster than the old implementation.