hobuinc / silvimetric

Kmann/metrics #91

Closed: kylemann16 closed this 2 months ago

kylemann16 commented 3 months ago

Working off of PR #70 to add Metric Dependencies so that we don't run the same methods more than once.

The core piece added here is intermediate metrics: metrics that are not represented in TileDB, but that act as "base" methods other metrics can build on. Below is an example from p_moments.py, where one method depends on the mean having been computed, and several others depend on the base moments method. The moments intermediate metric then passes all the necessary values as args to its dependent methods.

import numpy as np
from scipy.stats import moment

from silvimetric import Metric


def m_moments(data, *args):
    # args[0] is the result of the 'mean' dependency
    mean = args[0]
    # 2nd, 3rd, and 4th central moments about the precomputed mean
    return moment(data, center=mean, order=[2, 3, 4], nan_policy='omit').tolist()

def m_mean(data, *args):
    return np.mean(data)

def m_variance(data, *args):
    # args[0] is the list returned by moment_base: [m2, m3, m4]
    return args[0][0]

def m_skewness(data, *args):
    return args[0][1]

def m_kurtosis(data, *args):
    return args[0][2]

mean = Metric(name='mean', dtype=np.float32, method=m_mean)
# Intermediate metric: computed once per run, not stored in TileDB
moment_base = Metric(name='moment_base', dtype=object, method=m_moments, dependencies=[mean])
variance = Metric(name='variance', dtype=np.float32, method=m_variance, dependencies=[moment_base])
skewness = Metric(name='skewness', dtype=np.float32, method=m_skewness, dependencies=[moment_base])
kurtosis = Metric(name='kurtosis', dtype=np.float32, method=m_kurtosis, dependencies=[moment_base])

In order to accomplish this, a dependency graph was needed. Dask's own dependency tracking is generally good enough, but because these methods can end up far apart in the workflow, I found it easier to create Delayed objects whose keys include uuids tied to specific data runs. That way dask understands that moment_base requires mean, and it knows which of the potentially thousands of mean tasks being run is the correct one. A minimal sketch of that keying scheme is below.
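
For illustration only, here is a minimal sketch of the keying idea, reusing m_mean, m_moments, and m_variance from the example above; build_variance_task and the exact key format are my assumptions, not the PR's actual code.

import uuid

import dask
import numpy as np


def build_variance_task(data):
    # One uuid per data run (e.g. per cell or tile), so the potentially
    # thousands of concurrent 'mean' tasks stay distinguishable in the graph.
    run_id = uuid.uuid4().hex

    mean_task = dask.delayed(m_mean)(data, dask_key_name=f'mean-{run_id}')
    # Passing mean_task as an argument records the dependency: dask runs
    # 'mean-<uuid>' first and feeds its result into 'moment_base-<uuid>'.
    moments_task = dask.delayed(m_moments)(
        data, mean_task, dask_key_name=f'moment_base-{run_id}')
    return dask.delayed(m_variance)(
        data, moments_task, dask_key_name=f'variance-{run_id}')


data = np.random.default_rng(0).normal(size=1_000)
print(build_variance_task(data).compute())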