lewisfogden / heavylight

A lightweight actuarial modelling framework for Python
https://lewisfogden.github.io/heavylight/
MIT License

Thoughts on bringing heavylight under a single API? #40

Open lewisfogden opened 4 months ago

lewisfogden commented 4 months ago

Changes I'm considering. @MatthewCaseres, I'd like your views too.

  1. Remove BeforeRun / AfterRun, as they are not useful.
  2. Initialisation: remove the need for a user-defined __init__ (i.e. Model behaviour rather than LightModel behaviour).
  3. Backends: combine them behind a backend parameter on Model, covering:
    • standard (i.e. the current model), non-optimising
    • t-2: a simple aggregator that applies agg_func to all methods taking t as a parameter
    • graph: memory-optimised
  4. Both t-2 and graph take the agg_func parameter, which defaults to np.sum.
  5. The graph backend requires a pre-run to optimise; this could sample, say, 50 points from the dataset, or we could pass an optional optimiser_data parameter?
  6. All other keyword arguments become attributes of the class instance (e.g. data/basis), as per current heavylight.Model behaviour.
  7. Using a backend parameter allows us to add different optimisers in future (potentially even a compiler).

The caller might look like this:

proj = Model(data=data, basis=basis, proj_len=120, backend='t-2', agg_func=np.mean)
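For illustration, points 3, 4 and 6 together might look something like the following sketch. The constructor body is an assumption about how the unified API could behave, not existing heavylight code:

```python
import numpy as np

class Model:
    """Sketch of the proposed unified constructor (illustrative only)."""

    def __init__(self, *, proj_len: int = 0, backend: str = "standard",
                 agg_func=np.sum, **kwargs):
        self.proj_len = proj_len
        self.backend = backend      # 'standard', 't-2', or 'graph'
        self.agg_func = agg_func
        # point 6: all other keyword arguments become instance attributes
        for name, value in kwargs.items():
            setattr(self, name, value)

proj = Model(data={"x": 1}, basis="b", proj_len=120,
             backend="t-2", agg_func=np.mean)
```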
lewisfogden commented 4 months ago
  1. Do we need more than one backend? Ignoring syntax differences which we can bridge, what are the benefits of each?
MatthewCaseres commented 4 months ago

graph can't be the same class due to special methods; use an abstract base class?

  1. Graph requires an extra ClearAndOptimize method, so it might not be an instance of the same class as the non-graph backends; at best they could be related through some abstract base class. There are trade-offs such that t-2 (more dynamic control flow, no pre-run) and graph (handles longer-range dependencies) both provide value and can both be offered.

The special method means it can't be of the same class, I believe. I propose we agree on a minimal interface of properties and methods, like this -

interface of the model

def RunModel(self, proj_len: int): ...
def Clear(self): ...
@property
def cache(self): ...
@property
def cache_agg(self): ...
@property
def df(self): ...
@property
def df_agg(self): ...

interface of the cached method

@property
def cache(self): ...
@property
def cache_agg(self): ...
@property
def df(self): ...
@property
def df_agg(self): ...
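To make the "related through some abstract base class" idea concrete, here is a minimal sketch; the class names, method bodies, and return values are illustrative assumptions, not heavylight's actual code:

```python
from abc import ABC, abstractmethod

class ModelBase(ABC):
    """Hypothetical shared interface (names taken from the listing above)."""

    @abstractmethod
    def RunModel(self, proj_len: int): ...

    @abstractmethod
    def Clear(self): ...

    @property
    @abstractmethod
    def cache(self): ...


class GraphModel(ModelBase):
    """Graph backend: implements the shared interface plus its one extra method."""

    def RunModel(self, proj_len: int):
        return f"ran {proj_len} steps"

    def Clear(self):
        pass

    @property
    def cache(self):
        return {}

    def ClearAndOptimize(self):
        # graph-only: pre-run analysis to decide what can be evicted
        pass
```

The non-graph backends would subclass ModelBase in the same way, just without ClearAndOptimize, so callers can rely on the shared interface.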

The keys, values, and sum methods on your cached method are all just calling the equivalent methods on the cache. Proposal: nuke them and make people use the cache directly.

maybe the graph sucks anyways

It is probably 2-10x more memory efficient and handles the long-range dependencies, so maybe don't delete it, but t-2 is so much simpler that it should probably be the recommended approach. The graph could then be relegated to an advanced-usage section or similar, with the t-2 trick featured more prominently.
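The "t-2 trick" referenced above can be illustrated with a minimal sketch. The names run_t2, step_fn, and the survival-style recursion are illustrative assumptions, made under the assumption that each t-indexed method only looks back at most `keep` timesteps, so older per-policy arrays can be aggregated and discarded:

```python
import numpy as np

def run_t2(step_fn, proj_len, agg_func=np.sum, keep=2):
    """Sketch of the t-2 memory trick: keep only the last `keep`
    per-policy arrays, storing just the aggregate for every t."""
    cache = {}   # rolling window of full per-policy arrays
    agg = {}     # aggregated scalar kept for every timestep
    for t in range(proj_len):
        values = step_fn(t, cache)   # may read cache[t-1], cache[t-2]
        cache[t] = values
        agg[t] = agg_func(values)
        cache.pop(t - keep, None)    # full array dropped once out of range
    return agg

# example: simple survival-style recursion pols_if(t) = pols_if(t-1) * 0.9
def pols_if(t, cache):
    if t == 0:
        return np.ones(4)
    return cache[t - 1] * 0.9
```

Because only `keep` full arrays exist at any time, peak memory is independent of proj_len, which is where the claimed efficiency comes from.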

we can all agree

  1. No issues.
  2. Not only in the memory-optimised backends: always provide agg_func, so that people get cache_agg and df_agg even with the base model.
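That second point might look like the following sketch; build_agg_views, the {method: {t: array}} cache shape, and the example values are all illustrative assumptions, not heavylight's actual internals:

```python
import numpy as np
import pandas as pd

def build_agg_views(cache, agg_func=np.sum):
    """Sketch: derive aggregated views from a per-policy cache,
    assuming cache is shaped {method_name: {t: per-policy array}}."""
    cache_agg = {m: {t: float(agg_func(v)) for t, v in tv.items()}
                 for m, tv in cache.items()}
    df_agg = pd.DataFrame(cache_agg)  # rows = t, columns = methods
    return cache_agg, df_agg

cache = {"premium": {0: np.array([10.0, 20.0]),
                     1: np.array([11.0, 21.0])}}
cache_agg, df_agg = build_agg_views(cache)
```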

might agree

  1. I feel this tries to do a bit too much for the user. The graph memory optimisation is somewhat icky compared to not doing it; we can try to hide that from the user, but then the code itself becomes icky. I would rather let the user know how it works and have them make the API calls to get it running, especially if graph is considered an advanced scenario.

certain disagreement

2/6. My understanding is that avoiding __init__ is not idiomatic Python. I need to understand why avoiding it is desirable beyond saving a few lines of code. PyTorch, for example, makes you call super().__init__() on nn.Module. My disagreement with your approach is that I don't know it to be widely used, and it breaks typing; I would like evidence that it is common practice in Python libraries. a. I will be incredibly hard to convert on this because your approach breaks typing.

MatthewCaseres commented 4 months ago

Update: My previous thoughts under the heading "certain disagreement" no longer apply; I no longer disagree on this issue. I will update my code to match the init behaviour of the base Model.

What you are doing seems similar to what https://docs.kidger.site/equinox/api/module/module/ does, and they take the same approach to type hinting that you described previously.

lewisfogden commented 4 months ago

Defined default behaviour with the option to override sounds good; if it mostly calls other methods (e.g. set up cache, run, etc.) then users can easily override it.

I've been working on a model dependency viewer (demo: https://lewisfogden.github.io/heavylight/experiments/term_assurance_graph.html)

I found this was easy to code when I defined the data/basis items as class instances (I made them both dataclasses), which also helps with type inference.

E.g.

from dataclasses import dataclass

import numpy as np

@dataclass
class Data:
    annual_premium: np.ndarray

class UserModel(Model):
    data: Data
MatthewCaseres commented 4 months ago

I could probably generate the graph as well; I use a stack to track dependencies between items. I'm trying to understand what you mean by the model dependencies being easier to generate when Data is a dataclass; I hope the input data isn't related to graph generation?

I usually take any dataframes I use and put them in a class that makes all the columns into numpy arrays. Usually looks like

import pandas as pd

class Modelpoints:
    def __init__(self, df: pd.DataFrame):
        self.pols_if = df["pols_if"].to_numpy()
        # ... more of that ...
lewisfogden commented 3 months ago

When Data is a dataclass (or a normal class like yours), I am using the dis and inspect packages to read in the model and convert it into a graph; this is for a documentation/diagram type view rather than for running the model. Using dis, it's just a little easier to reason about the fields of a dataclass, as a dictionary is defined at run-time and isn't structural.

At the moment, though, I'm using dictionaries; to get from a dataframe I do something like:

data = {col: df[col].values for col in df.columns}
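To illustrate why a dataclass-style attribute is easier for dis/inspect to reason about than a dictionary key: an attribute access records the field name in the function's code object, while a dict key is only a runtime string constant. This is a minimal sketch (uses_attr and uses_dict are hypothetical names, not heavylight code):

```python
class Data:
    annual_premium: float = 0.0

def uses_attr(data: Data):
    return data.annual_premium * 12

def uses_dict(data: dict):
    return data["annual_premium"] * 12

# an attribute access puts the field name into the code object's co_names,
# which static tools built on dis/inspect can read; a dict key is just a
# string constant, so it lands in co_consts with no structural meaning
print("annual_premium" in uses_attr.__code__.co_names)   # True
print("annual_premium" in uses_dict.__code__.co_names)   # False
print("annual_premium" in uses_dict.__code__.co_consts)  # True
```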
MatthewCaseres commented 3 months ago

I now see you are tracking data dependencies and that is why the data format is relevant.

Let me see if I can enumerate the changes that need to be made to have more of a single API.