s-leroux / fin

Set of tools for personal investment
MIT License

Make it easy to calculate values on a complete column. #50

Open s-leroux opened 4 months ago

s-leroux commented 4 months ago

We should make it easy to calculate values on a complete column. Currently, if you want the mean of a column in a series, you have to write:

from fin.seq.serie import Serie
from fin.seq import fc, ag

ser = Serie.from_csv_file(
        "tests/_fixtures/MCD-20200103-20230103.csv",
        format="dnnnnni"
    ).group_by(
        fc.constant(True),
        (ag.first, "Date"),
        (ag.avg, "Open", "Close"),
    )

print(ser)
avg_open = ser["Open"].columns[0][-1]
print(avg_open)

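We can clearly do better. This may imply:

  • Adding a way to directly access data columns by name (FWIW, indices are stored independently from data columns)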

We may leverage aggregate functions for that. Ideally, aggregate functions may also be usable as window functions. This would be less efficient than specifically designing code, but it would avoid code duplication.

Currently, the problem is that aggregate functions are designed to apply to a set of columns rather than individual columns for efficiency reasons. Notice the for col in cols list comprehension in the code below:

https://github.com/s-leroux/fin/blob/6ba60419037c97535bf272a0348aa12a1f04a555/fin/seq/ag/core.py#L25-L30

s-leroux commented 4 months ago

Adding a way to directly access data columns by name (FWIW, indices are stored independently from data columns)

What about:

open_column = ser.columns["Open"]

We need to change the Serie.columns property to return a proxy toward Serie._columns that implements __getitem__ for both integer indices and column names.
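A minimal sketch of what such a proxy could look like, assuming the columns live in a tuple next to a parallel tuple of names. The ColumnsProxy class and its constructor arguments are hypothetical and are not taken from the fin code base:

```python
class ColumnsProxy:
    """Hypothetical read-only view over a tuple of columns.

    Supports lookup by integer position (or slice) and by column name.
    """

    def __init__(self, names, columns):
        if len(names) != len(columns):
            raise ValueError("names and columns must have the same length")
        self._names = tuple(names)
        self._columns = tuple(columns)

    def __len__(self):
        return len(self._columns)

    def __iter__(self):
        return iter(self._columns)

    def __getitem__(self, key):
        if isinstance(key, str):
            # Lookup by name: translate the name into a position first.
            try:
                return self._columns[self._names.index(key)]
            except ValueError:
                raise KeyError(key) from None
        # Otherwise defer to the tuple: integer index or slice.
        return self._columns[key]


# Plain lists stand in for real column objects here.
proxy = ColumnsProxy(("Open", "Close"), ([1.0, 2.0], [1.5, 2.5]))
open_column = proxy["Open"]
close_column = proxy[1]
```

Raising KeyError for an unknown name and IndexError for an out-of-range position mirrors what dict and tuple do, which may be the least surprising behavior for users.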

s-leroux commented 4 months ago
  • [...] (FWIW, indices are stored independently from data columns)

This seems to create more problems than it solves. It is especially confusing from the user's point of view since column indices are "shifted by 1" compared to what can be displayed by print(serie).

I will evaluate the impact of changing that.

s-leroux commented 4 months ago

It is not obvious which version of the code is clearer: the one with separate index and columns, or the one with the index unified into the _columns tuple. I left the code as it was. However, I renamed _columns (and the corresponding property) to _data, and there is now a "new" Serie.columns property that returns (_index, *_data).

s-leroux commented 4 months ago

We may leverage aggregate functions for that. Ideally, aggregate functions may also be usable as window functions. This would be less efficient than specifically designing code, but it would avoid code duplication.

The new aggregate function prototype should be something along the lines of:

def avg(col: Column, begin: int, end: int)

For efficiency, I suggest implementing the root aggregate function class as an extension type (Cython). A derived class would serve as a bridge between the Cython and Python worlds.
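To make the proposed prototype more concrete, here is a pure-Python sketch. It assumes a column can be treated as a plain sequence of floats and that begin and end delimit a half-open range [begin, end). Neither assumption is confirmed in this thread, and the real implementation is meant to be a Cython extension type:

```python
from typing import Sequence


def avg(col: Sequence[float], begin: int, end: int) -> float:
    """Mean of col[begin:end] (half-open range, an assumed convention)."""
    if not 0 <= begin < end <= len(col):
        raise ValueError("empty or out-of-bounds range")
    total = 0.0
    for i in range(begin, end):
        total += col[i]
    return total / (end - begin)


values = [1.0, 2.0, 3.0, 4.0]
whole = avg(values, 0, len(values))  # aggregate over the full column
window = avg(values, 2, 4)           # same code applied to a sub-range
```

With explicit bounds, the same function can serve both as an aggregate (bounds covering the whole column) and as the kernel of a window function (bounds sliding along the column) without copying data.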

s-leroux commented 4 months ago

A solution is implemented in bd3688c856ea1ce3b7efba021428e38ca937172a:

from fin.seq.serie import Serie
from fin.seq import fc, ag

ser = Serie.from_csv_file(
        "tests/_fixtures/MCD-20200103-20230103.csv",
        format="dnnnnni"
    ).group_by(
        fc.constant(True),
        (ag.first, "Date"),
        (ag.avg, "Open", "Close"),
    )

print(ser)
avg_open = ag.avg(ser.columns["Open"])
print(avg_open)

s-leroux commented 3 months ago

Reopening this issue.

Concerning "window functions", that is, column functions that operate on columns through a rolling window of values: there are obvious pathways between "aggregate functions" and "window functions".

The core differences are that aggregate functions:

On the other hand, window functions, like other column functions:

An aggregate function can be seen as a window function whose window length equals the data length. On the other hand, a window function can be seen as the successive application of an aggregate function on each window in its turn.

A strict class hierarchy is not obvious between aggregate and column functions, though. Especially given the polymorphic nature of columns' data.

Leveraging Python's dynamic nature, we may add the asAggregateFunction and asWindowFunction methods to all function instances to return an adapter of the appropriate type. The goal here is to reduce code duplication between aggregate and column functions.

s-leroux commented 3 months ago

On the other hand, column functions:

* May operate on 0 to n columns

* Return 1 or several columns

* Only (in the current implementation) operate on full columns

Also, most window functions have a "warm-up" period during which their value is undefined, until enough data has been processed. As an example, the average(10) window function will first return 9 NaN values, and an actual value only from the 10th cell onward.
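The warm-up behavior can be illustrated with a rolling mean over a plain list. The rolling_mean name and its signature are illustrative only and do not reflect the fin API. The running-sum update is one common way to specialize the computation for a rolling window:

```python
import math


def rolling_mean(values, n):
    """Rolling mean of width n; the first n - 1 cells are NaN (warm-up)."""
    if n <= 0:
        raise ValueError("window length must be positive")
    result = []
    running = 0.0
    for i, x in enumerate(values):
        running += x
        if i >= n:
            running -= values[i - n]  # drop the value leaving the window
        # Until n values have been seen, the mean is undefined.
        result.append(running / n if i >= n - 1 else math.nan)
    return result


out = rolling_mean([float(i) for i in range(1, 13)], 10)
# out[0] .. out[8] are NaN; out[9] is the first defined value.
```

Each cell costs a constant number of operations here, whereas recomputing the mean from scratch for every window would cost a number of operations proportional to n.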

Many algorithms can be optimized for rolling windows. As a corollary, the implementation of an aggregate function is usually simpler. When implementing a new function that is usable both as an aggregate function and as a window function, we may:

  1. Start by implementing the simpler aggregate function and using a transparent adapter from aggregate to window function regardless of performance considerations.
  2. If it appears an optimization for a rolling window is required, we may have both implementations live in the code.
  3. To avoid code duplication and ease maintenance, the aggregate function might be replaced by an adapter from the window to the aggregate function, once again ignoring the performance penalties.

So, depending on the use case, we may need both "window -> aggregate" and "aggregate -> window" adapters.
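Both adapters could be sketched as follows. This assumes an aggregate function maps a sequence to a single value and a window function maps a sequence to a sequence of the same length. The names as_window, as_aggregate and mean are hypothetical, not existing fin functions:

```python
import math


def as_window(agg, n):
    """Adapt an aggregate (sequence -> value) into a window function of width n."""
    def window_fn(values):
        return [
            # Apply the aggregate to each window in turn; NaN during warm-up.
            agg(values[i - n + 1 : i + 1]) if i >= n - 1 else math.nan
            for i in range(len(values))
        ]
    return window_fn


def as_aggregate(make_window_fn):
    """Adapt a window-function factory (n -> window function) into an aggregate."""
    def agg(values):
        # Use a window as long as the data: only the last cell is defined.
        return make_window_fn(len(values))(values)[-1]
    return agg


def mean(values):
    return sum(values) / len(values)


data = [1.0, 2.0, 3.0, 4.0]
rolling = as_window(mean, 2)(data)                              # aggregate -> window
total_mean = as_aggregate(lambda n: as_window(mean, n))(data)   # window -> aggregate
```

Neither adapter is efficient: as_window re-runs the aggregate on every window, and as_aggregate computes a whole column only to keep its last cell. That matches the trade-off described above, where the adapters favor less code duplication over performance. Empty input is not handled in this sketch.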