pythonianfr / tshistory_formula

GNU Lesser General Public License v2.1

A high performance multi-get? #2

Open LB365 opened 1 year ago

LB365 commented 1 year ago

Business need: all users want a fast/efficient tsa.multiget that is not just a for-loop (which is what we currently default to in all our scripts / tshistory_xl when we need it). The potential speedup comes from the fact that the different series we jointly request have a very similar dependency structure (typically balances), i.e. we want to pull (series "complicated"), (cumsum (series "complicated")), etc.

In designing an efficient "multiget", we should try to limit the amount of repeated computation that a naive for-loop approach would entail. A first simple multiget endpoint could have the following signature:

class timeseries:
    ...
    def multiget(self, list_of_series, *getargs, **getkwargs) -> pd.DataFrame:
        return frame

print(frame)
%==================
name,date,value
complicated,2021-1-1,6.0
complicated2,2022-1-1,5.0
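For reference, here is a sketch of the naive for-loop baseline this proposal wants to improve on, producing the long-format frame shown above. The `multiget_naive` name and the `tsa.get(name, **kwargs)` wiring are assumptions for illustration, not the actual API:

```python
import pandas as pd

def multiget_naive(tsa, list_of_series, **getkwargs):
    """Naive baseline: one tsa.get call per series, stacked long-format.

    `tsa` is assumed to expose get(name, **kwargs) -> pd.Series indexed
    by date (hypothetical wiring). Every series is evaluated in full,
    so shared subtrees are recomputed for each requested series.
    """
    frames = []
    for name in list_of_series:
        ts = tsa.get(name, **getkwargs)
        if ts is None:
            continue
        frames.append(
            pd.DataFrame({'name': name, 'date': ts.index, 'value': ts.values})
        )
    if not frames:
        return pd.DataFrame(columns=['name', 'date', 'value'])
    return pd.concat(frames, ignore_index=True)
```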

Potential logic

From what I understand of the codebase (https://github.com/pythonianfr/tshistory_formula/blob/main/tshistory_formula/evaluator.py#L119), when I evaluate (add (cumsum (series "a")) (cumsum (cumsum (series "a")))), it computes all the terminal leaves and passes the results up to the higher formula levels.

tshistory_formula/evaluator.py#L119

    # at this point, we have a function, and all the arguments
    # have been evaluated, so we do the final call
    print(f"tree being evaluated: {tree}")
    return proc(*posargs, **kwargs)
%==============================
tree being evaluated: ['cumsum', ['series', 'a']]
tree being evaluated: ['cumsum', ['series', 'a']]
tree being evaluated: ['cumsum', ['cumsum', ['series', 'a']]]
tree being evaluated: ['add', ['cumsum', ['series', 'a']], ['cumsum', ['cumsum', ['series', 'a']]]]

Key idea: Each time we run into the same previously computed tree, we should serve the request from memory.

How about encoding the tree variable as a string via a hash, so that we can wrap the _evaluate function in a cache decorator? Caching the env argument would require a bit more thought... maybe it needs to live in the outer scope of _evaluate, or not be mutated at all, as is done here (https://github.com/pythonianfr/tshistory_formula/blob/main/tshistory_formula/evaluator.py#L64)?
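A minimal sketch of that idea, under the assumption that trees are nested lists of strings and that env is not part of the cache key (only safe if env is never mutated during evaluation, the concern raised above). `serialize_tree` and `make_cached_evaluate` are hypothetical names, not part of the real evaluator:

```python
import json
from functools import lru_cache

def serialize_tree(tree):
    # json.dumps on nested lists of strings is deterministic, so
    # identical subtrees always map to the same hashable key
    return json.dumps(tree)

def make_cached_evaluate(evaluate):
    """Wrap a tree evaluator so identical trees are computed once.

    `evaluate` stands in for the real _evaluate; env is deliberately
    left out of the key (a simplifying assumption).
    """
    @lru_cache(maxsize=None)
    def _cached(key):
        return evaluate(json.loads(key))

    def cached_evaluate(tree):
        return _cached(serialize_tree(tree))

    return cached_evaluate, _cached
```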

It does seem that with the new let implementation handling the request-boundaries logic, this could be tricky to re-engineer, although the datetime boundaries can easily be hashed if they are the only part being mutated.

The standard lru_cache from functools appears to be thread-safe, so we could keep the threadpool implementation: https://github.com/python/cpython/commit/3b704874592a3a1895c9a69a39c23ea22d1d06ca
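A quick sanity-check sketch of that combination: a shared lru_cache-wrapped evaluation function called from a thread pool. Note that lru_cache guarantees correctness under concurrent access, but two threads missing on the same key simultaneously may both compute it; the function and names below are illustrative only:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=None)
def evaluate_key(key):
    # stand-in for evaluating one hashed formula tree
    return key * 2

def multiget_threaded(keys, workers=4):
    # hypothetical sketch: keep the existing threadpool evaluation,
    # with the per-tree work behind a shared lru_cache
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate_key, keys))
```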

A second, smaller optimisation would be to order the requested series, nice_order(list_of_series), so that we get the maximum number of cache hits. We could use text distances (https://pypi.org/project/textdistance/) on the hashed trees to order the list from least similar to most similar; the end of the list would then be served mostly from cache hits.

With that, we would effectively hit the cache every time we run into the same tree, which would be very useful when a list_of_series is requested. That would potentially speed up the requests and pave the way for an improved get_many.

Happy to hear your thoughts. @zogzog @natacha-pythonian Loic

zogzog commented 1 year ago

Thanks for the proposal. I did a poc for something like that a few years ago but the idea can certainly be revisited. I agree there is an opportunity for interesting optimisations there (I did the caching part). The topological sort is already provided in the refinery repo (for the purposes of making the cache computation efficient) and may be recycled for your second idea.

LB365 commented 1 year ago

Ah ok, very cool. Where is the caching you mention? Is it just caching functions, or formula outputs?

zogzog commented 1 year ago

The caching I mention was in the POC I did back then, to avoid repeatedly requesting/computing the same series while loading a bunch of them all at once. That implementation was not going to fly anyway; I just wanted to test the performance (and it showed significant promise). It got overridden, in the context of the power dashboards, by triggering asynchronous computations on the client side. The net effect was worse actual performance but better perceived performance.

zogzog commented 1 year ago

In "for the purposes of making the cache computation efficient" I am talking about the formula cache system.

LB365 commented 1 year ago

Here you go: https://github.com/pythonianfr/tshistory_formula/pull/3, a first implementation idea for the cached evaluation. I haven't implemented tsa.multiget properly yet, as I would like your opinion on the cache_pevaluate implementation first. Speedups for the multiget range from 100% to 200%, and are likely better when formulas get really big and we request series that are nodes of the same large formula (typical for balances).

zogzog commented 1 year ago

I put this there https://todo.sr.ht/~pythonian/the-data-refinery/193 for the record. The multi-get must work for all series of course and must do something efficient in all cases.

Needless to say I am also not especially fond of your implementation. In the evaluator, the changes should be minimal if not nil.

We probably want fast multi get for:

When caching stuff, the key should be the series id plus the request parameters. Remember that in a formula those can vary (see the asof, time-shifted etc. operators for why). Since this is fundamentally about performance, let's take the time to address all relevant performance aspects from the start.