Closed: dberardo closed this issue 2 years ago.
Hey man.
You're describing federated learning. The short answer is that we don't cover this, so no, you can't distribute the training of a model.
If you take a deep look at River, you'll see that Mean and Var are mergeable:
from river import stats
stats.Mean().update(1).update(2) + stats.Mean().update(3).update(4)
That's the lowest building block of federated learning: merging statistics.
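For instance, here's a minimal sketch of the same idea, written without chaining so it also works with River versions where update() mutates the statistic in place and returns nothing; the merged statistic matches what you'd get from the full stream:

from river import stats

left, right = stats.Mean(), stats.Mean()
for x in (1, 2):
    left.update(x)   # this replica only sees the first half of the stream
for x in (3, 4):
    right.update(x)  # this replica only sees the second half

merged = left + right  # merge the two partial statistics
print(merged.get())    # 2.5, same as the mean of [1, 2, 3, 4]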
The trick is that the merging depends on the underlying algorithm. You can't write a generic function for this. I'm not opposed to having mergeable models in River; I just wouldn't know where to start. Maybe decision trees could be merged, I don't know. It sounds a bit like science fiction to me as of now, but who knows.
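To give an idea of why it's algorithm-specific, here's a hedged sketch of what a merge could look like for linear models, by averaging coefficients (FedAvg-style). average_weights is a made-up helper, not a River API, and the usage below assumes LinearRegression exposes its coefficients as a dict-like .weights attribute:

from river import linear_model

def average_weights(weight_dicts):
    # FedAvg-style merge: average each coefficient across replicas.
    # Only meaningful for models whose parameters form a flat weight vector;
    # this is not a generic recipe for arbitrary River models.
    features = {f for w in weight_dicts for f in w}
    n = len(weight_dicts)
    return {f: sum(w.get(f, 0.0) for w in weight_dicts) / n for f in features}

# Hypothetical usage: each replica trains on its own partition of the stream,
# then the coefficients are averaged centrally (assumes a .weights dict).
replicas = [linear_model.LinearRegression() for _ in range(3)]
merged_coefficients = average_weights([dict(r.weights) for r in replicas])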
Thanks for the reply, I just wanted to make sure that this was indeed "a conceptual issue" and not something I might have missed along the way xD
I will perhaps implement my own strategy to pick the best model to store, based on some kind of rolling metric, but that will be done when needed.
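Something along those lines could look like the rough sketch below. The Replica wrapper and its rolling error are made up for illustration; only predict_one and learn_one are actual River methods:

import collections
from river import linear_model

class Replica:
    # Made-up wrapper: one model instance plus a rolling absolute error
    # over its last `window` observations.
    def __init__(self, model, window=100):
        self.model = model
        self.errors = collections.deque(maxlen=window)

    def learn(self, x, y):
        y_pred = self.model.predict_one(x)
        self.errors.append(abs(y - y_pred))
        self.model.learn_one(x, y)

    @property
    def rolling_mae(self):
        return sum(self.errors) / len(self.errors) if self.errors else float("inf")

replicas = [Replica(linear_model.LinearRegression()) for _ in range(4)]
# ... each replica consumes its own partition of the stream via replica.learn(x, y) ...
best = min(replicas, key=lambda r: r.rolling_mae)  # keep this one instead of merging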
I think, however, that model federation could belong in Beaver (when it's ready), since a centralized service where the merging happens needs to exist.
This is perhaps more of a conceptual question than an MLOps-related one, so it could be moved to RiverML.
What is the best approach to train an online learning model which is running in multiple parallel instances? I am thinking, for example, of very high-frequency applications where streaming data is partitioned over different model instances to achieve higher throughput or better load balancing (a sketch of this setup is given below).
The question here is: since every instance of the model will only see a portion of the data, how should the trained models' parameters be aggregated to obtain a single model?
Is this an anti-pattern, or should one just keep the model parameters that perform best (a sort of model selection)?
Is Beaver addressing this issue too?
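For context, a minimal sketch of the setup described above, assuming events carry a key that is hash-routed to independent model instances; the routing function and instance count are illustrative and not part of River or Beaver:

import zlib
from river import linear_model

N_INSTANCES = 4
instances = [linear_model.LinearRegression() for _ in range(N_INSTANCES)]

def route(key: str) -> int:
    # Stable hash so the same key always lands on the same instance.
    return zlib.crc32(key.encode()) % N_INSTANCES

def handle_event(key, x, y):
    model = instances[route(key)]
    y_pred = model.predict_one(x)   # each instance only ever sees its own partition
    model.learn_one(x, y)
    return y_pred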