[Predictoor bot approach4] Extend approach3 to approach4 - Build/train "first & once", infer many times. Towards a single VM/process.

idiom-bytes commented 1 year ago

Motivation

Running e2e on a single VM has a lot of challenges. Parts of the system/logic are ready to support multiple feeds w/ a single agent/wallet, but parts of it are not.

Right now, the most stable way to run agents require 1 agent per feed/timeframe, inside individual containers.

The lowest friction is to run N feeds on a single agent, but the system can't quite handle this at the moment. Internal users are therefore running into issues when configuring their predictoors.

This Ticket/PR seeks to address various e2e issues towards running a single agent/vm across multiple feeds (on a singular timeframe, example: 1h).

[Agent3 experiences issues when running on a single thread]

predictoor_agent_3.py get_predictions() builds, trains, and predicts each round. This forces multiple containers/threads by default due to
predictoor_agent_3.py has many hardcoded values (signals, st_timestamp, model_ss, model_exchange) that make configuring a bit challenging
predictoor_agent_3.py get_prediction is not async so it locks up resources while building/training/inferring (yes, I understand ML models requires heavy HW resources). In a "build now infer later" model we can infer across N prediction feeds async.
base_predictoor_agent.py assumes that time does not pass between each getPrediction although it can be configured to serve multiple feeds w/ a single agent. This is because it's using the timestamp from the last_block rather than now() to identify epoch_s_left rather than datetime.now() as pictured below.

[Agent3 experiences issues with it's current model]

Default Agent3 config experiences NaN errors due to Binance dataset having NaN Values and throws exceptions (Linear model doesn't handle NaN while HistGradientBoost does)

DoD:

[ ] Agent builds/trains model once on startup for each feed, and then uses that same model to predict future outcomes.
[ ] Agent can be configured more easily and behaves according to the configuration
[ ] Agent is able to predict, sign nonces, and submit_predictions async in such a way that it can submit for 10 feeds for a single epoch
[ ] Agent checks if errors/retries happened and stops trying if there is no more time left to submit
[ ] data_factory patches NaN records w/ last_valid_record such that it can build/train/predict without experiencing any issues due to malformed data, improving DX

idiom-bytes commented 1 year ago

@trentmc I understand some of the intuitions as to why we're building dynamically.

However, me and @jfdelgad were talking about the dynamic model constantly rebuilding+training to predict every epoch. We had talked about how this seems like a bit of an anti-pattern since we can't measure the performance of a single model over time, we are instead measuring the performance of many individual model instances over time (looking forward to educating myself more on this).

I'm happy to focus on a single approach, my key concern here are:

Lowering required provisions to run a bot (single vm vs. container)
Making it easy to configure and deploy a predictoor agent to cover many feeds (i.e. 1 agent/container handling 10x 5m feeds)
Having an agent that runs e2e without having to think too much about NaN data or other details that make it challenging to configure/run Agent3

What I think maybe a limitation of the dynamic model to support N feeds, might instead be limitations of how we are submitting predictions via web3.py. Since it seems like there are async solutions for this, we may be able to simplify overall DX/ops further.

This ticket is encompassing several changes on my end so I can explore solutions towards a simpler configuration/VM, which I can then rollback into the system.

trentmc commented 1 year ago

It is not an antipattern. While you can't measure the performance of a model that is rebuilt every time, you can measure the performance of a model's algorithm and training parameters. I know this well: in my previous company we did exactly this. We had to ship tools that would build models completely blackbox that we couldn't see. We heavily benchmarked many approaches.

trentmc commented 1 year ago

But please do go ahead and go for your own approach 4.

idiom-bytes commented 1 year ago

@trentmc I do not want to create yet another Agent, so I'll just extend/override Agent3 functions so I can support the e2e on a single vm.

TLDR: Basically, Agent3 is currently implemented to support multiple feeds at the same time. But if you try to PAIR_FILTER : "BTC/USDT,ETH/USDT,ADA/USDT,...", it does not work well out of the box.

In practice: The linear model builds/trains/runs quickly... however if DataFactory needs to pull data this blocks other feeds from running in parallel. The key blocker right now though, is the blocking calls to submit a tx onchain.

I'm looking at the sapphire lib so we can separate the signing of the txs from the submitting, such that we can do the prediction>tx workflow async so nonces/signatures are handled correctly, and then batch broadcast as many as possible into a single blocking call. https://github.com/oceanprotocol/sapphire.py/blob/main/go/main.go

Perhaps the _process_block_at_feed() could be executed async to complete all get_predictions(), and whatever completes in 10s, 20s, 30s chunks gets batched into a single tx until s_left_to_submit < 0.

for addr in self.feeds:
self._process_block_at_feed(addr, block["timestamp"])

trentmc commented 1 year ago

Oops, closed wrong issue

idiom-bytes commented 1 year ago

I got the answer I was looking for.

Sapphire web3gateway does not allow you to queue txs, we discussed this before both internally and with Oasis team, please don't waste time on this.

TLDR; Perhaps prediction agent config should assert if user tries to run multiple feeds.

@KatunaNorbert as per the discussion and approach you were taking, that is not going to work very well. There isn't much here to improved at the moment. There is only way through at the moment, and that is running N agents/accounts to cover N feeds.

I'm going to close this ticket/branch and migrate improvements accordingly.

oceanprotocol / pdr-backend

[Predictoor bot approach4] Extend approach3 to approach4 - Build/train "first & once", infer many times. Towards a single VM/process. #327

Motivation

DoD: