oceanprotocol / pdr-backend

Instructions & code to run predictoors, traders, more.
Apache License 2.0

[Predictoor bot approach4] Extend approach3 to approach4 - Build/train "first & once", infer many times. Towards a single VM/process. #327

Closed: idiom-bytes closed this issue 1 year ago

idiom-bytes commented 1 year ago

Motivation

Running e2e on a single VM has a lot of challenges. Parts of the system/logic are ready to support multiple feeds w/ a single agent/wallet, but parts of it are not.

Right now, the most stable way to run agents requires 1 agent per feed/timeframe, inside individual containers.

The lowest-friction setup is to run N feeds on a single agent, but the system can't quite handle this at the moment. Internal users are therefore running into issues when configuring their predictoors.

This ticket/PR seeks to address various e2e issues toward running a single agent/VM across multiple feeds (on a single timeframe, e.g. 1h).

[Agent3 experiences issues when running on a single thread]

  1. predictoor_agent_3.py get_predictions() builds, trains, and predicts every round. This forces multiple containers/threads by default, because each feed's build/train work ties up the process (see item 3).
  2. predictoor_agent_3.py has many hardcoded values (signals, st_timestamp, model_ss, model_exchange) that make configuration challenging
  3. predictoor_agent_3.py get_prediction() is not async, so it locks up resources while building/training/inferring (yes, I understand ML models require heavy HW resources). In a "build now, infer later" model we could infer across N prediction feeds async.
  4. base_predictoor_agent.py assumes that no time passes between each getPrediction call, even though it can be configured to serve multiple feeds w/ a single agent. This is because it identifies epoch_s_left using the timestamp from last_block rather than datetime.now(); see the sketch after this list.
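
For illustration, here's a minimal sketch of the drift (constant and function names are assumptions, not the repo's API):

import time

S_PER_EPOCH = 300  # assumed epoch length in seconds

def epoch_s_left_from_block(block_timestamp: int) -> int:
    # What the agent does today: every feed in the loop sees the same
    # "seconds left", because block_timestamp is frozen for the iteration.
    return S_PER_EPOCH - (block_timestamp % S_PER_EPOCH)

def epoch_s_left_now() -> int:
    # What this ticket suggests: re-read the wall clock per feed.
    return S_PER_EPOCH - (int(time.time()) % S_PER_EPOCH)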

[Agent3 experiences issues with its current model]

  1. The default Agent3 config hits NaN errors because the Binance dataset contains NaN values, and throws exceptions (the linear model doesn't handle NaNs while HistGradientBoost does); see the sketch below.
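
For illustration, a minimal sketch of that difference in scikit-learn (toy data, not the actual pdr-backend pipeline):

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Toy data with a NaN, as can appear in a raw Binance OHLCV pull.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

HistGradientBoostingRegressor().fit(X, y)  # OK: handles NaN natively

try:
    LinearRegression().fit(X, y)  # raises ValueError: input contains NaN
except ValueError as err:
    print(err)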

DoD:

idiom-bytes commented 1 year ago

@trentmc I understand some of the intuitions as to why we're building dynamically.

However, @jfdelgad and I were discussing how the dynamic model constantly rebuilds+trains to predict every epoch. We talked about how this seems like a bit of an anti-pattern, since we can't measure the performance of a single model over time; instead we are measuring the performance of many individual model instances over time (looking forward to educating myself more on this).

I'm happy to focus on a single approach; my key concern here is:

What I think may be a limitation of the dynamic model in supporting N feeds might instead be a limitation of how we are submitting predictions via web3.py. Since there seem to be async solutions for this, we may be able to simplify overall DX/ops further.

This ticket encompasses several changes on my end so I can explore solutions toward a simpler configuration/VM setup, which I can then roll back into the system.

trentmc commented 1 year ago

It is not an anti-pattern. While you can't measure the performance of a model that is rebuilt every time, you can measure the performance of a model's algorithm and training parameters. I know this well: at my previous company we did exactly this. We had to ship tools that built models as complete black boxes we couldn't inspect, and we heavily benchmarked many approaches.
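
For example, a walk-forward loop scores the algorithm + training parameters even though a fresh model is fit at every step (a minimal sketch; assumes binary labels with both classes present in each window):

import numpy as np
from sklearn.linear_model import LogisticRegression

def walk_forward_accuracy(X, y, train_size=100):
    # Rebuild the model at every step, exactly like the dynamic agent does,
    # and aggregate accuracy across steps: the score measures the approach,
    # not any single model instance.
    hits = []
    for t in range(train_size, len(y)):
        model = LogisticRegression()
        model.fit(X[t - train_size:t], y[t - train_size:t])
        hits.append(model.predict(X[t:t + 1])[0] == y[t])
    return float(np.mean(hits))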

trentmc commented 1 year ago

But please do go ahead with your own approach 4.

idiom-bytes commented 1 year ago

@trentmc I do not want to create yet another Agent, so I'll just extend/override Agent3 functions so I can support the e2e on a single VM.

TLDR: Basically, Agent3 is currently implemented to support multiple feeds at the same time. But if you set PAIR_FILTER: "BTC/USDT,ETH/USDT,ADA/USDT,...", it does not work well out of the box.

In practice: the linear model builds/trains/runs quickly. However, if DataFactory needs to pull data, this blocks other feeds from running in parallel. The key blocker right now, though, is the blocking calls to submit a tx onchain.

I'm looking at the sapphire lib so we can separate the signing of txs from the submitting, such that the prediction>tx workflow can run async with nonces/signatures handled correctly, and then batch-broadcast as many txs as possible in a single blocking call. https://github.com/oceanprotocol/sapphire.py/blob/main/go/main.go
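
For illustration, web3.py already separates the two steps; a minimal sketch (not pdr-backend's actual code; build_prediction_tx() and the RPC URL are hypothetical):

from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://example-rpc"))  # placeholder RPC URL

def sign_all(predictions, account_addr, private_key):
    # Sign offline with explicit nonces so txs can be prepared async.
    nonce = w3.eth.get_transaction_count(account_addr)
    return [
        w3.eth.account.sign_transaction(
            build_prediction_tx(pred, nonce=nonce + i),  # hypothetical helper
            private_key,
        )
        for i, pred in enumerate(predictions)
    ]

def broadcast_all(signed_txs):
    # The only blocking section: fire raw txs without waiting for receipts.
    # (web3.py v7+ renames .rawTransaction to .raw_transaction)
    return [w3.eth.send_raw_transaction(s.rawTransaction) for s in signed_txs]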

Perhaps _process_block_at_feed() could be executed async to complete all the get_predictions() calls, with whatever completes in 10s/20s/30s chunks batched into a single tx until s_left_to_submit < 0. Today's loop is synchronous:

for addr in self.feeds:
    self._process_block_at_feed(addr, block["timestamp"])
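
An async version might look like this (a minimal sketch; assumes _process_block_at_feed() can run in a worker thread, and s_left_to_submit() / _batch_submit() are hypothetical helpers):

import asyncio

async def process_all_feeds(self, block_timestamp):
    # Run every feed's prediction concurrently in worker threads.
    pending = {
        asyncio.create_task(
            asyncio.to_thread(self._process_block_at_feed, addr, block_timestamp)
        )
        for addr in self.feeds
    }
    # Batch whatever finishes in each 10s window until the deadline passes.
    while pending and self.s_left_to_submit() > 0:
        done, pending = await asyncio.wait(pending, timeout=10)
        if done:
            self._batch_submit([t.result() for t in done])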
trentmc commented 1 year ago

Oops, closed wrong issue

idiom-bytes commented 1 year ago

I got the answer I was looking for.

Sapphire's web3 gateway does not allow you to queue txs; we discussed this before, both internally and with the Oasis team. Please don't waste time on this.

TLDR: perhaps the prediction agent config should assert if the user tries to run multiple feeds.
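
A minimal sketch of such a guard (assumed config shape, not the repo's actual check):

def validate_feed_count(feeds: list) -> None:
    # Fail fast: one predictoor agent currently supports exactly one feed.
    assert len(feeds) == 1, (
        f"Got {len(feeds)} feeds; agents currently support exactly 1 feed. "
        "Run N agents/accounts to cover N feeds."
    )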

@KatunaNorbert as per the discussion and the approach you were taking, that is not going to work very well. There isn't much to improve here at the moment. The only way through right now is running N agents/accounts to cover N feeds.

I'm going to close this ticket/branch and migrate improvements accordingly.