oceanprotocol / pdr-backend

Instructions & code to run predictoors, traders, more.
Apache License 2.0

[Super Epic] Analytics for predictoors & traders, to answer "How much $ am I making" then drill in #1328

Open trentmc opened 8 months ago

trentmc commented 8 months ago

Background / motivation

Our core users are predictoors & traders who use our pdr-backend python bots. We want to reduce friction for them.

Even though they operate largely in python-land, there are things we can do in the webapp to help them out.

Top (or near top) of the list is to help them answer the Q: "How much $ am I making / losing". There are many drill-down Q's that emerge from that.

This issue covers both traders & predictoors, because there will be overlap.

TODOs

Key reference: pdr-analytics prototypes (Gslides)

Related

KatunaNorbert commented 8 months ago

Predictoor Q's:

- What was the total available reward for the last x hours / x days?
- How many right/wrong predictions did I have in the last x hours / x days?
- How many tokens did I win/lose on predictions in the last x hours / x days?
- How much am I making compared to other predictoors?
- How much revenue comes in via sales?

Trader Q's:

- How much did I spend on subscriptions in the last week?
- Which assets have the most sales?

trentmc commented 8 months ago

Predictoor Q's:

What I envision for rendering this data:

Slides:

trentmc commented 8 months ago

Update: I've done some first cut pencil-and-paper prototypes in the 2023 10 FE Prototypes GSlides. I'm sharing early to communicate how I'm thinking about this. It's a CAD tool for predictoors and traders! :)

I don't plan to spend more time on this right now. I'll only do so when this overall issue becomes a priority, which may or may not be soon.

idiom-bytes commented 8 months ago

Hi, as we need to implement accuracy calculations for the FE (2000 samples), I recommended that @kdetry start thinking about how to build this:

  1. inside pdr-backend
  2. using python

High level flow of how this might get used:

trentmc commented 8 months ago

When I presented the prototypes this past Thursday, I described how we can evolve from something super-simple to a high-quality webapp.

Here I flesh it out, as practical as possible, pointing to code that exists and that can be evolved.

The spectrum, from simplest first:

  1. Status quo. In pdr_backend "simulation flow" (pdr_backend/predictoor/approach3), locally generate & show matplotlib plots at the end of the sim. Show profit vs. time for traders, predicted vs actual for predictoors, etc. We already have a first-cut of this, and will improve organically. Eg #272 profit vs time for predictoors.
  2. Add realtime. In pdr_backend "simulation flow", locally generate & show matplotlib plots in realtime as the simulation progresses. pdr-backend#279
  3. Put in bot flow. In pdr_backend "predictoor bot flow" & "trader bot flow", for local bots, the bot itself generates & shows matplotlib plots in realtime. pdr-backend#280
  4. Plot from different process. In pdr_backend "bot flows", for local bots, a separate local process grabs chain data then generates & shows matplotlib plots in realtime. The separate process is a new directory pdr_backend/analytics. It uses subgraph query to grab chain data. Bonus: this allows bots to be run remotely too. pdr-backend#281
  5. Put in webapp. In pdr_backend "bot flows", for local or remote bots, a separate local pdr_backend/analytics process grabs chain data, then generates & renders pythonic plots into a webapp via streamlit or dash. The analytics process serves up an API consumed by webapp. pdr-backend#282
  6. Remote analytics service. The analytics process API is run as a remote web service.

We have (1). We can do a "tracer bullet" starting with (2) and going all the way through (6). Then we can continually flesh out plots at the level of (1-2: simulation flow), and as they mature we pull them into (4-6: analytics service).
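For instance, step (2) could be as little as redrawing a figure inside the sim loop. A minimal sketch, not the actual pdr-backend code (`run_one_iteration` is a hypothetical stand-in for the real sim step):

```python
# Hypothetical sketch only -- not the actual pdr-backend simulation code.
import random
import matplotlib.pyplot as plt

def run_one_iteration(i: int) -> float:
    """Stand-in for one simulation step; returns that iteration's profit."""
    return random.uniform(-1.0, 1.0)

def simulate_with_realtime_plot(n_iters: int = 100) -> None:
    plt.ion()  # interactive mode: redraw without blocking the loop
    _, ax = plt.subplots()
    profits, total = [], 0.0
    for i in range(n_iters):
        total += run_one_iteration(i)
        profits.append(total)
        ax.clear()
        ax.plot(profits)
        ax.set_xlabel("iteration")
        ax.set_ylabel("cumulative profit")
        plt.pause(0.01)  # give the GUI event loop time to refresh the figure
    plt.ioff()
    plt.show()

if __name__ == "__main__":
    simulate_with_realtime_plot()
```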

Update: I converted (2-5) to github issues, linked above. And all this work is part of a new issue: "[EPIC] [Simulation, bots] Easy-to-use & powerful simulation --> predictoor/trader bot flow" pdr-backend#278

trentmc commented 8 months ago

@idiom-bytes wrt your comment about what @kdetry could be doing: it's really describing an architecture for step (6) in my comment above -- "remote analytics service".

Rather than directly jumping to (6), it might be wise to go through steps (2)-(5) first, tracer-bullet style. This will ensure that we have a pipeline from rough prototypes (steps 1-2) all the way through to a production remote analytics service (step 6).

Thoughts?

idiom-bytes commented 8 months ago

2000-sample Accuracy

I generally agree that the analytics system could be responsible for helping to execute the data/graphics work across all 6 features. However, since some of these are already working, and we want to onboard others, it might be easier to onboard with small tasks from (6) and then bring other existing workflows (1) over to this module.

Example: (6) right now is really small. It just needs 2k-sample accuracy for the 2 timeframes, so:

  1. leverage approach3/data_factory.py to do all the checkpointing/fetching
  2. pull all prediction results from subgraph
  3. calculate accuracy_5m, accuracy_1h
  4. serve these via flask/gunicorn
  5. serve these via the python module
  6. update pdr-web to fetch the accuracy from the server
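A minimal sketch of steps 3-4, assuming prediction records carry a boolean "correct" field; `fetch_predictions` is a hypothetical stand-in for the real subgraph query:

```python
# Hypothetical sketch: compute 2k-sample accuracy per timeframe and serve it
# over HTTP. fetch_predictions() is a stand-in for the real subgraph query,
# and the "correct" field name is an assumption.
from flask import Flask, jsonify

app = Flask(__name__)

def fetch_predictions(timeframe: str, n: int = 2000) -> list:
    """Stand-in: return the latest n prediction records for a timeframe,
    each a dict like {"correct": True}."""
    return []

def accuracy(timeframe: str) -> float:
    preds = fetch_predictions(timeframe)
    if not preds:
        return 0.0
    return sum(1 for p in preds if p["correct"]) / len(preds)

@app.route("/accuracy")
def get_accuracy():
    return jsonify(accuracy_5m=accuracy("5m"), accuracy_1h=accuracy("1h"))

if __name__ == "__main__":
    # dev server only; in production this would run behind gunicorn
    app.run(port=8000)
```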

How:

There is a script named data_factory.py which does some nice work to maintain a checkpoint of how much it has downloaded so far. I imagine that data_factory, and some of the work that Trent has done so far, would benefit from being abstracted and moved into something more general like /utils/ so other systems can use it.

Inside pdr-backend, you should be able to just import/instantiate/configure a data_factory and start using it.

Server vs. Local

The server:

The bot:

Further updates

I also propose that data_factory gets an update to use polars + parquet to do this. It's incredibly fast, and will enable us to grow.
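To make that concrete, a rough sketch of what a polars + parquet checkpoint could look like (file path and column names are assumptions, not the current data_factory API):

```python
# Rough sketch, with assumed file path and column names (not the current
# data_factory API): append newly fetched rows to parquet, and use the max
# stored timestamp as the checkpoint for the next fetch.
import os
import polars as pl

LAKE_FILE = "pdr_backend/data/predictions.parquet"  # hypothetical path

def append_rows(new_rows: pl.DataFrame) -> None:
    if os.path.exists(LAKE_FILE):
        existing = pl.read_parquet(LAKE_FILE)
        new_rows = pl.concat([existing, new_rows])
    new_rows.write_parquet(LAKE_FILE)

def last_timestamp() -> int:
    """Checkpoint: highest timestamp stored so far, so we only fetch newer data."""
    if not os.path.exists(LAKE_FILE):
        return 0
    return (
        pl.scan_parquet(LAKE_FILE)
        .select(pl.col("timestamp").max())
        .collect()
        .item()
    )
```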

trentmc commented 8 months ago

Thanks for the thoughts @idiom-bytes .

OK to a small (6) now, for the 2K thing. (Via the 2K github issue.)

Please don't use approach3/data_factory.py for that. It has completely different goals.

(Fyi I have a github issue to move the simulator stuff from approach3 directory to a more general place. That too is outside the scope of 2K work. And I want to do it when I get back because I know exactly what I want, and how to do it. So please don't do it in the meantime. Focus on 2K.)

KatunaNorbert commented 8 months ago

For (6) it's not clear to me: when talking about serving the accuracy analytics data via a server, do you mean using pdr-backend or a different service? If it's a different service, I propose that we use pdr-websocket.

trentmc commented 8 months ago

For (6) it's a different service. Not pdr-backend.

I don't rule out pdr-websocket. I defer to you (Norbert) and Mustafa and Roberto.

idiom-bytes commented 8 months ago

[WRT pdr-websockets] This is just for having a PK that can talk to the contract without exposing it to the client, which has been leading to all sorts of maintenance issues.

[WRT Websockets Forwardlooking] pdr-web + pdr-websockets should be nearly frozen for now. pdr-ws has been nightmarish to support, and a lot of code is getting duplicated/fragmented against pdr-web. Rather than building a pdr-fe-util lib to start addressing some of this problem... I think there is a solution worth a tech spike that would reduce this complexity by an order of magnitude.

How?

  1. Deploy next app as a UI client on vercel.
  2. Deploy the same next app as a pure server/backend w/PK on prod vm.
  3. Kill the websocket and stop supporting another stack; it's now unnecessary given (1) and (2).
  4. (1) reads from (2)

(1) and (2) are deployed in separate environments but share the exact same stack. The PK is not exposed to the client. We leverage more of Next.js's native functionality.

*** I have created Ticket oceanprotocol/pdr-web#283 in pdr-web to represent this

[WRT dApp/Predictoor Analytics (6)] Based on Trent's feedback:

(A) I think leaderboards, epoch summaries, ecosystem metrics, and all sorts of things should be written in Python, in a clean module that is self-contained, atomic, and easy to import.
(B) Rather than querying GQL each time, this system should dump all data from the subgraph and build summaries for everything. This will look like an ETL workflow. Only fetch what's needed, and update the data. Think parquet + dataframes.
(C) As a pdr-trader, I'll want to query this system in addition to trained models that have obtained this data, as a way to understand other users' behaviors, competitiveness across feeds, and which ones are buying, and to have high-level trading agents decide which feeds to use, or which predictoor feeds to submit to.
(D) If desired, in the future, this service could sit in front of a GQL provider.
(E) As an app developer, I can easily query this data through a remote fetch/GET.
(F) As a builder in pdr-backend, I can import this module, run the ETL locally, and query the local cache directly from the app during my epoch updates. Example: copy trading from known predictoors that are incredibly accurate (sketched below).
(G) As an ML engineer in pdr-backend, I can import the module, run the ETL locally, and query the local cache directly to build my dataframes and features with behaviors from predictoors.
(H) If desired, this module could easily be extended with an FE to take all metrics/graphs/etc. and serve them to streamlit/etc.

*** I have created Ticket oceanprotocol/pdr-web#284 in pdr-web to represent this
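To illustrate (F), a hypothetical sketch of querying the local cache for copy-trading candidates (path and column names are assumptions):

```python
# Hypothetical sketch for (F): filter the locally cached gold summary for
# accurate predictoors worth copy-trading. Path and columns are assumptions.
import polars as pl

def copy_trade_candidates(
    path: str = "pdr_backend/data/gold/user_summary.parquet",
    min_accuracy: float = 0.6,
    min_predictions: int = 500,
) -> pl.DataFrame:
    return (
        pl.scan_parquet(path)
        .filter(
            (pl.col("accuracy") >= min_accuracy)
            & (pl.col("n_predictions") >= min_predictions)
        )
        .sort("accuracy", descending=True)
        .collect()
    )
```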

[Final Remarks] pdr-websocket was primarily used to avoid exposing a PK to the client. pdr-web will get bloated and this code will never be re-used if it ends up in there. This doesn't belong in pdr-websocket or pdr-web. Do not write it in JS either.

All of our data science and knowledge is being written in py. I want to be reading directly from the py stack. Please view this as a data problem, not an app problem.

KatunaNorbert commented 7 months ago

Hey @trentmc, I was double checking on 'The spectrum, from simplest first' described above. Looks like you are assigned to the first step; can I start working on the second step? There is already a first cut of the first step available that I could use to move things forward.

trentmc commented 7 months ago

> Hey @trentmc, I was double checking on 'The spectrum, from simplest first' described above. Looks like you are assigned to the first step; can I start working on the second step? There is already a first cut of the first step available that I could use to move things forward.

TBH I'd prefer to handle this myself, and the follow-up steps. I've finally got "Ship Predictoor DF" off my plate, and I intend to go through all these steps ASAP, and quickly. (Written as an EPIC in pdr-backend#278.)

FYI the "FE: backlog" column in DF/VE board has many items that could be covered.

KatunaNorbert commented 7 months ago

Ok, sure, sounds good. I was kind of expecting that you were going to go through this; that's why I wanted to check. Looks like Predictoor stats are a high priority now: since predictoors are now able to make money, showing this to the community should help with incentivising people to onboard.

KatunaNorbert commented 7 months ago

For example, a Predictoor leaderboard on the UI displaying the top x predictoors with their returns and accuracy. I checked your prototype and it is mainly focused on 'how much I make'. We might also want a section about 'how much others make', so users can see that they can make money before they start onboarding. Oh, NVM, I see there is a page for Predictoors where you can see information about other Predictoors.

idiom-bytes commented 7 months ago

Hey Norbert, we get all of this out-of-the-box if we have the data and streamlit set up in a certain way. Let's continue to write down questions + design dashboards, and then figure out the data pipeline + tables we need to serve all of this.


example w/ a bronze->silver->gold pipeline: pdr_backend/data/gold/user_summary.parquet
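A rough sketch of what one such pipeline step could look like with polars, where bronze raw predictions are cleaned (silver) and then aggregated per user into the gold summary table the dashboard reads (paths and column names are assumptions, not an agreed schema):

```python
# Rough sketch only; paths and column names are assumptions, not an agreed schema.
import polars as pl

def build_user_summary(
    bronze: str = "pdr_backend/data/bronze/predictions.parquet",
    gold: str = "pdr_backend/data/gold/user_summary.parquet",
) -> None:
    # silver: cleaned predictions
    silver = pl.scan_parquet(bronze).drop_nulls(["user", "stake", "payout"])

    # gold: one row per (user, timeframe), ready for dashboards
    summary = (
        silver.group_by("user", "timeframe")
        .agg(
            pl.len().alias("n_predictions"),
            pl.col("correct").mean().alias("accuracy"),
            (pl.col("payout") - pl.col("stake")).sum().alias("net_earnings"),
        )
        .collect()
    )
    summary.write_parquet(gold)
```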

On the streamlit side we can add a couple of dropdowns and a text field to serve the result:
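For example, a minimal sketch assuming the gold table above (column names are assumptions):

```python
# Minimal sketch, assuming the hypothetical gold table above exists.
import polars as pl
import streamlit as st

df = pl.read_parquet("pdr_backend/data/gold/user_summary.parquet").to_pandas()

timeframe = st.selectbox("Timeframe", ["5m", "1h"])
metric = st.selectbox("Metric", ["accuracy", "net_earnings"])
user = st.text_input("Predictoor address (optional)")

view = df[df["timeframe"] == timeframe]
if user:
    view = view[view["user"] == user]

st.dataframe(view.sort_values(metric, ascending=False).head(20))
```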

trentmc commented 7 months ago

Architecture: from this Slack msg

Below is a design for the analytics architecture and its relation to pdr-backend, from a discussion among Berkay, Roberto, and myself.

Have three separate repos:

  1. pdr-lake. Grabs on-chain data and CEX data, and stores it as a "data lake" of parquet files. Continuously runs to update in real time. To build the first cut of this, we'd move some of the pdr_backend/data_eng/data_factory.py code here, as well as some/all of pdr_backend/subgraph*.py. Probably also use the "cryo" tool.
  2. pdr-analytics. Grabs data from the data lake by directly querying the parquet files (no REST API), and generates and renders interactive plots in the browser via matplotlib & streamlit.
  3. pdr-backend. Grabs data from the data lake, and supports the flows for simulation, pdr bot, pdr trader.
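A rough illustration of item 2's "directly query the parquet files, no REST API" idea, with a hypothetical lake layout and column names:

```python
# Rough illustration of "query the lake's parquet files directly, no REST API".
# Lake layout and column names are hypothetical.
import polars as pl
import matplotlib.pyplot as plt

LAKE_GLOB = "pdr-lake/parquet/predictions/*.parquet"  # hypothetical layout

def profit_vs_time(user: str) -> pl.DataFrame:
    return (
        pl.scan_parquet(LAKE_GLOB)  # scan_parquet accepts glob patterns
        .filter(pl.col("user") == user)
        .sort("timestamp")
        .with_columns(
            (pl.col("payout") - pl.col("stake")).cum_sum().alias("profit")
        )
        .collect()
    )

def plot_profit(user: str) -> None:
    df = profit_vs_time(user)
    plt.plot(df["timestamp"].to_numpy(), df["profit"].to_numpy())
    plt.xlabel("time")
    plt.ylabel("cumulative profit")
    plt.show()
```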

Usage:

Not near term: Only once the above is stabilized and the analytics fleshed out nicely (2+ mos from now), we can...

Near-term order-of-dev-work:

  1. Finish YAML & CLI in pdr-backend [Trent, w Berkay]
  2. In pdr-backend, refactor csvs -> parquet, and pandas -> polars. [Roberto]
  3. First-cut pdr-lake repo. Move code as appropriate from pdr-backend. Get it all running, including where pdr-backend consumes data from the lake. Just csvs & pandas. Outcome: following pdr-backend READMEs now involves using pdr-lake repo too. [Trent, likely others]
  4. First-cut pdr-analytics repo. First outcome: simple first-cut plots being rendered in the browser. [Roberto, Norbert, Mustafa]
  5. Then, we can iterate on pdr-backend, pdr-lake, and pdr-analytics in parallel. (And any breaking changes to pdr-lake need to get propagated into pdr-backend and pdr-analytics.) [all]

One more thing: keep Mustafa's new service for the accuracy estimation in pdr-backend for now. (Avoid rocking the boat here for now. Revisit when we make pdr-analytics live on predictoor.ai)