oceanprotocol / pdr-backend

Instructions & code to run predictoors, traders, more.
Apache License 2.0

[Super Epic] Analytics for predictoors & traders, to answer "How much $ am I making" then drill in #1328

Open trentmc opened 8 months ago

trentmc commented 8 months ago

Background / motivation

Our core users are predictoors & traders who use our pdr-backend python bots. We want to reduce friction for them.

Even though they operate largely in python-land, there are things we can do in the webapp to help them out.

Top (or near top) of the list is to help them answer the Q: "How much $ am I making / losing". There are many drill-down Q's that emerge from that.

This issue covers both traders & predictoors, because there will be overlap.

TODOs

Key reference: pdr-analytics prototypes (Gslides)

Related

KatunaNorbert commented 8 months ago

Predictoor Q's:

- What was the total available reward for the last x hours / x days?
- How many right/wrong predictions did I have in the last x hours / x days?
- How many tokens did I win/lose on predictions in the last x hours / x days?
- How much am I making compared to other predictoors?
- How much revenue comes in via sales?

Trader Q's:

- How much did I spend on subscriptions in the last week?
- Which assets have the most sales?

trentmc commented 8 months ago

Predictoor Q's:

What I envision for rendering this data:

Slides:

trentmc commented 8 months ago

Update: I've done some first cut pencil-and-paper prototypes in the 2023 10 FE Prototypes GSlides. I'm sharing early to communicate how I'm thinking about this. It's a CAD tool for predictoors and traders! :)

I don't plan to spend more time on this right now. I'll only do so when this overall issue becomes a priority, which may or may not be soon.

idiom-bytes commented 8 months ago

Hi, as we need to implement accuracy calculations for the FE (2000 samples), I recommended that @kdetry start thinking about how to build this:

  1. inside pdr-backend
  2. using python

High level flow of how this might get used:

trentmc commented 8 months ago

When I presented the prototypes this past Thursday, I described how we can evolve from something super-simple to a high-quality webapp.

Here I flesh it out, as practical as possible, pointing to code that exists and that can be evolved.

The spectrum, from simplest first:

  1. Status quo. In pdr_backend "simulation flow" (pdr_backend/predictoor/approach3), locally generate & show matplotlib plots at the end of the sim. Show profit vs. time for traders, predicted vs actual for predictoors, etc. We already have a first-cut of this, and will improve organically. Eg #272 profit vs time for predictoors.
  2. Add realtime. In pdr_backend "simulation flow", locally generate & show matplotlib plots in realtime as the simulation progresses. pdr-backend#279
  3. Put in bot flow. In pdr_backend "predictoor bot flow" & "trader bot flow", for local bots, the bot itself generates & shows matplotlib plots in realtime. pdr-backend#280
  4. Plot from different process. In pdr_backend "bot flows", for local bots, a separate local process grabs chain data then generates & shows matplotlib plots in realtime. The separate process is a new directory pdr_backend/analytics. It uses subgraph query to grab chain data. Bonus: this allows bots to be run remotely too. pdr-backend#281
  5. Put in webapp. In pdr_backend "bot flows", for local or remote bots, a separate local pdr_backend/analytics process grabs chain data, then generates & renders pythonic plots into a webapp via streamlit or dash. The analytics process serves up an API consumed by webapp. pdr-backend#282
  6. Remote analytics service. The analytics process API is run as a remote web service.

We have (1). We can do a "tracer bullet" starting with (2) and going all the way through (6). Then we can continually flesh out plots at the level of (1-2: simulation flow), and as they mature we pull them into (4-6: analytics service).
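For instance, step (2) could be as little as redrawing a figure inside the sim loop. A minimal sketch, not the actual pdr-backend code (`run_one_iteration` is a hypothetical stand-in for the real sim step):

```python
# Hypothetical sketch only -- not the actual pdr-backend simulation code.
import random
import matplotlib.pyplot as plt

def run_one_iteration(i: int) -> float:
    """Stand-in for one simulation step; returns that iteration's profit."""
    return random.uniform(-1.0, 1.0)

def simulate_with_realtime_plot(n_iters: int = 100) -> None:
    plt.ion()  # interactive mode: redraw without blocking the loop
    _, ax = plt.subplots()
    profits, total = [], 0.0
    for i in range(n_iters):
        total += run_one_iteration(i)
        profits.append(total)
        ax.clear()
        ax.plot(profits)
        ax.set_xlabel("iteration")
        ax.set_ylabel("cumulative profit")
        plt.pause(0.01)  # give the GUI event loop time to refresh the figure
    plt.ioff()
    plt.show()

if __name__ == "__main__":
    simulate_with_realtime_plot()
```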

Update: I converted (2-5) to github issues, linked above. And all this work is part of a new issue: "[EPIC] [Simulation, bots] Easy-to-use & powerful simulation --> predictoor/trader bot flow" pdr-backend#278

trentmc commented 8 months ago

@idiom-bytes wrt your comment about what @kdetry could be doing: it's really describing an architecture for step (6) in my comment above -- "remote analytics service".

Rather than directly jumping to (6), it might be wise to go through steps (2)-(5) first, tracer-bullet style. This will ensure that we have a pipeline from rough prototypes (steps 1-2) all the way through to a production remote analytics service (step 6).

Thoughts?

idiom-bytes commented 8 months ago

2000-sample Accuracy

I generally agree that the analytics system could be responsible for helping to execute the data/graphics work across all 6 features. However, since some of these are already working, and we want to onboard others, it might be easier to onboard with small tasks from (6) and then bring other existing workflows (1) over to this module.

Example: (6) right now is really small. It just needs 2k-sample accuracy for the 2 timeframes, so:

  1. leverage approach3/data_factory.py to do all the checkpointing/fetching
  2. pull all prediction results from subgraph
  3. calculate accuracy_5m, accuracy_1h
  4. serve these via flask/gunicorn
  5. serve these via the python module
  6. update pdr-web to fetch the accuracy from the server
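A minimal sketch of steps 3-4, assuming prediction records carry a boolean "correct" field; `fetch_predictions` is a hypothetical stand-in for the real subgraph query:

```python
# Hypothetical sketch: compute 2k-sample accuracy per timeframe and serve it
# over HTTP. fetch_predictions() is a stand-in for the real subgraph query,
# and the "correct" field name is an assumption.
from flask import Flask, jsonify

app = Flask(__name__)

def fetch_predictions(timeframe: str, n: int = 2000) -> list:
    """Stand-in: return the latest n prediction records for a timeframe,
    each a dict like {"correct": True}."""
    return []

def accuracy(timeframe: str) -> float:
    preds = fetch_predictions(timeframe)
    if not preds:
        return 0.0
    return sum(1 for p in preds if p["correct"]) / len(preds)

@app.route("/accuracy")
def get_accuracy():
    return jsonify(accuracy_5m=accuracy("5m"), accuracy_1h=accuracy("1h"))

if __name__ == "__main__":
    # dev server only; in production this would run behind gunicorn
    app.run(port=8000)
```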

How:

There is a script named data_factory.py which does some nice work to maintain a checkpoint of how much it has downloaded so far. I imagine that data_factory, and some of the work that Trent has done so far, would benefit from being abstracted and moved into something more general like /utils/ so other systems can use it.

Inside pdr-backend, you should be able to just import/instantiate/configure a data_factory and start using it.

Server vs. Local

The server:

The bot:

Further updates

I also propose that data_factory gets an update to use polars + parquet to do this. It's incredibly fast, and will enable us to grow.
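To make that concrete, a rough sketch of what a polars + parquet checkpoint could look like (file path and column names are assumptions, not the current data_factory API):

```python
# Rough sketch, with assumed file path and column names (not the current
# data_factory API): append newly fetched rows to parquet, and use the max
# stored timestamp as the checkpoint for the next fetch.
import os
import polars as pl

LAKE_FILE = "pdr_backend/data/predictions.parquet"  # hypothetical path

def append_rows(new_rows: pl.DataFrame) -> None:
    if os.path.exists(LAKE_FILE):
        existing = pl.read_parquet(LAKE_FILE)
        new_rows = pl.concat([existing, new_rows])
    new_rows.write_parquet(LAKE_FILE)

def last_timestamp() -> int:
    """Checkpoint: highest timestamp stored so far, so we only fetch newer data."""
    if not os.path.exists(LAKE_FILE):
        return 0
    return (
        pl.scan_parquet(LAKE_FILE)
        .select(pl.col("timestamp").max())
        .collect()
        .item()
    )
```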

trentmc commented 8 months ago

Thanks for the thoughts @idiom-bytes .

OK to a small (6) now, for the 2K thing. (Via the 2K github issue.)

Please don't use approach3/data_factory.py for that. It has completely different goals.

(Fyi I have a github issue to move the simulator stuff from approach3 directory to a more general place. That too is outside the scope of 2K work. And I want to do it when I get back because I know exactly what I want, and how to do it. So please don't do it in the meantime. Focus on 2K.)

KatunaNorbert commented 8 months ago

For (6) it's not clear to me: when talking about serving the accuracy analytics data via a server, do you mean using pdr-backend or a different service? If it's a different service, I propose that we use pdr-websocket.

trentmc commented 8 months ago

For (6) it's a different service. Not pdr-backend.

I don't rule out pdr-websocket. I defer to you (Norbert) and Mustafa and Roberto.

idiom-bytes commented 8 months ago

[WRT pdr-websockets] This is just for having a PK that can talk to the contract without exposing it to the client, which has been leading to all sorts of maintenance issues.

[WRT Websockets Forwardlooking] pdr-web + pdr-websockets should be nearly frozen for now. pdr-ws has been nightmarish to support, and a lot of code is getting duplicated/fragmented against pdr-web. Rather than building a pdr-fe-util lib to start addressing some of this problem... I think there is a solution worth a tech spike that would reduce this complexity by an order of magnitude.

How?

  1. Deploy next app as a UI client on vercel.
  2. Deploy the same next app as a pure server/backend w/PK on prod vm.
  3. Kill the websocket and stop supporting another stack; it's now unnecessary given (1) and (2).
  4. (1) reads from (2)

(1) and (2) are deployed in separate environments but share the exact same stack. The PK is not exposed to the client. We leverage more of Next.js's native functionality.

*** I have created Ticket oceanprotocol/pdr-web#283 in pdr-web to represent this

[WRT dApp/Predictoor Analytics (6)] Based on Trent's feedback:

(A) I think leaderboards, epoch summaries, ecosystem metrics, and all sorts of things should be written in Python, in a clean module that is self-contained, atomic, and easy to import.
(B) Rather than querying GQL each time, this system should dump all data from the subgraph and build summaries for everything. This will look like an ETL workflow. Only fetch what's needed, and update the data. Think parquet + dataframes.
(C) As a pdr-trader, I'll want to query this system in addition to trained models that have obtained this data, as a way to understand other users' behaviors, competitiveness across feeds, and which ones are buying, and to have high-level trading agents decide which feeds to use, or which predictoor feeds to submit to.
(D) If desired, in the future, this service could sit in front of a GQL provider.
(E) As an app developer, I can easily query this data through a remote fetch/GET.
(F) As a builder in pdr-backend, I can import this module, run the ETL locally, and query the local cache directly from the app during my epoch updates. Example: copy trading from known predictoors that are incredibly accurate (sketched below).
(G) As an ML engineer in pdr-backend, I can import the module, run the ETL locally, and query the local cache directly to build my dataframes and features with behaviors from predictoors.
(H) If desired, this module could easily be extended with an FE to take all metrics/graphs/etc. and serve them to streamlit/etc.

*** I have created Ticket oceanprotocol/pdr-web#284 in pdr-web to represent this
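To illustrate (F), a hypothetical sketch of querying the local cache for copy-trading candidates (path and column names are assumptions):

```python
# Hypothetical sketch for (F): filter the locally cached gold summary for
# accurate predictoors worth copy-trading. Path and columns are assumptions.
import polars as pl

def copy_trade_candidates(
    path: str = "pdr_backend/data/gold/user_summary.parquet",
    min_accuracy: float = 0.6,
    min_predictions: int = 500,
) -> pl.DataFrame:
    return (
        pl.scan_parquet(path)
        .filter(
            (pl.col("accuracy") >= min_accuracy)
            & (pl.col("n_predictions") >= min_predictions)
        )
        .sort("accuracy", descending=True)
        .collect()
    )
```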

[Final Remarks] pdr-websocket was primarily used to avoid exposing a PK to the client. pdr-web will get bloated and this code will never be re-used if it ends up in there. This doesn't belong in pdr-websocket or pdr-web. Do not write it in JS either.

All of our data science and knowledge is being written in py. I want to be reading directly from the py stack. Please view this as a data problem, not an app problem.

KatunaNorbert commented 7 months ago

Hey @trentmc, I was double checking on 'The spectrum, from simplest first' described above. Looks like you are assigned to the first step; can I start working on the second step? There is already a first cut of the first step available that I could use to move things forward.

trentmc commented 7 months ago

> Hey @trentmc, I was double checking on 'The spectrum, from simplest first' described above. Looks like you are assigned to the first step; can I start working on the second step? There is already a first cut of the first step available that I could use to move things forward.

TBH I'd prefer to handle this myself, and the follow-up steps. I've finally got "Ship Predictoor DF" off my plate, and I intend to go through all these steps ASAP, and quickly. (Written as an EPIC in pdr-backend#278.)

FYI the "FE: backlog" column in DF/VE board has many items that could be covered.

KatunaNorbert commented 7 months ago

Ok, sure, sounds good. I was kind of expecting that you were going to go through this; that's why I wanted to check. Looks like Predictoor stats are a high priority now: since predictoors are now able to make money, showing this to the community should help with incentivising people to onboard.

KatunaNorbert commented 7 months ago

For example, a Predictoor leaderboard on the UI displaying the top x predictoors with their returns and accuracy. I checked your prototype and it is mainly focused on 'how much I make'. We might also want a section about 'how much others make', so users can see that they can make money before they start onboarding. Oh, NVM, I see there is a page for Predictoors where you can see information about other Predictoors.

idiom-bytes commented 7 months ago

Hey Norbert, we get all of this out-of-the-box if we have the data and streamlit set up in a certain way. Let's continue to write down questions + design dashboards, and then figure out the data pipeline + tables we need to serve all of this.


example w/ a bronze->silver->gold pipeline: pdr_backend/data/gold/user_summary.parquet
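A rough sketch of what one such pipeline step could look like with polars, where bronze raw predictions are cleaned (silver) and then aggregated per user into the gold summary table the dashboard reads (paths and column names are assumptions, not an agreed schema):

```python
# Rough sketch only; paths and column names are assumptions, not an agreed schema.
import polars as pl

def build_user_summary(
    bronze: str = "pdr_backend/data/bronze/predictions.parquet",
    gold: str = "pdr_backend/data/gold/user_summary.parquet",
) -> None:
    # silver: cleaned predictions
    silver = pl.scan_parquet(bronze).drop_nulls(["user", "stake", "payout"])

    # gold: one row per (user, timeframe), ready for dashboards
    summary = (
        silver.group_by("user", "timeframe")
        .agg(
            pl.len().alias("n_predictions"),
            pl.col("correct").mean().alias("accuracy"),
            (pl.col("payout") - pl.col("stake")).sum().alias("net_earnings"),
        )
        .collect()
    )
    summary.write_parquet(gold)
```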

On the streamlit side we can add a couple of dropdowns and a text field to serve the result:
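For example, a minimal sketch assuming the gold table above (column names are assumptions):

```python
# Minimal sketch, assuming the hypothetical gold table above exists.
import polars as pl
import streamlit as st

df = pl.read_parquet("pdr_backend/data/gold/user_summary.parquet").to_pandas()

timeframe = st.selectbox("Timeframe", ["5m", "1h"])
metric = st.selectbox("Metric", ["accuracy", "net_earnings"])
user = st.text_input("Predictoor address (optional)")

view = df[df["timeframe"] == timeframe]
if user:
    view = view[view["user"] == user]

st.dataframe(view.sort_values(metric, ascending=False).head(20))
```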

trentmc commented 7 months ago

Architecture: from this Slack msg

Below is a design for the analytics architecture and its relation to pdr-backend, from a discussion among Berkay, Roberto, and myself.

Have three separate repos:

  1. pdr-lake. Grabs on-chain data and CEX data, and stores it as a "data lake" of parquet files. Continuously runs to update in real time. To build the first cut of this, we'd move some of the pdr_backend/data_eng/data_factory.py code here, as well as some/all of pdr_backend/subgraph*.py. Probably also use the "cryo" tool.
  2. pdr-analytics. Grabs data from the data lake by directly querying the parquet files (no REST API), and generates and renders interactive plots in the browser via matplotlib & streamlit.
  3. pdr-backend. Grabs data from the data lake, and supports the flows for simulation, pdr bot, pdr trader.
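A rough illustration of item 2's "directly query the parquet files, no REST API" idea, with a hypothetical lake layout and column names:

```python
# Rough illustration of "query the lake's parquet files directly, no REST API".
# Lake layout and column names are hypothetical.
import polars as pl
import matplotlib.pyplot as plt

LAKE_GLOB = "pdr-lake/parquet/predictions/*.parquet"  # hypothetical layout

def profit_vs_time(user: str) -> pl.DataFrame:
    return (
        pl.scan_parquet(LAKE_GLOB)  # scan_parquet accepts glob patterns
        .filter(pl.col("user") == user)
        .sort("timestamp")
        .with_columns(
            (pl.col("payout") - pl.col("stake")).cum_sum().alias("profit")
        )
        .collect()
    )

def plot_profit(user: str) -> None:
    df = profit_vs_time(user)
    plt.plot(df["timestamp"].to_numpy(), df["profit"].to_numpy())
    plt.xlabel("time")
    plt.ylabel("cumulative profit")
    plt.show()
```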

Usage:

Not near term: Only once the above is stabilized and the analytics fleshed out nicely (2+ mos from now), we can...

Near-term order-of-dev-work:

  1. Finish YAML & CLI in pdr-backend [Trent, w Berkay]
  2. In pdr-backend, refactor csvs -> parquet, and pandas -> polars. [Roberto]
  3. First-cut pdr-lake repo. Move code as appropriate from pdr-backend. Get it all running, including where pdr-backend consumes data from the lake. Just csvs & pandas. Outcome: following pdr-backend READMEs now involves using pdr-lake repo too. [Trent, likely others]
  4. First-cut pdr-analytics repo. First outcome: simple first-cut plots being rendered in the browser. [Roberto, Norbert, Mustafa]
  5. Then, we can iterate on pdr-backend, pdr-lake, and pdr-analytics in parallel. (And any breaking changes to pdr-lake need to get propagated into pdr-backend and pdr-analytics.) [all]

One more thing: keep Mustafa's new service for the accuracy estimation in pdr-backend for now. (Avoid rocking the boat here for now. Revisit when we make pdr-analytics live on predictoor.ai)