masa-finance / masa-bittensor

Masa Bittensor Subnet - Decentralized, Fair AI
https://masa.ai
MIT License
14 stars 11 forks

spike: twitter queries for improved fairness and usefulness #282

Closed grantdfoster closed 1 month ago

grantdfoster commented 1 month ago

Description

We currently have a configuration file, config/twitter.json, that defines the list of queries validators ask miners to mine. The list is somewhat "random", and we can improve it. It currently focuses on crypto and web3 related topics, but this can be expanded.

Some queries, like crypto analysis, often don't have many tweets associated with them. That is unfair to miners who are asked to mine such a query, as they can't return as many tweets as they would for other, broader queries. Furthermore, we need to better define what the downstream use case for the twitter data is; currently it just sits on validator hardware.

Ideas

It was proposed that perhaps the list of queries is dynamic, pulled from Twitter itself, perhaps the trending topics section, etc. This would both ensure volume AND usefulness, as we are harvesting data with the most volume / relevance.

Imagine we create an admin app through which we generate / update the list of queries, publish it to a public spot, same format as the current file. Validators pull this instead of the file in the repo to use for synthetic tweets.
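
A minimal sketch of the validator side of this idea, assuming the published list is a JSON array of query strings (the actual format of config/twitter.json is not shown in this thread) and that validators fall back to the file in the repo when the published copy is unreachable. The function names and the URL argument are hypothetical.

```python
import json
import urllib.request


def parse_queries(raw):
    """Parse a JSON array of query strings, dropping blanks and duplicates."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of queries")
    seen, queries = set(), []
    for item in data:
        # Assumption: each entry is a plain string query.
        if isinstance(item, str) and item.strip() and item.strip() not in seen:
            seen.add(item.strip())
            queries.append(item.strip())
    return queries


def load_queries(url, fallback_path="config/twitter.json", timeout=10):
    """Try the published list first; fall back to the file in the repo."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_queries(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # Network errors, bad URLs, and malformed JSON all land here.
        with open(fallback_path, encoding="utf-8") as fh:
            return parse_queries(fh.read())
```

Keeping the repo file as a fallback means a validator still has queries to send if the public location is down.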

Then also, using the protocol API we already created, validators post the raw synthetic tweet responses. We dedupe the tweets themselves and store them, and we keep stats on the miners and validators that returned them (identifying miners and validators by their hotkeys / coldkeys).
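
To make the dedupe-and-stats idea concrete, here is a rough in-memory sketch. It assumes each tweet is a dict with an "id" field and that a response arrives tagged with the validator and miner hotkeys; the class and field names are hypothetical and not part of the existing protocol API.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TweetStore:
    """In-memory stand-in for whatever storage would sit behind the API."""
    tweets: dict = field(default_factory=dict)               # tweet id -> tweet payload
    miner_new: Counter = field(default_factory=Counter)      # miner hotkey -> new tweets contributed
    miner_dupes: Counter = field(default_factory=Counter)    # miner hotkey -> already-seen tweets returned
    validator_posts: Counter = field(default_factory=Counter)  # validator hotkey -> tweets posted

    def ingest(self, validator_hotkey, miner_hotkey, response):
        """Store unseen tweets from one miner response and update the stats."""
        added = 0
        for tweet in response:
            self.validator_posts[validator_hotkey] += 1
            tweet_id = tweet["id"]  # assumed field name
            if tweet_id in self.tweets:
                self.miner_dupes[miner_hotkey] += 1
            else:
                self.tweets[tweet_id] = tweet
                self.miner_new[miner_hotkey] += 1
                added += 1
        return added


store = TweetStore()
store.ingest("validator-A", "miner-1", [{"id": "1"}, {"id": "2"}])
store.ingest("validator-A", "miner-2", [{"id": "2"}, {"id": "3"}])
```

After these two calls the store holds three unique tweets, and miner-2 is credited with one new tweet and one duplicate.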

grantdfoster commented 1 month ago

Quick hotfix for queries has been merged here https://github.com/masa-finance/masa-bittensor/pull/286

hide-on-bush-x commented 1 month ago

What about scraping something like https://trends24.in/, getting the crypto trends with the most volume, and adding them to the query list?

hide-on-bush-x commented 1 month ago

I found it a little hard to find a crypto-specific trends list. It seems like we would need to fetch an entire trending topics list and filter by "is crypto related".

I don't feel that doing a list check on each TT (trending topic) would be a good solution, as a lot could be missed. What about a little LLM that detects crypto topics in a list of TTs? Sounds too complex.
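
For reference, the "list check" approach could look roughly like the sketch below. The keyword set is a made-up placeholder, and the example also shows the weakness mentioned above: a crypto-adjacent topic that contains none of the keywords slips through.

```python
import re

# Hypothetical keyword list; the thread does not define one.
CRYPTO_KEYWORDS = {"bitcoin", "btc", "ethereum", "eth", "crypto", "defi", "nft", "solana"}


def is_crypto_related(topic, keywords=CRYPTO_KEYWORDS):
    """Return True if any whole word in the topic matches a keyword."""
    # Lowercase and split on non-alphanumerics so "#Bitcoin" and "$ETH" match.
    words = re.findall(r"[a-z0-9]+", topic.lower())
    return any(word in keywords for word in words)


def filter_trends(trends):
    """Keep only the trending topics that pass the keyword check."""
    return [t for t in trends if is_crypto_related(t)]


trends = ["#Bitcoin", "Taylor Swift", "$ETH rally", "Vitalik"]
crypto_trends = filter_trends(trends)
```

Here "#Bitcoin" and "$ETH rally" are kept, while "Vitalik" is dropped even though it is arguably crypto related, which is exactly the kind of miss a static list produces.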

hide-on-bush-x commented 1 month ago

Imagine we create an admin app through which we generate / update the list of queries, publish it to a public spot, same format as the current file. Validators pull this instead of the file in the repo to use for synthetic tweets.

If this isn't an automated process, let's say we go and manually update the list, I would not build something new. A simple PR can do the same with the current list, and we don't spend time on new tooling.

grantdfoster commented 1 month ago

I will think more about the "trending topics" today...

As it stands, the queries in our current list typically have a lot of volume (2,000+ tweets a day, 100 within the first hour of the new day). While pulling from a trending list is feasible, it is also the "highest hanging" fruit, and there are other, easier ways to increase volume + miner demand. From easiest to hardest (and highest priority to lowest):

  1. Remove the current queries that tend to not return AS many tweets, e.g. crypto pump, crypto dump... to name a few
  2. Increase the count of tweets asked for in volume checking (currently 100). We've been using this spreadsheet to calculate how much we are "stressing" the protocol node. I would aim for at least 5x capacity, meaning each node has to successfully run 5 credentials to keep up with validator demand.
  3. Similar to above, look at increasing the number of miners called for each volume check (currently at 10). This number works in conjunction with the tweet count.
  4. Can also look at cadence, though I would leave this untouched if possible, as it's currently set to 1/100th of a tempo which is an easy number to work with.
  5. Dynamic queries. If we find that given the increased volume from efforts above, our static queries don't return enough tweets, this then builds the case for implementing dynamic querying, where we specifically look for trending tweets where there is enough volume.
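
A small sketch of steps 1 to 3, assuming we have observed daily tweet counts per query from somewhere (the counts in the example are illustrative placeholders, not measurements). The threshold default and function names are hypothetical; the 100 tweets and 10 miners are the current values quoted above.

```python
def prune_queries(daily_counts, min_daily_tweets=2000):
    """Split queries into (kept, dropped) by observed daily tweet count."""
    kept = [q for q, n in daily_counts.items() if n >= min_daily_tweets]
    dropped = [q for q, n in daily_counts.items() if n < min_daily_tweets]
    return kept, dropped


def tweets_per_check(tweet_count=100, miners_per_check=10):
    """Total tweets requested across all miners in one volume check."""
    return tweet_count * miners_per_check


# Placeholder counts purely for illustration.
observed = {"bitcoin": 5000, "crypto pump": 300, "crypto dump": 250}
kept, dropped = prune_queries(observed)

current_demand = tweets_per_check()                # 100 tweets x 10 miners
scaled_demand = tweets_per_check(tweet_count=500)  # 5x the tweet count
```

With the current settings each volume check asks for 1,000 tweets in total; raising the tweet count to 500 raises that to 5,000 without touching cadence.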

grantdfoster commented 1 month ago

Being discussed in https://github.com/masa-finance/masa-bittensor/issues/295

grantdfoster commented 1 month ago

Moving this to in review as MIP #2 (#295) captures this discussion!