masa-finance / roadmap

The protocol

spike: static data sets: Twitter, Discord, Web, Podcast (Diarization) #38

Closed teslashibe closed 2 months ago

teslashibe commented 2 months ago

Problem

We do not know which static data sets are most valuable. Analyzing the data collected from the sales funnel on Airtable, together with aggregated user feedback, will let us identify dataset opportunities by finding overlapping requests and use cases.

Podcasts are very low-hanging fruit because they are easy to generate.

Acceptance Criteria

The checklist below is based on the problem statement above. It is meant to ensure that the spike adequately addresses the question of which static datasets are valuable, with a focus on podcast data.

Acceptance Criteria Checklist

Data Collection and Aggregation 📊

Use Case Mapping and Impact 🗺️

Opportunity Analysis 💡

Technical Feasibility Assessment 🛠️

Actionable Recommendations 🚀

Additional checklist from Brendan:

Podcasts:

Extract text, diarize, vectorize
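As a rough illustration of the "extract text, diarize, vectorize" steps, here is a minimal sketch. It assumes transcription and speaker diarization have already been done by some external tool that yields speaker-labelled segments; the `Segment` type, `merge_turns`, and `embed` are hypothetical names, and the hashed bag-of-words vector is only a stand-in for whatever embedding model ends up being used.

```python
import hashlib
import math
from dataclasses import dataclass


@dataclass
class Segment:
    """One speaker-labelled span of a transcript (hypothetical schema)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def merge_turns(segments):
    """Merge consecutive segments from the same speaker into single turns."""
    turns = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            turns[-1] = Segment(last.speaker, last.start, seg.end,
                                f"{last.text} {seg.text}")
        else:
            turns.append(seg)
    return turns


def embed(text, dim=64):
    """Toy embedding: hashed bag-of-words, L2-normalized.

    Placeholder only; a real pipeline would call an embedding model here.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def vectorize_turns(segments, dim=64):
    """Return (speaker, text, vector) for each merged speaker turn."""
    return [(t.speaker, t.text, embed(t.text, dim)) for t in merge_turns(segments)]
```

Merging per-speaker turns before embedding keeps each vector tied to one speaker, which is one plausible design; chunking by fixed time windows would be another.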

H34D commented 2 months ago

Spike, needs acceptance criteria

giovaroma commented 2 months ago

Added acceptance criteria.

teslashibe commented 2 months ago

@lacyg4 @giovaroma this is blocked by: https://github.com/masa-finance/roadmap/issues/38

TL;DR: we cannot brainstorm this until we decide which data sets will be available to train models on FLock.io.

giovaroma commented 2 months ago

Loom video outlining the outcome of the spike: https://www.loom.com/share/830b0f0c70624c9da47eaa3e15ac99a3?sid=841dc6b1-954f-4b3b-9e1d-63009521857c

Next steps, @lacyg4 @teslashibe: identify quick wins and low-hanging fruit among the datasets.

giovaroma commented 2 months ago

Prioritized dataset opportunities and identified data sources for each dataset type, by channel. See the image below for details. This list includes

Specific topics the datasets could cover include:

memecoin

bitcoin

solana

NFT

DeFi

Action items: Scrape data and bundle the last 30 days' worth. We can cap it at 5k records to start.
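A minimal sketch of the bundling step, assuming the scraped records are already loaded as dicts with a timezone-aware `created_at` datetime and a `text` field (both field names are hypothetical, as is `bundle_recent`). Whether the 5k cap applies per topic, per channel, or overall is not specified above; this version applies it to the whole bundle.

```python
from datetime import datetime, timedelta, timezone

# Topics listed above, lowercased for case-insensitive matching.
TOPICS = ("memecoin", "bitcoin", "solana", "nft", "defi")


def bundle_recent(records, now=None, days=30, limit=5000, topics=TOPICS):
    """Keep on-topic records from the last `days` days, newest first, capped at `limit`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    matched = [
        r for r in records
        if cutoff <= r["created_at"] <= now
        and any(topic in r["text"].lower() for topic in topics)
    ]
    # Newest first, so the cap drops the oldest records.
    matched.sort(key=lambda r: r["created_at"], reverse=True)
    return matched[:limit]
```

Simple substring matching is used here for brevity; it will miss records that discuss a topic without naming it.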

Image

@mudler @Luka-Loncar @lacyg4

Luka-Loncar commented 1 month ago

Should be picked up by the eng team from this ticket: https://github.com/orgs/masa-finance/projects/14?pane=issue&itemId=71468733