teslashibe commented 2 months ago

Problem

We do not understand or know what static data sets are most valuable. By conducting an analysis across the data collected form the sales funnel on Airtable and aggregating collected user feedback will allow to conduct an analysis of dataset opportunities by analyzing overlapping requests and use cases.

Podcasts are very low hanging fruit because they are easy to generate.

Acceptance Criteria

Certainly! I'll create an acceptance criteria checklist based on the problem statement you've provided. This checklist will help ensure that the solution adequately addresses the issue of identifying valuable static datasets, with a focus on podcast data.

Acceptance Criteria Checklist

Data Collection and Aggregation 📊

[x] Extract and compile sales funnel data from Airtable
[x] Aggregate user feedback from all available sources

Use Case Mapping and impact 🗺️

[x] Categorize identified datasets by potential use cases
[x] Create a matrix linking datasets to multiple applicable use cases
[x] Rank use cases by potential impact and ease of implementation

Opportunity Analysis 💡

[x] Create a prioritized list of dataset opportunities based on demand and feasibility
[x] Identify quick wins and low-hanging fruit among datasets

Technical Feasibility Assessment 🛠️

[ ] Assess feasibility of generating and maintaining each proposed dataset
[x] Identify potential technical challenges or limitations for each dataset
[ ] Evaluate required infrastructure and resources for dataset creation and storage
[x] Determine estimated time and effort for developing each dataset
[ ] Assess scalability of proposed solutions for growing dataset demands
[x] Identify any data privacy or security concerns related to proposed datasets
[x] Gather engineer recommendations for optimal implementation approaches

Actionable Recommendations 🚀

[x] Provide a clear list of recommended datasets to develop, starting with podcast data
[x] Outline potential challenges and resource requirements for each recommended dataset
[x] Suggest a phased approach for dataset development and release

Additional checklist from Brendan:

Podcasts:

Extract text, diarize, vectorize

[ ] Bankless
[ ] Huberman Lab

H34D commented 2 months ago

Spike, needs acceptance criteria

giovaroma commented 2 months ago

Added acceptance criteria.

teslashibe commented 2 months ago

@lacyg4 @giovaroma this is blocked by: https://github.com/masa-finance/roadmap/issues/38

TLDR we cannot brainstorm this until we decide which data sets we will have available to train models on FLock.io

giovaroma commented 2 months ago

Loom video outlining the outcome of the spike : https://www.loom.com/share/830b0f0c70624c9da47eaa3e15ac99a3?sid=841dc6b1-954f-4b3b-9e1d-63009521857c

Next steps @lacyg4 @teslashibe : Identify quick wins and low-hanging fruit among datasets.

giovaroma commented 2 months ago

Prioritized datasets opportunities and identified sources of data to feed the data set types by various channel. Reference the image below to see details. This list includes

Some specific topics to cover on the datasets can be:

memecoin

bitcoin

solana

NFT

DeFi

Action items: Scrape data and bundle by the last 30 days worth of data. We can limit it by 5k records to start.

@mudler @Luka-Loncar @lacyg4

Luka-Loncar commented 1 month ago

Should be picked up by eng team from this ticket https://github.com/orgs/masa-finance/projects/14?pane=issue&itemId=71468733

masa-finance / roadmap

spike: static data sets: Twitter, Discord, Web, Podcast (Diarization) #38

Problem

Acceptance Criteria

Acceptance Criteria Checklist

Podcasts:

memecoin

bitcoin

solana

NFT

DeFi