louislva / OpenActionData

Building a diverse and clean dataset of humans using the web. Open source.
https://open-action-data.vercel.app

Data-engine to identify interesting "sessions" #1

Open louislva opened 1 year ago

louislva commented 1 year ago

Right now, we have some simple code that just records & uploads indiscriminately. For various reasons (storage costs, data cleanliness, user consent), we should aim to only collect sessions that matter.

With a data engine, you're trying to find the sessions that would actually produce a high loss for your model. E.g. 20 million Google searches wouldn't each contribute much loss (because they'd quickly be learned), but a one-off visit to a website we've never seen before, or some really complicated workflow in Figma, might be super high-loss (and hence valuable) for the model.

Problem is, we're not gonna be doing active learning with a model to start with, so we need approximations of the metric I described above ^
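To make the intuition concrete, here is a rough sketch of one possible model-free stand-in for loss: the surprisal of a session's domain under a running frequency count. This is purely illustrative and not something the repo implements; `NoveltyScorer`, `domain_of`, and the choice of the domain as the session key are all assumptions.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lowercase host of a URL (hypothetical session key)."""
    return urlparse(url).netloc.lower()


class NoveltyScorer:
    """Count-based stand-in for model loss: rarely seen domains score high."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.total = 0

    def score(self, url: str) -> float:
        # Surprisal -log p(domain) with add-one smoothing, so a domain
        # we have never seen still gets a finite score.
        domain = domain_of(url)
        vocab = len(self.counts) + 1  # extra bucket for "never seen"
        p = (self.counts[domain] + 1) / (self.total + vocab)
        return -math.log(p)

    def observe(self, url: str) -> None:
        # Record a session so later sessions on this domain score lower.
        self.counts[domain_of(url)] += 1
        self.total += 1


scorer = NoveltyScorer()
for _ in range(1000):
    scorer.observe("https://www.google.com/search?q=example")

common = scorer.score("https://www.google.com/search?q=other")
rare = scorer.score("https://never-seen-before.example/app")
print(f"common={common:.3f} rare={rare:.3f}")
```

A recorder could then upload only sessions whose score clears some threshold (the threshold itself would need tuning). Domain frequency is a crude key: it would miss a complicated workflow on an otherwise common site.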

Some ideas are:

I'm open to more ideas / feedback!