Right now, we have some simple code that just records & uploads indiscriminately. For various reasons (storage costs, data cleanliness, user consent), we should aim to only collect sessions that matter.
With a data engine you're trying to find the sessions that would actually provide meaningful loss (i.e. learning signal) for your model. E.g. 20 million Google Searches wouldn't provide much loss each (they'd quickly be learned), but a one-off usage of a website we've never seen before, or some really complicated workflow in Figma, might be super high-loss (and hence valuable) for the model.
Problem is, we won't be doing active learning with a model to start with, so we need cheap proxies for the loss-based metric described above.
Some ideas are:
URLs / domain names: if we already have a lot of data from domain X, don't collect more from it. We could also weight domains by expected variance. E.g. DuckDuckGo is low-variance, so don't collect that many sessions; Figma is high-variance, so collect more. And so on.
Uniqueness of the URL walk, i.e. how novel the sequence of pages visited within the session is.
Amount of user interaction: mouse movements and keystrokes are ultimately what you have to predict, so the less there is, the less signal there is to train on and the more useless the session. (E.g. for driving datasets, don't sample driving straight, sample turns.) A rough sketch combining these heuristics follows this list.
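To make this concrete, here's a rough sketch of how these three heuristics could combine into a single keep/skip decision. Everything in it is hypothetical: the weights, the 500-event saturation point, the per-domain variance priors, and the function names are placeholders I made up, not how the recorder actually works.

```python
# Rough sketch of the filtering heuristics above. All thresholds, weights, and
# priors are made-up placeholders, not a spec.
from collections import Counter
from urllib.parse import urlparse

# Hypothetical per-domain "expected variance" priors: low-variance sites get
# sampled less, high-variance tools get sampled more.
DOMAIN_VARIANCE_PRIOR = {
    "duckduckgo.com": 0.1,
    "google.com": 0.1,
    "figma.com": 0.9,
}
DEFAULT_VARIANCE = 0.7  # unseen domains are assumed interesting

seen_domain_counts: Counter[str] = Counter()   # how much data we already have per domain
seen_walks: set[tuple[str, ...]] = set()       # URL walks we've already collected

def session_score(urls: list[str], num_mouse_events: int, num_keystrokes: int) -> float:
    """Score a recorded session; higher = more worth uploading."""
    domains = [urlparse(u).netloc for u in urls]

    # 1. Domain heuristic: down-weight domains we already have lots of data from,
    #    up-weight domains with high expected variance.
    domain_scores = []
    for d in domains:
        rarity = 1.0 / (1.0 + seen_domain_counts[d])
        variance = DOMAIN_VARIANCE_PRIOR.get(d, DEFAULT_VARIANCE)
        domain_scores.append(rarity * variance)
    domain_score = sum(domain_scores) / max(len(domain_scores), 1)

    # 2. URL-walk uniqueness: have we already collected this exact sequence of domains?
    walk = tuple(domains)
    walk_score = 0.0 if walk in seen_walks else 1.0

    # 3. Interaction density: sessions with almost no mouse/keyboard input
    #    carry little signal to predict.
    interaction = num_mouse_events + num_keystrokes
    interaction_score = min(interaction / 500.0, 1.0)  # 500 = arbitrary saturation point

    return 0.4 * domain_score + 0.3 * walk_score + 0.3 * interaction_score

def should_upload(urls, num_mouse_events, num_keystrokes, threshold=0.5):
    keep = session_score(urls, num_mouse_events, num_keystrokes) >= threshold
    if keep:  # update the "already seen" state only for sessions we actually keep
        for u in urls:
            seen_domain_counts[urlparse(u).netloc] += 1
        seen_walks.add(tuple(urlparse(u).netloc for u in urls))
    return keep
```

In a real version the seen-domain counts and seen-walk set would presumably live server-side (or be synced down), since individual clients don't know what's already been collected globally.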
I'm open to more ideas / feedback!