Model for predictive analytics

mgechev commented 6 years ago

Current approach

At the moment, Guess.js uses a Markov chain in order to predict the next route the user will navigate to. We build the Markov chain by using a report fetched from Google Analytics (GA) where for each page path, we get the previous page path. The model has several advantages such as:

Doesn't require a large runtime. The prediction can be made with a simple O(1) lookup, so we don't have to ship a lot of code to the browser. On top of that, the matrix can be easily compressed, so even for large apps its size will be reasonable.
Simplicity. We do not have to trace users' complex navigation patterns. Also, the model is quite easy to reason about.
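
To make the description above concrete, here is a minimal sketch of how such a chain could be built and queried. It is not the actual Guess.js code: the `Transition` row shape, `buildChain`, and `predictNext` are hypothetical names, and the sample numbers are made up. Rows are sorted at build time so that the in-browser prediction stays a single lookup.

```typescript
// Hypothetical shape of one GA report row: how often `pagePath`
// was visited right after `previousPagePath`.
interface Transition {
  previousPagePath: string;
  pagePath: string;
  pageviews: number;
}

interface Prediction {
  route: string;
  probability: number;
}

// One pre-sorted row of next-route probabilities per current route.
type MarkovChain = Record<string, Prediction[]>;

function buildChain(report: Transition[]): MarkovChain {
  // Accumulate raw transition counts: from -> (to -> count).
  const counts: Record<string, Record<string, number>> = {};
  for (const { previousPagePath, pagePath, pageviews } of report) {
    if (pageviews <= 0) continue; // ignore empty rows
    const row = counts[previousPagePath] || (counts[previousPagePath] = {});
    row[pagePath] = (row[pagePath] || 0) + pageviews;
  }

  // Normalize each row into probabilities and sort it, most likely first.
  const chain: MarkovChain = {};
  for (const from of Object.keys(counts)) {
    const row = counts[from];
    const total = Object.keys(row).reduce((sum, to) => sum + row[to], 0);
    chain[from] = Object.keys(row)
      .map((to) => ({ route: to, probability: row[to] / total }))
      .sort((a, b) => b.probability - a.probability);
  }
  return chain;
}

// The runtime part: a single property lookup, no model evaluation.
function predictNext(chain: MarkovChain, currentRoute: string): Prediction[] {
  return chain[currentRoute] || [];
}

// Made-up sample data, for illustration only.
const chain = buildChain([
  { previousPagePath: "/", pagePath: "/about", pageviews: 30 },
  { previousPagePath: "/", pagePath: "/pricing", pageviews: 70 },
  { previousPagePath: "/about", pagePath: "/", pageviews: 10 },
]);
```

With this sample data, `predictNext(chain, "/")` ranks `/pricing` ahead of `/about`, and an unknown route yields an empty list.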

This approach has its own cons. We ignore a lot of potentially useful features such as:

More comprehensive navigation patterns including more than two pages visited in a session
navigator.language & navigator.platform
etc.
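
As an illustration of the first point, here is a rough sketch of counting transitions keyed by the last N pages of a session rather than only the previous one. It assumes session sequences are already available as arrays of page paths, which is exactly what the current GA report does not give us; `countTransitions` and `contextKey` are hypothetical names.

```typescript
// context (last `order` pages) -> (next page -> count)
type TransitionCounts = Record<string, Record<string, number>>;

// Serialize a list of pages into an unambiguous lookup key.
function contextKey(pages: string[]): string {
  return JSON.stringify(pages);
}

function countTransitions(sessions: string[][], order: number): TransitionCounts {
  const counts: TransitionCounts = {};
  for (const session of sessions) {
    // Slide over the session; the `order` pages before position i are the context.
    for (let i = order; i < session.length; i++) {
      const key = contextKey(session.slice(i - order, i));
      const next = session[i];
      const row = counts[key] || (counts[key] = {});
      row[next] = (row[next] || 0) + 1;
    }
  }
  return counts;
}

// Made-up sessions, for illustration only.
const sessions = [
  ["/", "/docs", "/api"],
  ["/", "/docs", "/api"],
  ["/blog", "/docs", "/pricing"],
];
const firstOrder = countTransitions(sessions, 1);
const secondOrder = countTransitions(sessions, 2);
```

With `order = 1`, the context `/docs` is split between `/api` and `/pricing`. With `order = 2`, the contexts `/ → /docs` and `/blog → /docs` each point to a single next page, which is the kind of signal the current model cannot see. The cost is that the number of contexts grows quickly with the order.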

Improving accuracy

We're thinking of exploring a more advanced model using neural networks. We've been looking at LSTM using TensorFlow.js. Currently, there are a few unknowns we need to research further, such as:

What'd be the most efficient way to extract longer navigation sequences from GA
What'd be the runtime that we should ship to the browser if we move to a predictive model using neural networks (not applicable to static websites)
How much the build time will increase if we train the model while bundling the application (not applicable to static websites)
How to measure accuracy efficiently without violating users' privacy
Can we ship only part of the model to the browser without loading the entire TensorFlow.js?
What additional features to include in order to improve accuracy
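
On the accuracy question, one possible starting point (an assumption, not something we have implemented) is an offline top-k hit rate computed on held-out transition counts. It only needs aggregated counts of the kind the GA report already provides, not per-user traces. `topKAccuracy` and the data shapes below are hypothetical.

```typescript
// Aggregated, held-out observation: `count` navigations from `from` to `to`.
interface ObservedTransition {
  from: string;
  to: string;
  count: number;
}

// For each route, the model's predicted next routes, most likely first.
type RankedPredictions = Record<string, string[]>;

// Fraction of held-out navigations whose actual target
// appears among the model's top k predictions.
function topKAccuracy(
  predictions: RankedPredictions,
  heldOut: ObservedTransition[],
  k: number
): number {
  let hits = 0;
  let total = 0;
  for (const { from, to, count } of heldOut) {
    const topK = (predictions[from] || []).slice(0, k);
    if (topK.indexOf(to) !== -1) hits += count;
    total += count;
  }
  return total === 0 ? 0 : hits / total;
}

// Made-up data, for illustration only.
const ranked: RankedPredictions = {
  "/": ["/pricing", "/about"],
  "/about": ["/"],
};
const heldOut: ObservedTransition[] = [
  { from: "/", to: "/pricing", count: 6 },
  { from: "/", to: "/about", count: 3 },
  { from: "/about", to: "/contact", count: 1 },
];
```

The same function could score the Markov chain and any neural model side by side, as long as both can produce a ranked list of next routes.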

Additional questions

The problem that we're solving looks quite similar to a recommender system and the path we've taken is collaborative filtering. Is it worth exploring content-based filtering or a mixture of the two?

felicitia commented 6 years ago

The idea is brilliant, but is your predictive model only based on the URLs of the sites? Some papers have pointed out that the performance bottleneck is actually sub-resource loading within a single request, such as images, JS files, etc.

Some works that might be relevant:

"Crom: Faster Web Browsing Using Speculative Execution", 2010
"Why are web browsers slow on smartphones?", 2011
"How far can client-only solutions go for mobile browser speed?", 2012
"Speeding up Web Page Loads with Shandian", 2016
"Polaris: Faster Page Loads Using Fine-grained Dependency Tracking", 2016

mgechev commented 6 years ago

Thanks for sharing all these resources!

Based on the report from Google Analytics, which provides mostly visits & transitions per URL, we create a fine-grained mapping to individual resources by performing static analysis. Our first target is JavaScript, because it's expensive. In later stages we'll expand this to CSS, images, and other assets.
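
Roughly, the idea can be sketched like this. It is a simplified illustration, not the actual implementation: the `RouteBundles` map stands in for the output of the static analysis, and `bundlesToPrefetch`, the bundle names, and the threshold are hypothetical.

```typescript
interface RoutePrediction {
  route: string;
  probability: number;
}

// Assumed output of static analysis: route -> JavaScript bundles it needs.
type RouteBundles = Record<string, string[]>;

// Collect the bundles for every route predicted with at least `threshold`
// probability, keeping each bundle once and preserving prediction order.
function bundlesToPrefetch(
  predictions: RoutePrediction[],
  routeBundles: RouteBundles,
  threshold: number
): string[] {
  const seen: Record<string, boolean> = {};
  const result: string[] = [];
  for (const { route, probability } of predictions) {
    if (probability < threshold) continue;
    for (const bundle of routeBundles[route] || []) {
      if (!seen[bundle]) {
        seen[bundle] = true;
        result.push(bundle);
      }
    }
  }
  return result;
}

// Made-up data, for illustration only.
const toPrefetch = bundlesToPrefetch(
  [
    { route: "/pricing", probability: 0.7 },
    { route: "/about", probability: 0.25 },
    { route: "/contact", probability: 0.05 },
  ],
  {
    "/pricing": ["pricing.js", "shared.js"],
    "/about": ["about.js", "shared.js"],
    "/contact": ["contact.js"],
  },
  0.2
);
```

Here `/contact` falls below the threshold, so its bundle is skipped, and `shared.js` is listed only once even though two routes need it.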
