Generate stack recommendations using a probabilistic approach with Edward

rootAvish commented 6 years ago

User Story

As an OSIO/Fabric8-analytics IDE extension user, I should be able to get companion/outlier insights for my stack via the new approach.

Acceptance Criteria

A working POC with multiple probabilistic approaches implemented (using Edward) and tested. The results/requirements/shortcomings of each for the NPM ecosystem should be documented at the end of this spike.

Description

The current approach uses a Bayesian network (hierarchical Bayesian inference) built on top of pomegranate. It relies on exact inference, and we want to try approximate inference methods in its place. In addition to reimplementing the current model in Edward, we want to leverage the deep learning libraries that Edward is built on (TensorFlow, Keras) to try out a neural-network-based approach that is currently SOTA.
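To make the exact vs. approximate distinction concrete, here is a minimal sketch on a toy network. It is not the pomegranate model and it does not use Edward: the network structure, package names, and probabilities are all made up for illustration, and likelihood weighting stands in for whichever approximate method (variational inference, MCMC) ends up being used.

```python
import random

# Hypothetical toy network: a hidden stack "topic" generates two observed
# packages. All names and probabilities are invented for illustration.
P_TOPIC = {"web": 0.6, "cli": 0.4}
P_PKG_GIVEN_TOPIC = {
    "express": {"web": 0.8, "cli": 0.1},
    "body-parser": {"web": 0.7, "cli": 0.05},
}


def exact_companion_prob(observed, candidate):
    """P(candidate present | observed present), enumerating the hidden topic."""
    joint = 0.0
    evidence = 0.0
    for topic, p_topic in P_TOPIC.items():
        p_obs = p_topic * P_PKG_GIVEN_TOPIC[observed][topic]
        evidence += p_obs
        joint += p_obs * P_PKG_GIVEN_TOPIC[candidate][topic]
    return joint / evidence


def approx_companion_prob(observed, candidate, n_samples=20000, seed=0):
    """Same quantity, estimated by likelihood weighting (a sampling method)."""
    rng = random.Random(seed)
    topics = list(P_TOPIC)
    priors = [P_TOPIC[t] for t in topics]
    weighted_hits = 0.0
    total_weight = 0.0
    for _ in range(n_samples):
        # Sample the hidden topic from its prior.
        topic = rng.choices(topics, weights=priors)[0]
        # Weight the sample by how likely the evidence is under that topic.
        weight = P_PKG_GIVEN_TOPIC[observed][topic]
        present = rng.random() < P_PKG_GIVEN_TOPIC[candidate][topic]
        total_weight += weight
        weighted_hits += weight * present
    return weighted_hits / total_weight
```

With these made-up numbers, `exact_companion_prob("express", "body-parser")` works out to about 0.65, and the sampled estimate approaches it as `n_samples` grows. Enumeration gets expensive quickly as hidden variables are added, which is the motivation for trying approximate methods.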

Task List

Using the existing data available for NPM:

- [ ] Implement the current hierarchical Bayesian inference model on Edward
  - [ ] Implement the model
  - [ ] Check the model for overfitting
  - [ ] Calculate the precision of the predictions
  - [ ] Try to use an approximate inference method instead of the exact inference currently in use
- [ ] Implement the SOTA hierarchical Poisson factorization model using Edward and TensorFlow
  - [ ] Implement the model
  - [ ] Optimize the model hyperparameters
  - [ ] Check the model for overfitting
  - [ ] Calculate the precision of the predictions
- [x] Think along the lines of a Bayesian neural net to display recommendations
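
For the "Calculate the precision of the predictions" items above, one common option is precision@k over held-out packages. The sketch below assumes that metric; the issue does not say which precision definition is intended, and the function name and example packages are hypothetical.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended packages that are actually relevant.

    recommended: ranked list of package names (best first).
    relevant: set of packages held out from the user's real stack.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for pkg in top_k if pkg in relevant)
    return hits / k


# Example with made-up data: two of the three suggestions were held out.
ranked = ["body-parser", "lodash", "morgan"]
held_out = {"body-parser", "morgan"}
score = precision_at_k(ranked, held_out, k=3)
```

Comparing this score on training stacks and on unseen stacks is also a simple way to approach the overfitting checks in the same list.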

Collection of "big data" and re-training of models:

- [x] Collect data for the NPM ecosystem
- [x] Come up with a deep probabilistic modelling approach
- [ ] Train the implemented models using the data collected for the NPM ecosystem
- [x] Document the findings

rootAvish commented 6 years ago

It was clear early in the POC that simply switching to a different library would not scale the approach to large ecosystems such as NPM. As a result, we explored several deep learning approaches and finally settled on two parallel paths for NPM: autoencoder-based approaches and a matrix factorization approach (HPF).
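
For context on the HPF path, here is a rough sketch of the generative process behind hierarchical Poisson factorization, written in plain Python rather than Edward. Treating stacks as "users" and packages as "items" is an assumption about how the model maps onto this problem, and the hyperparameter values are placeholders. Inference, which is the part Edward would handle, is not shown.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) with Knuth's method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_hpf(n_stacks, n_packages, n_factors, seed=0,
               a=0.3, a_prime=2.0, b_prime=1.0,
               c=0.3, c_prime=2.0, d_prime=1.0):
    """Sample latent factors and a stack-by-package count matrix from HPF."""
    rng = random.Random(seed)

    def gamma(shape, rate):
        # random.gammavariate takes (shape, scale), so convert rate to scale.
        return rng.gammavariate(shape, 1.0 / rate)

    # Per-stack activity and per-package popularity (the hierarchical part).
    activity = [gamma(a_prime, a_prime / b_prime) for _ in range(n_stacks)]
    popularity = [gamma(c_prime, c_prime / d_prime) for _ in range(n_packages)]

    # Latent preferences (theta) and attributes (beta), one row per entity.
    theta = [[gamma(a, activity[u]) for _ in range(n_factors)]
             for u in range(n_stacks)]
    beta = [[gamma(c, popularity[i]) for _ in range(n_factors)]
            for i in range(n_packages)]

    # Each observed count is Poisson with rate equal to the dot product.
    counts = [[sample_poisson(sum(t * b for t, b in zip(theta[u], beta[i])), rng)
               for i in range(n_packages)]
              for u in range(n_stacks)]
    return theta, beta, counts
```

A recommender built on this model would rank unseen packages for a stack by the fitted dot product of its `theta` row with each package's `beta` row.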

The findings are documented here: https://docs.google.com/document/d/1LTAJRq60lNs-fDGsZLkLS0r8E8j5zDxK_7zkXJqLRxM/edit

Based on the above document, we'll be creating new issues that outline the steps required to complete this POC, so I'm closing this issue instead of moving it to the new sprint.

rootAvish commented 6 years ago

/cc @sara-02 @krishnapaparaju @sivaavkd
