rootAvish closed this issue 6 years ago
It was clear in the initial stages of the POC itself that simply using a different library would not scale the approach to large ecosystems such as NPM. As a result, we came up with several deep learning approaches to the problem and finally settled on two parallel paths for NPM: autoencoder-based approaches and a matrix factorization approach (HPF).
The findings are documented here: https://docs.google.com/document/d/1LTAJRq60lNs-fDGsZLkLS0r8E8j5zDxK_7zkXJqLRxM/edit
Based on the above document, we'll be creating new issues that outline the steps required to complete this POC, so I'm closing this issue instead of moving it to the new sprint.
/cc @sara-02 @krishnapaparaju @sivaavkd
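For readers unfamiliar with the matrix factorization path mentioned in the closing comment, here is a minimal sketch of the general idea. It is not the HPF model from the linked document: assuming HPF refers to hierarchical Poisson factorization, this sketch drops the hierarchical priors and only fits a plain non-negative factorization under a Poisson likelihood (KL-divergence multiplicative updates). The stack data, the rank `k`, and all function names are made up for illustration.

```python
import math
import random

# Hypothetical toy data: rows are stacks, columns are packages (1 = present).
PACKAGES = ["express", "body-parser", "cors", "react", "react-dom", "redux"]
STACKS = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
]


def matmul(W, H):
    """Plain-Python matrix product of W (n x k) and H (k x m)."""
    return [[sum(w * h for w, h in zip(row, col)) for col in zip(*H)] for row in W]


def poisson_loss(V, W, H, eps=1e-9):
    """Generalized KL divergence D(V || WH), i.e. the Poisson negative
    log-likelihood up to terms that do not depend on W and H."""
    R = matmul(W, H)
    loss = 0.0
    for v_row, r_row in zip(V, R):
        for v, r in zip(v_row, r_row):
            loss += r - v
            if v > 0:
                loss += v * math.log(v / (r + eps))
    return loss


def factorize(V, k=2, iters=200, seed=0, eps=1e-9):
    """Fit V ~ W H with non-negative factors using multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        R = matmul(W, H)
        H = [[H[a][j]
              * sum(W[i][a] * V[i][j] / (R[i][j] + eps) for i in range(n))
              / (sum(W[i][a] for i in range(n)) + eps)
              for j in range(m)] for a in range(k)]
        R = matmul(W, H)
        W = [[W[i][a]
              * sum(H[a][j] * V[i][j] / (R[i][j] + eps) for j in range(m))
              / (sum(H[a][j] for j in range(m)) + eps)
              for a in range(k)] for i in range(n)]
    return W, H


def recommend_companion(V, W, H, stack_index, packages):
    """Return the highest-scoring package that the stack does not already use."""
    scores = matmul([W[stack_index]], H)[0]
    candidates = [j for j, v in enumerate(V[stack_index]) if v == 0]
    best = max(candidates, key=lambda j: scores[j])
    return packages[best]


if __name__ == "__main__":
    W, H = factorize(STACKS)
    print("loss:", round(poisson_loss(STACKS, W, H), 4))
    print("companion for stack 1:", recommend_companion(STACKS, W, H, 1, PACKAGES))
```

The reconstructed scores for packages a stack does not yet contain serve as a rough stand-in for companion suggestions. A real model for NPM would need sparse data structures and a proper inference procedure rather than dense Python lists.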
User Story
As an OSIO/Fabric8-analytics IDE extension user, I should be able to get companion/outlier insights for my stack via the new approach.
Acceptance Criteria
A working POC with multiple probabilistic approaches implemented (using Edward) and tested. The results/requirements/shortcomings of each for the NPM ecosystem should be documented at the end of this spike.
Description
The current approach uses a Bayesian network (hierarchical Bayesian inference) built on top of pomegranate. It works using exact inference, and we want to try out some approximate inference methods in its place. In addition to implementing these with Edward, we want to leverage the deep learning libraries that Edward is built on (TensorFlow, Keras) to try out a neural-network-based approach, which is the current state of the art.
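To make the exact versus approximate distinction concrete, here is a minimal sketch on a two-node network. It uses neither pomegranate nor Edward: the exact answer comes from direct enumeration, and the approximate one from likelihood weighting, a simple Monte Carlo method chosen here only for illustration (Edward's own inference algorithms are not shown). The network, its probabilities, and the function names are all hypothetical.

```python
import random

# Hypothetical toy network Z -> A:
#   Z = 1 means "the stack is a web-server stack" (latent),
#   A = 1 means "a given package is present in the stack" (observed).
P_Z = 0.3                          # prior P(Z = 1), made up
P_A_GIVEN_Z = {1: 0.9, 0: 0.2}     # P(A = 1 | Z), made up


def exact_posterior():
    """P(Z = 1 | A = 1) by enumerating both values of Z (exact inference)."""
    joint_z1 = P_Z * P_A_GIVEN_Z[1]
    joint_z0 = (1 - P_Z) * P_A_GIVEN_Z[0]
    return joint_z1 / (joint_z1 + joint_z0)


def approximate_posterior(num_samples=20000, seed=0):
    """P(Z = 1 | A = 1) by likelihood weighting (approximate inference).

    Z is sampled from its prior, and each sample is weighted by how likely
    the evidence A = 1 is under that sampled value of Z.
    """
    rng = random.Random(seed)
    weighted_z1 = 0.0
    total_weight = 0.0
    for _ in range(num_samples):
        z = 1 if rng.random() < P_Z else 0
        weight = P_A_GIVEN_Z[z]
        total_weight += weight
        weighted_z1 += weight * z
    return weighted_z1 / total_weight


if __name__ == "__main__":
    print("exact:      ", exact_posterior())
    print("approximate:", approximate_posterior())
```

Enumeration is trivial here but grows exponentially with the number of unobserved variables, which is the scaling concern for an ecosystem the size of NPM. Sampling trades that cost for an estimate whose error shrinks as the number of samples grows.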
Task List
Using the existing data available for NPM:
Collection of "big data" and re-training of models: