shahinmv / warehouse-matching

Authentication and login system with a smart search and query engine in the back-end tier. A simple front-end layer for showing suitable warehouse options.

Process for Extracting Useful Knowledge from Different Volumes of Data #7

Open serkosi opened 2 years ago

serkosi commented 2 years ago

I created an issue from my last comment so we can discuss whether it makes sense or not.

Perhaps it would be a good idea to decide on a framework that we can follow while you work on the ML pipeline; then we can identify the differences between the two pipelines (ML for movies and ML for warehouses) through the steps of that framework.

  1. Understand the application domain and the goal of the process.
  2. Create a target dataset as a subset of all the data that is available.
  3. Data cleaning and preprocessing: remove noise, handle missing data and outliers.
  4. Data reduction and projection, in order to focus on the features that are relevant to the problem.
  5. Match the goals of the process to the RBM method.
  6. Decide the purpose of the model, such as summarization or classification.
  7. Machine learning, i.e. run the algorithms on the data.
  8. Interpretation of the learned patterns to make them understandable by the user, such as summarization and visualization.
  9. Acting on the discovered knowledge, such as reporting or making decisions.
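
To make the framework more concrete, here is a minimal, hypothetical sketch of steps 2-4 on a toy ratings matrix; none of these function names exist in the project yet, they are placeholders for discussion only.

```python
import numpy as np

# Hypothetical sketch of framework steps 2-4 on a toy ratings matrix.
# Every function name here is a placeholder for discussion, not existing project code.

def select_target_dataset(raw):             # step 2: subset of all available data
    return raw[:100]                        # e.g. keep only the first 100 users

def clean_and_preprocess(data):             # step 3: handle missing values / noise
    return np.nan_to_num(data, nan=0.0)     # treat unrated items as 0

def reduce_and_project(data):               # step 4: keep only relevant features
    rated_at_least_once = data.sum(axis=0) > 0
    return data[:, rated_at_least_once]     # drop movies that nobody rated

raw = np.random.choice([np.nan, 1.0, 5.0], size=(500, 20))  # toy stand-in for real data
features = reduce_and_project(clean_and_preprocess(select_target_dataset(raw)))
print(features.shape)  # steps 5-9 (RBM choice, training, interpretation) come after this
```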

shahinmv commented 2 years ago

Well noted.

I will prepare a document describing the RBM structure for our application. I will try to make it as clear as possible, covering every step and every function that will need to be implemented for our RBM to work.

shahinmv commented 2 years ago

Step-by-step implementation plan for our RBM model (a rough code sketch of these steps follows at the end of this comment).

  1. Data processing
    • Connect to the PostgreSQL server.
    • Copy the data from the ratings table into an array.
    • Convert the data to a NumPy array; we will use a PyTorch tensor, which requires an array as input.
  2. Data structure creation
    • Create the training set (and also a test set for the MovieLens database) in array format.
    • Each row represents a user, and each cell in the row represents that user's rating for a movie.
    • If the user did not rate a movie, initialize the cell with 0.
    • Convert the sets into tensors.
    • Convert the ratings to binary data.
  3. Model building
    • Create the RBM class with all the parameters.
    • Initialize the class.
    • Hidden node sampling function
      • Samples the hidden nodes given their activation probabilities, computed from the visible nodes.
    • Visible node sampling function
      • Same logic as the hidden node sampling, applied to the visible nodes.
    • Contrastive divergence function
      • The RBM is an energy-based model, which means we need to minimize the energy function; contrastive divergence approximates the gradient required to do this.
  4. Model training
  5. Model testing
    • Test with the MovieLens database.
    • Accumulate the loss for each prediction.
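
A rough, untested sketch of what steps 1-3 could look like in PyTorch is below; the column names, the -1 marker for unrated movies, and the hyperparameters are placeholders for discussion, not the final implementation.

```python
import numpy as np
import torch

# Steps 1-2: turn (user_id, movie_id, rating) rows into a users x movies matrix.
# The rows could come from e.g. a psycopg2 cursor over the ratings table
# (the column names here are assumptions).
def to_matrix(rows, num_users, num_movies):
    data = np.zeros((num_users, num_movies), dtype=np.float32)
    for user_id, movie_id, rating in rows:
        data[user_id - 1, movie_id - 1] = rating
    return torch.from_numpy(data)

# Binary conversion: 0 (unrated) -> -1, rating >= 3 -> 1 (liked), otherwise 0.
def to_binary(t):
    return torch.where(t == 0, torch.full_like(t, -1.0), (t >= 3).float())

# Step 3: Bernoulli RBM with sampling functions and k-step contrastive divergence.
class RBM:
    def __init__(self, n_visible, n_hidden):
        self.W = torch.randn(n_hidden, n_visible) * 0.01  # weights
        self.a = torch.zeros(1, n_hidden)                  # hidden bias
        self.b = torch.zeros(1, n_visible)                 # visible bias

    def sample_h(self, v):
        # p(h = 1 | v) = sigmoid(v W^T + a), then Bernoulli-sample the hidden nodes
        p_h = torch.sigmoid(v @ self.W.t() + self.a)
        return p_h, torch.bernoulli(p_h)

    def sample_v(self, h):
        # Same logic for the visible nodes: p(v = 1 | h) = sigmoid(h W + b)
        p_v = torch.sigmoid(h @ self.W + self.b)
        return p_v, torch.bernoulli(p_v)

    def contrastive_divergence(self, v0, k=10, lr=0.01):
        # k-step CD: start the Gibbs chain at the data v0 and run k sampling steps.
        ph0, _ = self.sample_h(v0)
        vk = v0.clone()
        for _ in range(k):
            _, hk = self.sample_h(vk)
            _, vk = self.sample_v(hk)
            vk[v0 < 0] = v0[v0 < 0]  # keep unrated cells frozen at -1
        phk, _ = self.sample_h(vk)
        # Positive phase minus negative phase: approximate log-likelihood gradient.
        self.W += lr * (ph0.t() @ v0 - phk.t() @ vk)
        self.a += lr * torch.sum(ph0 - phk, dim=0)
        self.b += lr * torch.sum(v0 - vk, dim=0)
```

Steps 4 and 5 would then be a loop over user batches calling contrastive_divergence, and a test pass that accumulates, for example, the mean absolute difference between the reconstructed and the held-out ratings.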
serkosi commented 2 years ago

Which dataset are you going to use? MovieLens 25M Dataset? MovieLens Tag Genome Dataset 2021? Or one of the ones for education and development or older datasets?

The visible node sampling function carries the business-logic characteristics, am I right? That is the function we will modify when it comes to using it for the warehouse domain rather than the movie domain.

shahinmv commented 2 years ago

The datasets I have downloaded for now are MovieLens 100K and MovieLens 1M. I will do the initial testing with these datasets and move up to larger ones if everything goes fine.

Visible and hidden node sampling is for activating those nodes based on a probability, given the previous layer. The probability of the visible layer is the sigmoid activation computed from the hidden nodes: we multiply the hidden nodes by the weights and add the bias to them.
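
For reference, the standard conditionals of a Bernoulli RBM (writing W for the weights, a for the hidden biases and b for the visible biases; the notation is generic, not tied to our code) are

$$
p(h_i = 1 \mid v) = \sigma\Big(a_i + \sum_j W_{ij}\, v_j\Big), \qquad
p(v_j = 1 \mid h) = \sigma\Big(b_j + \sum_i W_{ij}\, h_i\Big),
$$

where $\sigma(x) = 1 / (1 + e^{-x})$ is the sigmoid.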

serkosi commented 2 years ago

Well noted. After you finalise the algorithm pipeline in the DNN project folder and are ready for a testing attempt, I will check it in more detail and we can have further discussions.

shahinmv commented 2 years ago

Our model uses a Bernoulli Restricted Boltzmann Machine, an RBM with binary visible units and binary hidden units. To minimize the energy function during training, I am using k-step contrastive divergence.
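
For completeness, the energy of a Bernoulli RBM with the same notation as above is

$$
E(v, h) = -\sum_j b_j v_j - \sum_i a_i h_i - \sum_{i,j} h_i W_{ij} v_j,
$$

and CD-k approximates the log-likelihood gradient by running only k steps of Gibbs sampling starting from a training example, instead of sampling from the model's equilibrium distribution.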