On how to store the models in the database

pomodoren commented 3 years ago

The first question seems to be:

Firstly, develop a simple classification algorithm which attempts
to predict the variable "promoted" through the other variables. 
The focus of this model is pure prediction capability.

How to do it is described here.

Bonus: Please save the model, the current page,
the coefficients and any relevant statistical measure 
to the SQLite database (on a different table than "data") while you are updating it.

Step by step

[x] pick which model to play with
[x] understand fields that model needs ( create table )
[x] store these fields into SQLite ( problem might be pickling )
[x] load back
[x] connect to #1

pomodoren commented 3 years ago

Which is the best model for prediction?

After running the usual suspects (SGD, ASGD, Perceptron, Passive-Aggressive I and II), we could see that

SGD
ASGD

did better than the others, and were more stable. Screenshot from 2021-05-27 22-33-40

Which model has best timing?

Also, after checking their training and prediction times, we see that they are similar (<100 ms difference). Screenshot from 2021-05-27 22-33-53

So the choice won't matter that much. As the ASGD prediction time (for 1000 instances) is quicker, we will pick that. If we run into any issue, we can switch to standard SGD.

Additionally, let's read quickly about ASGD and SGD, just so as not to be ignorant.
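A rough sketch of how such a comparison could be reproduced. The data here is synthetic (make_classification stands in for the real "promoted" dataset), the Passive-Aggressive variants are left out to keep it short, and the numbers it prints will not match the screenshots.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron, SGDClassifier

# Synthetic stand-in for the real data (assumption, not the project's dataset).
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_train, y_train = X[:2000], y[:2000]
X_test, y_test = X[2000:], y[2000:]  # 1000 held-out instances

candidates = {
    "SGD": SGDClassifier(random_state=0),
    "ASGD": SGDClassifier(average=True, random_state=0),
    "Perceptron": Perceptron(random_state=0),
}

results = {}
for name, model in candidates.items():
    # Time the training step.
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start

    # Time prediction on the 1000 held-out instances.
    start = time.perf_counter()
    predictions = model.predict(X_test)
    predict_seconds = time.perf_counter() - start

    results[name] = {
        "accuracy": float(np.mean(predictions == y_test)),
        "train_seconds": train_seconds,
        "predict_seconds": predict_seconds,
    }

for name, stats in results.items():
    print(name, stats)
```

Timings from a single run are noisy, so repeating the loop a few times and averaging would give a fairer picture.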

pomodoren commented 3 years ago

So SGD defines how the model is fitted: it trains a linear classifier with stochastic gradient descent. Right now, with the default settings, the classifier behind it is a linear SVM. Read more here. On the other hand, ASGD is SGD with average=True.

average: bool or int, default=False
When set to True, computes the averaged SGD weights across all updates and stores the
result in the coef_ attribute. If set to an int greater than 1, averaging will begin once the total 
number of samples seen reaches average. So average=10 will begin averaging after seeing 10 samples.
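
In scikit-learn terms, the difference between the two comes down to one constructor argument. A minimal sketch on synthetic data (the dataset and variable names are only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

sgd = SGDClassifier(random_state=0)                    # plain SGD, loss="hinge" by default
asgd = SGDClassifier(average=True, random_state=0)     # averaged SGD (ASGD)
late_asgd = SGDClassifier(average=10, random_state=0)  # averaging starts after 10 samples

for model in (sgd, asgd, late_asgd):
    model.fit(X, y)

# All three expose the learned weights through coef_.
print(sgd.coef_.shape, asgd.coef_.shape, late_asgd.coef_.shape)
```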

pomodoren commented 3 years ago

The basic database table was set up under #1, an issue that just got solved. The next step is to understand how to save the model in the database.

On another note

IMPORTANT The ingestion script should ingest data in batches and feed it to the model in batches.
Do not just pre-load all the data in advance. This is the "streaming" part of the challenge.

Batch size? I guess we can keep the batch size as a CONFIG value, and then load and train based on that. Remember that the model and the current page are stored together, so maybe it is implied that BATCH_SIZE can depend on the number of documents per page. Still, this does not solve the issue of how we test the model...
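
A small sketch of what config-driven batching could look like. BATCH_SIZE, iter_batches and fake_stream are hypothetical names, and the generator stands in for whatever pages through the real source:

```python
from itertools import islice

BATCH_SIZE = 10  # hypothetical config value


def iter_batches(rows, batch_size=BATCH_SIZE):
    """Yield lists of at most batch_size rows without loading everything first."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Any lazy source works here; this generator fakes 25 incoming records.
fake_stream = ({"id": i, "promoted": i % 2} for i in range(25))
batch_sizes = [len(batch) for batch in iter_batches(fake_stream)]
print(batch_sizes)  # [10, 10, 5]
```

Each batch could then be handed to the model one at a time (for example via partial_fit on an SGDClassifier), so nothing is pre-loaded in advance.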

pomodoren commented 3 years ago

After #6, we have more or less decided on the learning process.

when a batch of 10 is loaded
- if Instance.count() == N
    - train a new model
    - store the model in the PredictionModel table
- elif Instance.count() % N == 0
    - test the existing model
    - store the stats results
    - new model: train with the new N - this will wait for the next input
    - store the new model in the db

This can be a class method, because it does not depend that much on the ingestion-batch.
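
The checks above could be sketched as a class method like the one below. This PredictionModel is a plain-Python stand-in, not the actual SQLAlchemy model, and N = 30 plus the action names are assumptions picked for illustration:

```python
class PredictionModel:
    """Plain-Python stand-in for the PredictionModel table (hypothetical)."""

    N = 30  # instances per training window; an assumed value

    @classmethod
    def next_action(cls, instance_count):
        """Decide what to do once a batch has been ingested."""
        if instance_count == cls.N:
            # The first N instances are in: train and store the first model.
            return "train_first_model"
        if instance_count > cls.N and instance_count % cls.N == 0:
            # Another N instances arrived: test the stored model on them,
            # store the stats, then train and store a new model.
            return "test_then_retrain"
        return "wait"


# Simulate ingesting batches of 10 and record the decision after each one.
decisions = {count: PredictionModel.next_action(count) for count in range(10, 100, 10)}
for count, action in decisions.items():
    print(count, action)
```

Because the decision only looks at the total instance count, it stays the same whatever the ingestion batch size is, as long as N is a multiple of it.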

pomodoren commented 3 years ago

Storing pickled data into SQLite

Bonus: Please save the model, the current page,
the coefficients and any relevant statistical measure 
to the SQLite database (on a different table than "data") while you are updating it.

I am new at this, so I do not have a really specific idea of what needs to be saved, or how we can use it later. Still, after searching around, I found something really interesting: Modellogger. I will check its code and store the model in a similar way.
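
Independently of Modellogger, here is a minimal sketch of the basic idea using only the standard library (sqlite3, pickle, json). The table name, column names and helper functions are assumptions, and a plain dict stands in for the trained estimator:

```python
import json
import pickle
import sqlite3


def create_model_table(conn):
    """Create a table for models, separate from the "data" table."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS prediction_model (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            page INTEGER NOT NULL,
            coefficients TEXT NOT NULL,
            accuracy REAL,
            model BLOB NOT NULL
        )
        """
    )


def save_model(conn, model, page, coefficients, accuracy=None):
    """Pickle the model and store it with the page, coefficients and a metric."""
    cursor = conn.execute(
        "INSERT INTO prediction_model (page, coefficients, accuracy, model) "
        "VALUES (?, ?, ?, ?)",
        (page, json.dumps(coefficients), accuracy, pickle.dumps(model)),
    )
    conn.commit()
    return cursor.lastrowid


def load_latest_model(conn):
    """Return the most recently stored record, or None if the table is empty."""
    row = conn.execute(
        "SELECT page, coefficients, accuracy, model "
        "FROM prediction_model ORDER BY id DESC LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    page, coefficients, accuracy, blob = row
    return {
        "page": page,
        "coefficients": json.loads(coefficients),
        "accuracy": accuracy,
        "model": pickle.loads(blob),
    }


conn = sqlite3.connect(":memory:")
create_model_table(conn)
stand_in_model = {"weights": [0.5, -1.25], "intercept": 0.1}  # any picklable object
save_model(conn, stand_in_model, page=3, coefficients=[0.5, -1.25], accuracy=0.8)
restored = load_latest_model(conn)
print(restored["page"], restored["coefficients"], restored["accuracy"])
```

Note that pickle.loads should only be used on blobs written by the application itself, since unpickling untrusted data can execute arbitrary code.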

Screenshot from 2021-05-28 09-02-42

Integration?

Before letting this go, I might try to force an integration: somehow find a way to use the modellogger script within the SQLAlchemy model structure.

Second thoughts?

(We do not think that it was a bad idea to store these with SQLAlchemy #1 , right?!)

pomodoren commented 3 years ago

Screenshot from 2021-05-28 09-13-46

Screenshot from 2021-05-28 09-15-15

These are additional notes on what to store. Source

pomodoren commented 3 years ago

Use case ( to remember what we were doing ):

[x] Load 1000 cases
[x] Update PredictionModel table with a method to take care of the checks
[x] Create a model, train it, save it with parameters
[x] Load 1000 new cases
[x] Test for these new cases
[x] Create a model, train it, save it with new parameters
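
For reference, the same use case compressed into a sketch with scikit-learn on synthetic data. train_and_pack is a hypothetical helper, and the record dict only mimics what a row in the PredictionModel table might hold:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# 2000 synthetic cases: the first 1000 to train on, the next 1000 arrive later.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
first_X, first_y = X[:1000], y[:1000]
new_X, new_y = X[1000:], y[1000:]


def train_and_pack(features, labels):
    """Train an ASGD model and bundle it with its parameters for storage."""
    model = SGDClassifier(average=True, random_state=0).fit(features, labels)
    return {
        "model": pickle.dumps(model),
        "coefficients": model.coef_.ravel().tolist(),
        "intercept": float(model.intercept_[0]),
    }


# Load 1000 cases, create a model, train it, save it with its parameters.
first_record = train_and_pack(first_X, first_y)

# Load 1000 new cases and test the stored model on them.
stored_model = pickle.loads(first_record["model"])
accuracy_on_new = stored_model.score(new_X, new_y)

# Create a new model on the new cases and save it with the new parameters.
second_record = train_and_pack(new_X, new_y)
print(round(accuracy_on_new, 3), len(second_record["coefficients"]))
```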

pomodoren / qais