Closed · pomodoren closed this issue 3 years ago
After running the usual suspects (SGD, ASGD, Perceptron, Passive Aggressive I and II), we could see that SGD and ASGD did better than the others and were more stable.
Also, after checking their training and prediction times, we see that they are similar (<100 ms difference),
so the choice won't matter that much. As ASGD's prediction time (for 1000 instances) is quicker, we will pick it. If we run into any issues, we can switch back to standard SGD.
Additionally, let's read quickly about ASGD and SGD, just so as not to be ignorant.
So SGD is stochastic gradient descent of the linear classifier; the learning rate defines the speed of the change. Right now it has an SVM classifier in the background (the default hinge loss). Read more here. ASGD, on the other hand, is just SGD with `average=True`:
average: bool or int, default=False
When set to True, computes the averaged SGD weights across all updates and stores the
result in the coef_ attribute. If set to an int greater than 1, averaging will begin once the total
number of samples seen reaches average. So average=10 will begin averaging after seeing 10 samples.
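To see the difference concretely, here is a minimal sketch (on synthetic data, not our actual instances) comparing plain `SGDClassifier` with its averaged variant:

```python
# Minimal sketch comparing plain SGD and averaged SGD (ASGD).
# The dataset here is synthetic; in our case it would be the ingested instances.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Plain SGD: default hinge loss, i.e. a linear SVM trained incrementally.
sgd = SGDClassifier(random_state=42).fit(X_train, y_train)

# ASGD: identical, except coef_ ends up as the average over all updates,
# which tends to be more stable.
asgd = SGDClassifier(average=True, random_state=42).fit(X_train, y_train)

print("SGD accuracy: ", sgd.score(X_test, y_test))
print("ASGD accuracy:", asgd.score(X_test, y_test))
```

Both models expose the same `coef_` / `predict` interface, so swapping one for the other later is a one-line change.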
The basic database table was set up under #1 , an issue that was just solved. The next step is to understand how to save the model in the database.
IMPORTANT The ingestion script should ingest data in batches and feed it to the model in batches.
Do not just pre-load all the data in advance. This is the "streaming" part of the challenge.
Batch size? I guess we can keep the batch size as a CONFIG
value, and then load and train based on that. Remember that the model and the current page
are stored together, so maybe it's implied that BATCH_SIZE can depend on the documents per page. Still, this does not solve the issue of how we test the model...
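A rough sketch of the streaming part, assuming a `BATCH_SIZE` config value; `iter_batches` and the fake `stream` below are placeholder names, not the real ingestion script:

```python
# Sketch of streaming ingestion: read rows in batches of BATCH_SIZE and feed
# them to the model with partial_fit, without pre-loading everything.
from itertools import islice

import numpy as np
from sklearn.linear_model import SGDClassifier

BATCH_SIZE = 10  # kept as a CONFIG value, tunable later

def iter_batches(rows, batch_size=BATCH_SIZE):
    """Yield lists of at most batch_size rows from any iterator."""
    it = iter(rows)
    while batch := list(islice(it, batch_size)):
        yield batch

model = SGDClassifier(average=True)
classes = np.array([0, 1])  # partial_fit needs all classes up front

# Fake stream of (features, label) pairs; the real script would read from
# the ingested documents instead.
stream = ((np.random.rand(4), i % 2) for i in range(35))

for batch in iter_batches(stream):
    X = np.array([features for features, _ in batch])
    y = np.array([label for _, label in batch])
    model.partial_fit(X, y, classes=classes)
```

Because `partial_fit` only ever sees one batch at a time, memory stays flat no matter how large the source is.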
After #6, we have more or less decided on the learning process:
when a batch of 10 is loaded:
- check if Instance.count() == N
  - if yes: train a new model
  - store the model in the PredictionModel table
- elif Instance.count() % N == 0:
  - test the existing model
  - store the stats results
  - train a new model with the new N (this will wait for the next input)
  - store the new model in the db
This can be a class method, because it does not depend that much on the ingestion batch.
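The cadence above could be sketched as a class method roughly like this. `Trainer`, `on_batch_loaded`, and the train/test callables are placeholder names; the real version would query the Instance and PredictionModel tables instead of taking arguments:

```python
# Sketch of the learning cadence as a class method. The count and the
# model store are passed in so the method stays independent of the
# ingestion batch itself.
class Trainer:
    N = 100  # assumed retraining interval, would come from CONFIG

    @classmethod
    def on_batch_loaded(cls, instance_count, model_store, train, test):
        """Decide what to do after a batch of instances is ingested."""
        if instance_count == cls.N:
            # First full window: train the initial model and store it.
            model = train()
            model_store.append(model)
            return "trained-first"
        elif instance_count > cls.N and instance_count % cls.N == 0:
            # Every further window: test the existing model, record the
            # stats, then train and store a fresh model for the next window.
            stats = test(model_store[-1])
            model = train()
            model_store.append((model, stats))
            return "retrained"
        return "waiting"
```

Keeping `train` and `test` as callables means the same cadence logic can be unit-tested without a database at all.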
Bonus: Please save the model, the current page,
the coefficients and any relevant statistical measure
to the SQLite database (on a different table than "data") while you are updating it.
I am new at this, so I do not have a really specific idea of what needs to be saved, or how we can use it later. Still, after searching around, I found something really interesting: Modellogger. I will check its code and store the model in a similar way.
Before letting this go, I might force an integration a bit... somehow find a way to use modellogger's script within the SQLAlchemy model structure.
(We do not think that it was a bad idea to store these with SQLAlchemy #1 , right?!)
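For reference, the core of the modellogger idea boils down to pickling the estimator into a BLOB column. A sketch of the same idea with SQLAlchemy follows; the table and column names here are my assumptions, not the final schema:

```python
# Sketch: store a pickled model plus stats in SQLite via SQLAlchemy,
# on a separate table from "data", in the spirit of modellogger.
import pickle

from sqlalchemy import Column, Float, Integer, LargeBinary, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class PredictionModel(Base):
    __tablename__ = "prediction_model"  # assumed name, separate from "data"
    id = Column(Integer, primary_key=True)
    current_page = Column(Integer)
    accuracy = Column(Float)            # any relevant statistical measure
    pickled_model = Column(LargeBinary) # pickle.dumps() of the estimator

engine = create_engine("sqlite:///:memory:")  # real app: a file path
Base.metadata.create_all(engine)

def save_model(model, page, accuracy):
    with Session(engine) as session:
        session.add(PredictionModel(
            current_page=page,
            accuracy=accuracy,
            pickled_model=pickle.dumps(model),
        ))
        session.commit()

def load_latest_model():
    with Session(engine) as session:
        row = (session.query(PredictionModel)
               .order_by(PredictionModel.id.desc())
               .first())
        return pickle.loads(row.pickled_model), row.current_page, row.accuracy
```

Loading the latest row gives back the model, the page, and the stats in one query, which is exactly what the bonus requirement asks us to persist while updating.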
These are additional notes regarding the process of deciding what to store. Source
Use case ( to remember what we were doing ):
The first question seems to be:
Here is described how to do it
Step by step