Question about the active learning strategy

fighting41love commented 6 years ago

Suppose an initial labelled dataset contains 100 samples (pre-training data), and we have a pool of 1000 unlabelled samples. Active learning picks 10 samples per iteration. Question: pre-training on the 100 samples gives model A. The AL strategy then selects 10 new samples. With the 100+10 samples, does modAL a) keep training model A on the 110 samples, or b) initialize a new model B and train it on the 110 samples? Which is right? In my opinion a) is right; judging from the code, that is what modAL does. Could you please explain the differences between a) and b)? Which one is better? Thanks!

cosmic-cortex commented 6 years ago

In modAL, this all depends on what the .fit() method of the model does. For scikit-learn estimators, AFAIK this retrains the model from scratch. For Keras models, it continues training: backpropagation runs on the given data, starting from the current weights. Note that modAL itself does not create any new model objects; it keeps the original one.
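To make the distinction concrete, here is a minimal sketch in plain Python. Both classes are hypothetical toy models (not scikit-learn, Keras, or modAL code): ScratchModel.fit() discards whatever it learned before, while WarmStartModel.fit() keeps updating its current parameter, which is roughly how gradient-based training behaves.

```python
class ScratchModel:
    """Toy model whose fit() recomputes everything from the data it is given."""

    def __init__(self):
        self.mean = 0.0

    def fit(self, xs):
        # Previous state is overwritten: only the data passed in matters.
        self.mean = sum(xs) / len(xs)
        return self


class WarmStartModel:
    """Toy model whose fit() continues from the current parameter value."""

    def __init__(self, lr=0.5):
        self.mean = 0.0
        self.lr = lr

    def fit(self, xs):
        # One gradient-like step per sample, starting from the old value.
        for x in xs:
            self.mean += self.lr * (x - self.mean)
        return self


# Same object, two consecutive fit() calls on different data.
scratch = ScratchModel().fit([10.0, 10.0])
scratch.fit([0.0])   # forgets the first call entirely

warm = WarmStartModel().fit([10.0, 10.0])
warm.fit([0.0])      # still carries information from the first call

print(scratch.mean, warm.mean)
```

With the from-scratch model, the second call alone determines the result, so it needs to see all labelled data every time. With the warm-started model, earlier data still influences the parameter even though it was not passed again.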

Currently, there are two options in modAL for retraining your model.

1) learner.fit(X_new, y_new, only_new=False) (this is the default). Following your example, this calls the model's .fit() method with all 110 examples. Use this when the model is retrained from scratch (as with scikit-learn models).

2) learner.fit(X_new, y_new, only_new=True). This calls the model's .fit() method with only X_new and y_new. Use this for active learning with neural networks, when you may not want to run backpropagation on all of the known training data, because that might cause the model to overfit.

So, it cannot be stated that one version is better than the other. Each has its own use cases, and they should be used accordingly.
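The two retraining options can be sketched with a small stand-in wrapper. Everything here is hypothetical and only mirrors the behaviour described in this comment: ToyLearner, its update() method, and RecordingEstimator are illustration names, not modAL's implementation, and the sketch assumes the new samples are always added to the known data regardless of the flag. Also note that in recent modAL releases the only_new flag appears to be a parameter of learner.teach() rather than learner.fit(), so it is worth checking the signature in your installed version.

```python
class RecordingEstimator:
    """Toy estimator that only remembers how many samples its last fit() saw."""

    def __init__(self):
        self.last_fit_size = None

    def fit(self, X, y):
        self.last_fit_size = len(X)
        return self


class ToyLearner:
    """Hypothetical stand-in for an active learner wrapper (not modAL code)."""

    def __init__(self, estimator, X_initial, y_initial):
        self.estimator = estimator  # the same object is reused throughout
        self.X_known = list(X_initial)
        self.y_known = list(y_initial)
        self.estimator.fit(self.X_known, self.y_known)

    def update(self, X_new, y_new, only_new=False):
        # Assumption: new labels are always stored with the known data.
        self.X_known.extend(X_new)
        self.y_known.extend(y_new)
        if only_new:
            # Option 2: pass only the freshly labelled samples to fit().
            self.estimator.fit(list(X_new), list(y_new))
        else:
            # Option 1: pass every labelled sample seen so far to fit().
            self.estimator.fit(self.X_known, self.y_known)


# 100 initial samples plus 10 newly queried ones, as in the question.
X_initial = [[float(i)] for i in range(100)]
y_initial = [i % 2 for i in range(100)]
X_new = [[float(i)] for i in range(100, 110)]
y_new = [i % 2 for i in range(100, 110)]

full = ToyLearner(RecordingEstimator(), X_initial, y_initial)
full.update(X_new, y_new, only_new=False)
full_size = full.estimator.last_fit_size

incremental = ToyLearner(RecordingEstimator(), X_initial, y_initial)
incremental.update(X_new, y_new, only_new=True)
incremental_size = incremental.estimator.last_fit_size

print(full_size, incremental_size)
```

In both cases the wrapper holds on to the same estimator object; the only difference is which data reaches the estimator's .fit(), and what that call does with it is up to the estimator.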

fighting41love commented 6 years ago

Got it. Thanks for the detailed explanation.

Question about the active learning strategy #27