jiweiqi / CellBox.jl

CellBox in Julia
MIT License

Issue of early stopping for sparsity #15

Open jiweiqi opened 3 years ago

jiweiqi commented 3 years ago

Starting a thread on a potential issue with early stopping in parameter inference (identifying sparsity).

Currently, the early-stopping rule we implemented stops training once the validation loss reaches a plateau. This is justified for deep learning, where the goal is data fitting: the loss landscape is assumed to be flat near (sub-)global minima, so further training is unnecessary once the iterate has fallen into such a good valley.

Here, we additionally want a sparse model. But by definition, whether the model is sparse has little effect on the data fit. Therefore, early stopping is very likely to halt training while the model is still far from sparse.

Instead, we should train the model for a much longer time.
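
To make the rule concrete, here is a minimal sketch of plateau-based early stopping in Julia (hypothetical helper names, not the exact CellBox.jl implementation; `patience` and `tol` play the role of `n_iter_tol` and `convergence_tol` in the config below):

# Sketch: stop once the best validation loss has not improved by more
# than `tol` for `patience` consecutive iterations.
function train_with_plateau_stop!(step!, val_loss;
                                  n_iter_max = 100_000,
                                  patience = 10_000,  # cf. n_iter_tol
                                  tol = 1e-8)         # cf. convergence_tol
    best_loss, best_iter = Inf, 0
    for iter in 1:n_iter_max
        step!()            # one optimization step (caller-supplied)
        loss = val_loss()  # current validation loss (caller-supplied)
        if loss < best_loss - tol
            best_loss, best_iter = loss, iter
        elseif iter - best_iter > patience
            return iter    # plateau reached: early stop
        end
    end
    return n_iter_max
end

For sparsity, the argument above suggests either increasing `patience` substantially or adding a sparsity criterion to the stopping test.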

Other tips I am not sure about:

[screenshot]

[screenshot]

For reference, the config:

network: "beeline_networks/Synthetic_LI.csv"
ns: 7

tfinal: 20.0
ntotal: 20
batch_size: 16
epoch_size: -1

lr: 5.e-3
weight_decay: 1.e-5

n_mu: 3

n_exp_train: 5
n_exp_val: 5
n_exp_test: 5
noise: 0.01

n_iter_max: 100000
n_plot: 20 # frequency of callback

n_iter_buffer: 50
n_iter_burnin: 100
n_iter_tol: 10000
convergence_tol: 1e-8

drop_range:
   lb: -0.1
   ub: 0.1
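
A config in this form can be read in Julia with the YAML.jl package (an assumption for illustration; the repo may load it differently, and the filename is hypothetical):

using YAML  # third-party package: https://github.com/JuliaData/YAML.jl

config = YAML.load_file("config.yaml")  # Dict of the fields above
lr = config["lr"]                       # e.g. 5e-3
lb = config["drop_range"]["lb"]         # -0.1
ub = config["drop_range"]["ub"]         #  0.1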
jiweiqi commented 3 years ago

Alternatively, we could use weight pruning to encourage sparsity, although this assumes that all of the important w_ij have large absolute values. Example at https://github.com/DENG-MIT/CRNN/blob/a94b20604fce305a55854a9e34c45fa2b28de8a8/case1/case1_hardthreshhold.jl#L76
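
In the spirit of the linked example, hard-threshold pruning is nearly a one-liner; this is a sketch rather than the linked code verbatim, with `w` standing in for the learned interaction matrix:

# Zero out every weight whose magnitude is below `thresh`.
function hard_threshold!(w::AbstractMatrix, thresh::Real)
    w[abs.(w) .< thresh] .= 0
    return w
end

hard_threshold!(randn(7, 7), 0.1)  # toy call; 0.1 matches drop_range ±0.1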

DesmondYuan commented 3 years ago

All great!

I think pruning is a good idea. Does Julia have a pruning function for training similar to those in PyTorch/Keras?

jiweiqi commented 3 years ago

I don't think there is one in Julia; normally I do it manually. In fact, I don't retrain the model after pruning, since I use a very tight threshold and the performance is almost unchanged after pruning.
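
A quick sanity check for that workflow (a sketch; `loss` is a hypothetical closure over the data):

# Prune with a tight threshold and confirm the loss barely moves.
function check_pruning(loss, w; thresh = 1e-3)
    loss_before = loss(w)
    w_pruned = copy(w)
    w_pruned[abs.(w_pruned) .< thresh] .= 0
    @info "pruning check" loss_before loss_after = loss(w_pruned)
    return w_pruned
end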

DesmondYuan commented 3 years ago

Another observation is that on smaller datasets, too small a learning rate tends to result in overfitting to a local optimum.