jalhackl / introunet


when to stop #32

Closed xin-huang closed 6 months ago

xin-huang commented 8 months ago

Hello @jalhackl

On my desktop, the training for 1M samples has taken more than 2 days. The log file shows that 25 epochs have been completed, but the training is still continuing.

However, I found there is an n_early argument in

https://github.com/jalhackl/introunet/blob/d7972259749353243ac53da03263d9635d4ce326/intronets_train_replication.smk#L19

Did this argument work?

For 100k samples, the training stopped after 21 epochs on VSC-5.

Also, I noticed that the improvement in accuracy is small after 5 epochs. Did you test how different numbers of epochs affect the performance? This is a hyperparameter optimization question; we should think of a plan to explore it (see the sketch below).
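
As a starting point for such a plan, here is a minimal sketch of a grid search over a few training hyperparameters. The parameter names and the train_and_validate helper are hypothetical placeholders, not part of the actual introunet code:

```python
import itertools
import random

def train_and_validate(n_epochs, batch_size, learning_rate):
    """Hypothetical stand-in for the real training call; returns a validation loss."""
    # Replace this with a call to the actual training script.
    return random.random()

grid = {
    "n_epochs": [5, 10, 20, 30],
    "batch_size": [32, 64],
    "learning_rate": [1e-3, 1e-4],
}

results = []
for values in itertools.product(*grid.values()):
    settings = dict(zip(grid.keys(), values))
    val_loss = train_and_validate(**settings)
    results.append((val_loss, settings))

# Rank configurations by validation loss (lower is better).
for val_loss, settings in sorted(results, key=lambda r: r[0]):
    print(val_loss, settings)
```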

jalhackl commented 8 months ago

It is strange that it takes so long on your desktop. On the LISC, training the standard introunet takes approx. 15 hours. Given your GPU, I thought it would be faster on your desktop.

n_early is set to 10 by default, so after 10 epochs without a significant improvement (i.e. decrease) in the validation loss, training stops. This works; for the standard introunet, it should stop after around 30 epochs.
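
For reference, a minimal sketch of this kind of patience-based early stopping (illustrative only, not the actual training script; model, train_one_epoch, and validate are assumed callables):

```python
def train_with_early_stopping(model, train_one_epoch, validate, n_early=10, max_epochs=100):
    """Stop when the validation loss has not improved for n_early consecutive epochs."""
    best_val_loss = float("inf")
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = validate(model)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_without_improvement = 0  # reset the patience counter
        else:
            epochs_without_improvement += 1

        if epochs_without_improvement >= n_early:
            print(f"Stopping after epoch {epoch}: no improvement for {n_early} epochs")
            break
```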

Due to our highly imbalanced data, the change in accuracy during training is not really reliable, so one should perhaps only look at the validation loss. (For the same reason we use precision-recall curves in evaluation: a model with lower accuracy can in fact be 'better' because it produces some false positives but at least also detects introgression, whereas a model with high accuracy may simply predict 0 / not-introgressed for all items.)
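
To illustrate why a precision-recall curve is more informative than accuracy here, a small sketch with scikit-learn (the toy labels and probabilities are made up for the example):

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, precision_recall_curve

# y_true: 0/1 ground-truth introgression labels, y_prob: predicted probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])   # heavily imbalanced toy labels
y_prob = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.6, 0.7])

# A trivial "always 0" classifier already reaches 80% accuracy but detects nothing.
print("accuracy of all-zero predictions:", accuracy_score(y_true, np.zeros_like(y_true)))

# Precision-recall focuses on the rare positive class instead.
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
print("average precision:", average_precision_score(y_true, y_prob))
```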

That said, the small changes in accuracy and validation loss probably indicate that the model does not improve much after the first few epochs.

xin-huang commented 8 months ago

I forgot to mention that I am using 192 polymorphisms with batch size = 64. What are your settings in the 15-hour case?

jalhackl commented 8 months ago

> I forgot to mention that I am using 192 polymorphisms with batch size = 64. What are your settings in the 15-hour case?

The 15 hours are for 128 polymorphisms and batch size = 32 (the parameters from the introunet paper). Indeed, it looks like training takes much longer with 192 polymorphisms.

xin-huang commented 8 months ago

Finally, it stopped. It took 244112.25484466553 seconds (about 68 hours, i.e. roughly 2.8 days).