smousavi05 / EQTransformer

EQTransformer, a Python package for earthquake signal detection and phase picking using AI.
https://rebrand.ly/EQT-documentations

question about loss_weights #72

Closed: filefolder closed this 3 years ago

filefolder commented 3 years ago

Hi,

I notice that the default [detection, P, S] loss_weights in the trainer are [0.05, 0.40, 0.55], but in the predictor/picker the defaults are [0.03, 0.40, 0.58].
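For reference, here is roughly where the two sets of defaults appear in the API, as I understand it from the docs (a sketch only; the file paths are hypothetical and other keyword arguments are omitted):

```python
from EQTransformer.core.trainer import trainer
from EQTransformer.core.predictor import predictor

# training: loss_weights ordered as [detection, P, S], default [0.05, 0.40, 0.55]
trainer(input_hdf5='data/example.hdf5',   # hypothetical paths
        input_csv='data/example.csv',
        output_name='my_model',
        loss_weights=[0.05, 0.40, 0.55])

# prediction/picking: same ordering, but the default is [0.03, 0.40, 0.58]
predictor(input_dir='downloads_processed',
          input_model='my_model_outputs/final_model.h5',
          output_dir='detections',
          loss_weights=[0.03, 0.40, 0.58])
```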

Forgive me if this was covered somewhere in the paper, but should the loss_weights in the picker be set to the same values as the weights used to train a new model? Or should the picker loss_weights be adjusted based on the resulting F-scores of the new model? Just eyeballing, but the weights look roughly like (1 - F1) for detection, P, and S respectively; is there a metric along those lines that is generally advised?

Any advice appreciated!

smousavi05 commented 3 years ago

@filefolder yes, the weights for prediction need to be similar to those used in building the model. The difference here is probably due to the different models that I have built. These weights define how the loss terms for the three tasks were weighted and summed into the overall loss used for training.

filefolder commented 3 years ago

Thanks for your reply,

I am still trying to learn how to define these weights in a broader ML sense, but more specifically it would still be helpful to know whether the trainer's loss_weights were based on particular information (e.g. the statistics of the model it was able to produce) or if there is something else I am missing. I appreciate your patience in elaborating to a novice!

smousavi05 commented 3 years ago

no problem @filefolder. As I mentioned before, these weights are different from those you might see in conventional deep learning models (which are mainly the optimizer's weights). Here we have three separate loss functions (one for detection and one for each picker). So this is a multi-task network, and three different loss functions associated with different tasks are minimized simultaneously. The weighting basically defines the share of each of these individual losses in the overall optimization process.

I defined their values based on the inherent difficulty of each task. For instance, the detection labels are longer (relatively more sample points of the waveform, usually several hundred to a few thousand points, are labeled 1 for earthquake vs. 0 for noise, whereas phase picking amounts to identifying about 40 sample points, out of the 6000-point input, labeled as P or S arrivals), and as a result detection is easier to learn. P and S phase picking are more difficult tasks to learn than detection, and S picking is more difficult than P picking (because the S arrival is contaminated by the P coda).

The more difficult a task is to learn, the longer it may take to train the network. So if I assigned equal weights to all the losses, training would plateau soon after the detection loss reached its minimum. By assigning larger weights to P and S picking instead, training continues for longer and prioritizes reducing the losses of the harder tasks: S picking, then P picking, then detection. I hope this helps.
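In Keras terms, the pattern looks roughly like this (a minimal sketch of weighted multi-task compilation, not the actual EQTransformer model code; the layer sizes and output names are placeholders):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Toy stand-in for a multi-task network: one shared encoder and
# three output heads (detection, P picking, S picking).
inp = layers.Input(shape=(6000, 3))  # 6000-sample, 3-component waveform
shared = layers.Conv1D(8, 11, padding='same', activation='relu')(inp)
d_out = layers.Conv1D(1, 11, padding='same', activation='sigmoid', name='detector')(shared)
p_out = layers.Conv1D(1, 11, padding='same', activation='sigmoid', name='picker_P')(shared)
s_out = layers.Conv1D(1, 11, padding='same', activation='sigmoid', name='picker_S')(shared)
model = keras.Model(inp, [d_out, p_out, s_out])

# Each weight sets that task's share of the total loss:
#   total_loss = 0.05 * L_detection + 0.40 * L_P + 0.55 * L_S
# so training keeps pushing on the harder picking tasks even after
# the (easier) detection loss has bottomed out.
model.compile(optimizer='adam',
              loss=['binary_crossentropy'] * 3,
              loss_weights=[0.05, 0.40, 0.55])
```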

filefolder commented 3 years ago

It does help, thank you.

So if I may circle back to my initial post, assuming that the F1 scores I can achieve with my trained model are: Detection F1: 0.98 / P F1: 0.61 / S F1: 0.48

Would it be appropriate to re-train the model with loss_weights adjusted in proportion to these, in a scheme similar to how one might devise least-squares weights?

Continuing the example, if we define "difficulty" as 1 - F1,

Detection difficulty: 0.02, P: 0.39, S: 0.52 (sum 0.93)

And then re-define the (normalized) loss_weights as: [0.02/0.93, 0.39/0.93, 0.52/0.93] = [0.022, 0.419, 0.559]
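Or, in code (a trivial sketch of the normalization I mean):

```python
import numpy as np

f1 = {'detection': 0.98, 'P': 0.61, 'S': 0.48}  # scores from my trained model

# difficulty = 1 - F1, then normalize so the weights sum to 1
difficulty = np.array([1 - f1['detection'], 1 - f1['P'], 1 - f1['S']])
loss_weights = difficulty / difficulty.sum()

print(loss_weights.round(3))  # -> [0.022 0.419 0.559]
```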

Does this seem sensible? At that point one could re-train the model with the updated loss_weights in an iterative sense, although I suspect the benefit of doing so may be marginal.

As always I appreciate your willingness to answer all these questions

smousavi05 commented 3 years ago

Yes, that is in line with the logic I explained, and it is sensible. However, you should note that iterative retraining with negligible changes in the weights won't change the overall model that much. The whole weighting scheme has a limited effect, and there are other factors playing a role in the final F1 score. The values you are getting now (detection F1: 0.98 / P F1: 0.61 / S F1: 0.48) are pretty good, and I don't think they can be improved much; all the results I presented in the paper are based on a model with similar F1 scores. Rather than these numbers, which are based on the validation set (and heavily affected by how similar the distributions of the augmented training set and the non-augmented validation set are), it would be better to focus on the performance on your data/region of interest and try to improve that through retraining or fine-tuning.

filefolder commented 3 years ago

Cheers, thanks again, very much appreciate it. Closing.