mp2893 / med2vec

Repository for Med2Vec project
BSD 3-Clause "New" or "Revised" License

Epochs and loss during training #20

Open wywhu opened 4 years ago

wywhu commented 4 years ago

Hi Ed,

I am training embeddings using your default hyperparameters, except window_size. The minimum number of visits in my dataset is 2, but I set window_size=3, as I suppose your code can handle the inconsistency between window_size and the actual sequence length. Am I right?

I also noticed that the mean_cost was at its minimum at the 2nd epoch and then started increasing. Although I read in your paper that the number of epochs does not hurt the code representations very much, I am not sure which epoch I should choose after training finishes. Should I use the one with the minimum cost, or the one from the last epoch?

mp2893 commented 4 years ago

Hi wywhu,

If you look at the code, masks are created during the training phase, so the mismatch between window_size and actual sequence length shouldn't be a problem. However, I wrote this code 4 years ago, so this is just speculation.
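To illustrate the idea (this is a simplified sketch, not the original Theano code): a mask zeroes out context positions that fall outside a patient's actual visit sequence, so a window_size larger than the number of visits simply contributes nothing for the missing neighbors. The offsets, losses, and variable names below are all made up for illustration.

```python
import numpy as np

num_visits = 2     # shortest patient sequence in the question
window_size = 3    # larger than the sequence allows

# For a center visit at index 0, consider neighbors at offsets 1..window_size.
center = 0
offsets = np.arange(1, window_size + 1)

# The mask is True only where the neighbor index actually exists.
mask = (center + offsets) < num_visits        # [True, False, False]

per_offset_loss = np.array([0.7, 0.9, 1.1])   # hypothetical per-neighbor losses
masked_loss = per_offset_loss * mask          # out-of-range terms become 0

print(masked_loss.sum())  # only the in-range neighbor contributes: 0.7
```

Under this scheme the mismatch is harmless: out-of-range positions are multiplied by zero before the losses are summed.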

There is no fixed answer as to what number of epochs works best, as your dataset is different from what I had used. You can try to separate the cost into visit_cost and emb_cost (see line 133 of the source code), see how they behave, then select the epoch you like. This of course involves some coding.
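Once the two components are logged per epoch, checkpoint selection is straightforward. A minimal sketch, with entirely made-up numbers standing in for values you would record from the training loop:

```python
# Hypothetical per-epoch log, produced by splitting mean_cost into its
# visit-level and code-level components inside the training loop.
epoch_log = [
    {"epoch": 0, "visit_cost": 1.90, "emb_cost": 0.80},
    {"epoch": 1, "visit_cost": 1.45, "emb_cost": 0.62},
    {"epoch": 2, "visit_cost": 1.38, "emb_cost": 0.70},
    {"epoch": 3, "visit_cost": 1.52, "emb_cost": 0.91},
]

# If the goal is good visit-level prediction, pick the epoch that
# minimizes visit_cost rather than the combined mean_cost.
best = min(epoch_log, key=lambda row: row["visit_cost"])
print(best["epoch"])  # → 2
```

The same pattern works with emb_cost (or a weighted combination) if the code representations themselves are what matter for your downstream task.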

Hope this helps, Ed

wywhu commented 4 years ago

Thanks Ed. I have another question about interpreting the code representations.

In your paper, it says that "we trained ReLU(W_c), a non-negative matrix, to represent the meaning of .......", and "we can find the top k code that have the largest values for the i-th coordinates by argsort(W_c[i, :])[1, k]".

I am confused, should I look at W_c or ReLU(W_c) in the argsort operation?

mp2893 commented 4 years ago

Actually, you are correct. You should look at ReLU(W_c) in the argsort operation, which guarantees non-negativity. However, since all medical codes are trained in the non-negative space, I don't think the results would be too different. But technically you should use ReLU(W_c). Thanks for pointing it out!
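For anyone landing here later, the operation being discussed can be sketched as follows. The toy W_c below is made up (the real one comes from the trained model), and the helper name is mine; the point is that ReLU is applied before the argsort, so negative entries are clamped to zero before ranking.

```python
import numpy as np

# Toy (num_coordinates x num_codes) weight matrix standing in for the
# trained W_c; real dimensions would be much larger.
W_c = np.array([
    [ 0.9, -0.4,  0.1,  0.7],
    [-0.2,  0.5,  0.3, -0.8],
])

relu_Wc = np.maximum(W_c, 0.0)  # ReLU(W_c): guarantees non-negativity

def top_k_codes(i, k):
    """Indices of the k codes with the largest values on coordinate i."""
    return np.argsort(relu_Wc[i, :])[::-1][:k]

print(top_k_codes(0, 2))  # codes 0 and 3 dominate coordinate 0
```

As Ed notes, because the codes are trained in the non-negative space, ranking on W_c versus ReLU(W_c) usually differs only in how the (clamped) negative entries order among themselves near the bottom of the list.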