Open michaelchughes opened 7 years ago
First, thanks for making this implementation open source! Your hard work is appreciated!
I've been trying the BP_sLDA code on a few small-size topic modeling problems, just to get a sense of its behavior relative to a few other baselines. I'm finding that the updates may not be numerically stable: frequently, after ~50 updates or so, the learned topic-word parameters become all NaN.
Dataset stats:
- 1000 docs
- 144 vocab words
Model stats:
- 10 topics (--nHid)
Alg stats:
- batch_size 1000
- phi learning rate = 0.05 (mu_Phi)
- u learning rate = 0.05 (mu_U)
I've tried several values for the number of hidden layers (--nHidLayer of 10, 25, and 100). NaNs generally seem more likely with more hidden layers: with 100 hidden layers I get NaNs every time, while with fewer I get them less often (though still occasionally).
I'm digging into which updates exactly might cause the NaN values, but no answers yet. Any ideas?
Thanks, Mike, I will look into this issue. Can you first check the debug information and see whether the gradient is too large? That might cause NaN, in which case you will need to decrease your learning rate, but I will look into this further in any case.
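For reference, a minimal sketch (Python, hypothetical names; not taken from the BP_sLDA code) of the kind of check being suggested here: inspect the largest gradient entry and shrink the step when it is too large, rather than letting the update diverge.

```python
import numpy as np

def safe_phi_update(phi, grad_phi, lr, max_grad=100.0):
    """Hypothetical helper: scale the step down when the gradient is too large.
    Illustrative only; this is not how BP_sLDA actually applies its updates."""
    g_max = np.max(np.abs(grad_phi))                 # the quantity to watch in the debug output
    if not np.isfinite(g_max):
        raise FloatingPointError("gradient already contains NaN/Inf")
    scale = min(1.0, max_grad / max(g_max, 1e-12))   # equivalent to temporarily lowering the learning rate
    return phi - lr * scale * grad_phi
```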
Hi Mike,
Also, could you let us know your values for alpha and beta? If you set alpha < 1 with a large T, that could be another possible cause.
I've set alpha = 1.0 and beta = 1.0. Here are some debug traces, where I think "G_Phi" is the printed value of the expression Grad.grad_Q_Phi.MaxAbsValue():
############## Epoch #1. BatchSize: 1000 Learning Rate: Phi:0.01, U:0.01
Ep#1/1000 Bat#1/1. Loss=6.034. TrErr=0.000%. Speed=115 Samples/Sec.
muPhiMax=0.0001634134
muPhiMin=1.898668E-05
AvgnHidLayerEff=100.0. G_Phi=967.724. G_U=0.013
############## Epoch #2. BatchSize: 1000 Learning Rate: Phi:0.01, U:0.01
Ep#2/1000 Bat#1/1. Loss=4.396. TrErr=0.000%. Speed=109 Samples/Sec.
muPhiMax=0.0001407533
muPhiMin=2.436155E-05
AvgnHidLayerEff=100.0. G_Phi=458.980. G_U=0.005
############## Epoch #4. BatchSize: 1000 Learning Rate: Phi:0.01, U:0.01
Ep#4/1000 Bat#1/1. Loss=NaN. TrErr=0.000%. Speed=133 Samples/Sec.
muPhiMax=NaN
muPhiMin=NaN
AvgnHidLayerEff=100.0. G_Phi=NaN. G_U=NaN
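The pattern above (a finite but large G_Phi, then everything NaN two epochs later) is the kind of failure a simple watchdog can catch before the parameters are destroyed. A sketch, assuming a generic NumPy-style training loop rather than the actual BP_sLDA internals:

```python
import numpy as np

def run_epoch_with_rollback(params, run_epoch):
    """Hypothetical wrapper: snapshot the parameters, run one epoch, and if the
    result contains NaN/Inf restore the snapshot instead of continuing."""
    backup = {name: value.copy() for name, value in params.items()}
    loss = run_epoch(params)  # assumed to update params in place and return the loss
    finite = np.isfinite(loss) and all(np.all(np.isfinite(v)) for v in params.values())
    if not finite:
        params.update(backup)  # roll back the diverged step
        return None            # caller should lower mu_Phi / mu_U before retrying
    return loss
```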
Hi Mike,
Could you send us all your hyper-parameters?
With regards, Ji He
Thanks for the rapid response! Here's my command. Any parameter not set here should be at its default.
/BP_sLDA/bin/Debug/BP_sLDA.exe
--nHid 10
--nHidLayer 100
--nInput 144
--nOutput 2
--alpha 1.0
--mu_Phi 0.01
--mu_U 0.01
--nEpoch 1000
--BatchSize 1000
--flag_DumpFeature false
--nSamplesPerDisplay 1000
--ThreadNum 1
--MaxThreadDeg 1
--nEpochPerSave 50
(note to self: this is job id 741152.1)
Hi Mike,
Thanks for emailing me the hyper-parameters. We would suggest trying alpha/beta = 1.001 first, since the algorithm is usually more stable with alpha/beta > 1. Let me know if that still causes a problem.
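(A possible intuition, not stated explicitly in the thread: in the per-document MAP objective, the Dirichlet prior contributes a term (alpha - 1) * log(theta_k) per topic, with gradient (alpha - 1) / theta_k. With alpha > 1 this acts as a barrier pushing each theta_k away from 0; at alpha = 1 the barrier vanishes, and with alpha < 1 the term actively pushes components toward the boundary of the simplex, where the logs and the 1/theta_k factors blow up. That would match NaNs appearing at alpha = 1.0 but not at 1.001.)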
Surprisingly (to me), trying the strictly convex hyperparameters (alpha = 1.001, tau=1.001) did in fact make a difference vs. the plain vanilla convex hyperparameters (alpha = 1.000, tau=1.000). In 3/3 runs where I get NaN values with the latter, the former is just fine. So, thanks for the suggestion.
However, I'd still like to raise the issue that using alpha < 1.0 is often desirable in practice. Isn't there something we can do to avoid NaNs in that regime? Probably just being a bit more careful with step sizes, exps, and logs in the gradient computation would lead to a numerically stable algorithm.
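To make that concrete, here is a minimal sketch of the kind of care Mike is describing, assuming a generic multiplicative/mirror-descent-style update on the simplex (not the actual BP_sLDA layer): doing the step in log space and shifting by the maximum before exponentiating keeps exp() from overflowing, and a small floor keeps log() finite.

```python
import numpy as np

def stable_simplex_step(log_theta, grad, step_size):
    """Hypothetical simplex update computed in log space (log-sum-exp trick).
    Illustrative only; not the BP_sLDA mirror-descent layer itself."""
    log_theta = log_theta - step_size * grad       # update in log space, not on raw exponentials
    log_theta = log_theta - np.max(log_theta)      # shift so the largest entry is 0 before exp()
    theta = np.exp(log_theta)
    theta = theta / np.sum(theta)                  # renormalize onto the simplex
    return np.log(np.maximum(theta, 1e-300)), theta
```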
Great! Thanks for the feedback. Yes, we suggested trying alpha > 1.0 because that is the regime most likely to work. For alpha = 1.0 or alpha < 1.0, the algorithm can still work if you tune T carefully.
For example, if you are using the BP_sLDA code for supervised learning, in lines 261-269 you will see that we hard-coded "paramModel.T_value" for the different alpha conditions. This works fine for the tasks in our original paper, but you might need to change that part to a value that works in your scenario. We did not describe this part in detail so as not to confuse users with too many parameters.
We are working on a version that will hopefully provide a better adjustment of T_value for alpha = 1.0 and alpha < 1.0. Thanks for pointing out the issue. For now you will need to hand-tune that parameter.
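For readers looking for that spot, the block Ji describes is essentially a conditional on alpha that picks a fixed step-size constant. A hypothetical sketch of its shape (the real constants live in the BP_sLDA source around lines 261-269 and will differ):

```python
def choose_T_value(alpha):
    """Hypothetical illustration of an alpha-dependent step-size constant.
    The numbers below are placeholders, not the values used in BP_sLDA."""
    if alpha > 1.0:
        return 1.0    # placeholder: the strictly convex case tolerates larger steps
    elif alpha == 1.0:
        return 0.1    # placeholder: the borderline case needs a smaller step
    else:
        return 0.01   # placeholder: alpha < 1 is the least stable regime
```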