Open michaelchughes opened 7 years ago
First, thanks for making this implementation open source! Your hard work is appreciated!
I've been trying the BP_sLDA code on a few small-size topic modeling problems, just to get a sense of its behavior relative to a few other baselines. I'm finding that the updates may not be numerically stable: frequently, after ~50 updates or so, the learned topic-word parameters become all NaN.
Dataset stats:
- 1000 docs
- 144 vocab words
Model stats:
- 10 topics (--nHid)
Alg stats:
- batch_size 1000
- phi learning rate = 0.05 (mu_Phi)
- u learning rate = 0.05 (mu_U)
I've tried several values for the number of hidden layers (--nHidLayer of 10, 25, and 100). NaNs generally seem more likely with more hidden layers: with 100 hidden layers I get NaNs every time, while with fewer I get them less often (though still occasionally).
I'm digging into which updates exactly might cause the NaN values, but no answers yet. Any ideas?
Thanks, Mike, I will look into this issue. Can you first check the debug information and see whether the gradient is too large? That might cause NaN, in which case you will need to decrease your learning rate, but I will look into this further in any case.
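For reference, a minimal sketch (Python, hypothetical names; not taken from the BP_sLDA code) of the kind of check being suggested here: inspect the largest gradient entry and shrink the step when it is too large, rather than letting the update diverge.

```python
import numpy as np

def safe_phi_update(phi, grad_phi, lr, max_grad=100.0):
    """Hypothetical helper: scale the step down when the gradient is too large.
    Illustrative only; this is not how BP_sLDA actually applies its updates."""
    g_max = np.max(np.abs(grad_phi))                 # the quantity to watch in the debug output
    if not np.isfinite(g_max):
        raise FloatingPointError("gradient already contains NaN/Inf")
    scale = min(1.0, max_grad / max(g_max, 1e-12))   # equivalent to temporarily lowering the learning rate
    return phi - lr * scale * grad_phi
```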
Hi Mike,
Also, could you let us know your values for alpha and beta? If you set alpha < 1 with a large T, that could be another possible cause.
I've set alpha = 1.0 and beta = 1.0. Here are some debug traces, where I think "G_Phi" is the printed value of the expression Grad.grad_Q_Phi.MaxAbsValue():
############## Epoch #1. BatchSize: 1000 Learning Rate: Phi:0.01, U:0.01
Ep#1/1000 Bat#1/1. Loss=6.034. TrErr=0.000%. Speed=115 Samples/Sec.
muPhiMax=0.0001634134
muPhiMin=1.898668E-05
AvgnHidLayerEff=100.0. G_Phi=967.724. G_U=0.013
############## Epoch #2. BatchSize: 1000 Learning Rate: Phi:0.01, U:0.01
Ep#2/1000 Bat#1/1. Loss=4.396. TrErr=0.000%. Speed=109 Samples/Sec.
muPhiMax=0.0001407533
muPhiMin=2.436155E-05
AvgnHidLayerEff=100.0. G_Phi=458.980. G_U=0.005
############## Epoch #4. BatchSize: 1000 Learning Rate: Phi:0.01, U:0.01
Ep#4/1000 Bat#1/1. Loss=NaN. TrErr=0.000%. Speed=133 Samples/Sec.
muPhiMax=NaN
muPhiMin=NaN
AvgnHidLayerEff=100.0. G_Phi=NaN. G_U=NaN
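The pattern above (a finite but large G_Phi, then everything NaN two epochs later) is the kind of failure a simple watchdog can catch before the parameters are destroyed. A sketch, assuming a generic NumPy-style training loop rather than the actual BP_sLDA internals:

```python
import numpy as np

def run_epoch_with_rollback(params, run_epoch):
    """Hypothetical wrapper: snapshot the parameters, run one epoch, and if the
    result contains NaN/Inf restore the snapshot instead of continuing."""
    backup = {name: value.copy() for name, value in params.items()}
    loss = run_epoch(params)  # assumed to update params in place and return the loss
    finite = np.isfinite(loss) and all(np.all(np.isfinite(v)) for v in params.values())
    if not finite:
        params.update(backup)  # roll back the diverged step
        return None            # caller should lower mu_Phi / mu_U before retrying
    return loss
```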
Hi Mike,
Could you send us all your hyper-parameters?
With regards, Ji He
Thanks for the rapid response! Here's my command. Any parameter not set here should be at its default.
/BP_sLDA/bin/Debug/BP_sLDA.exe
--nHid 10
--nHidLayer 100
--nInput 144
--nOutput 2
--alpha 1.0
--mu_Phi 0.01
--mu_U 0.01
--nEpoch 1000
--BatchSize 1000
--flag_DumpFeature false
--nSamplesPerDisplay 1000
--ThreadNum 1
--MaxThreadDeg 1
--nEpochPerSave 50
(note to self: this is job id 741152.1)
Hi Mike,
Thanks for emailing me the hyper-parameters. We would suggest trying alpha/beta = 1.001 first, since the algorithm is usually more stable with alpha/beta > 1. Let me know if that still causes a problem.
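(A possible intuition, not stated explicitly in the thread: in the per-document MAP objective, the Dirichlet prior contributes a term (alpha - 1) * log(theta_k) per topic, with gradient (alpha - 1) / theta_k. With alpha > 1 this acts as a barrier pushing each theta_k away from 0; at alpha = 1 the barrier vanishes, and with alpha < 1 the term actively pushes components toward the boundary of the simplex, where the logs and the 1/theta_k factors blow up. That would match NaNs appearing at alpha = 1.0 but not at 1.001.)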
Surprisingly (to me), trying the strictly convex hyperparameters (alpha = 1.001, tau=1.001) did in fact make a difference vs. the plain vanilla convex hyperparameters (alpha = 1.000, tau=1.000). In 3/3 runs where I get NaN values with the latter, the former is just fine. So, thanks for the suggestion.
However, I'd still like to raise the issue that using alpha < 1.0 is often desirable in practice. Isn't there something we can do to avoid NaNs in that regime? Probably just being a bit more careful with step sizes, exps, and logs in the gradient computation would lead to a numerically stable algorithm.
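To make that concrete, here is a minimal sketch of the kind of care Mike is describing, assuming a generic multiplicative/mirror-descent-style update on the simplex (not the actual BP_sLDA layer): doing the step in log space and shifting by the maximum before exponentiating keeps exp() from overflowing, and a small floor keeps log() finite.

```python
import numpy as np

def stable_simplex_step(log_theta, grad, step_size):
    """Hypothetical simplex update computed in log space (log-sum-exp trick).
    Illustrative only; not the BP_sLDA mirror-descent layer itself."""
    log_theta = log_theta - step_size * grad       # update in log space, not on raw exponentials
    log_theta = log_theta - np.max(log_theta)      # shift so the largest entry is 0 before exp()
    theta = np.exp(log_theta)
    theta = theta / np.sum(theta)                  # renormalize onto the simplex
    return np.log(np.maximum(theta, 1e-300)), theta
```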
Great! Thanks for the feedback. Yes, we suggested trying alpha > 1.0 because that is the regime most likely to work. For alpha = 1.0 or alpha < 1.0, the algorithm can still work if you tune T carefully.
For example, if you are using the BP_sLDA code for supervised learning, in lines 261-269 you will see that we hard-coded "paramModel.T_value" for the different alpha conditions. This works fine for the tasks in our original paper, but you might need to change that part to a value that works in your scenario. We did not describe this part in detail so as not to confuse users with too many parameters.
We are working on a version that will hopefully provide a better adjustment of T_value for alpha = 1.0 and alpha < 1.0. Thanks for pointing out the issue. For now you will need to hand-tune that parameter.
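For readers looking for that spot, the block Ji describes is essentially a conditional on alpha that picks a fixed step-size constant. A hypothetical sketch of its shape (the real constants live in the BP_sLDA source around lines 261-269 and will differ):

```python
def choose_T_value(alpha):
    """Hypothetical illustration of an alpha-dependent step-size constant.
    The numbers below are placeholders, not the values used in BP_sLDA."""
    if alpha > 1.0:
        return 1.0    # placeholder: the strictly convex case tolerates larger steps
    elif alpha == 1.0:
        return 0.1    # placeholder: the borderline case needs a smaller step
    else:
        return 0.01   # placeholder: alpha < 1 is the least stable regime
```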