minkkang opened this issue 5 years ago
I think you are right. Could you contribute a PR so we can merge it?
Yes, I'll do the PR. Thanks.
@minkkang Hi, should you move GetLocalRate() after regularization(), since the latter also changes the learning_param.diff? That would make it exactly the same as the no-fusion version.
@ftian1
Hi,
When I analyzed the "SGDFusion" version (GetLocalRate -> Regularize & update) and the "NON-SGDFusion" version (Regularize -> GetLocalRate -> update), I'm not sure, but I think the "SGDFusion" version is right, provided GetLocalRate (LARS) is executed after "Normalize".
Reading the LARS paper ("LARGE BATCH TRAINING OF CONVOLUTIONAL NETWORKS WITH LAYER-WISE ADAPTIVE RATE SCALING" [1]), the flow of LARS is as follows:
Parameters: base LR γ0, momentum m, weight decay β, LARS coefficient η, number of steps T
g[t] ←∇L(w[t]) // obtain a stochastic gradient for the current mini-batch (1)
γ[t] ← γ0 * (1 − t/T)^2 // compute the global learning rate (2)
λ ← ||w[t]|| / (||g[t]|| + β * ||w[t]||) // compute the local LR λ (3)
v[t+1] ← mv[t] + γ[t+1] { ||w[t]|| / ( ||∇L(w[t])|| + β ||w[t]|| ) } ( ∇L(w[t]) + β w[t] ) // update the momentum (4)
w[t+1] ← w[t] - v[t+1] // update the weights (5)
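For reference, here is a minimal standalone C++ sketch of one LARS step for a single layer, following steps (1)-(5) above. The names (lars_update, w, g, v) are only illustrative and not the Intel Caffe code; the LARS coefficient η from the parameter list is written out explicitly.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch of one LARS step for a single layer (not the Intel
// Caffe implementation). gamma_t is the global LR from step (2).
void lars_update(std::vector<float>& w, const std::vector<float>& g,
                 std::vector<float>& v, float gamma_t, float m, float beta,
                 float eta) {
  // L2 norms of the weights and of the *raw* gradient; step (3) uses ||g||
  // before any weight decay has been added to the diff.
  float w_norm = 0.f, g_norm = 0.f;
  for (std::size_t i = 0; i < w.size(); ++i) {
    w_norm += w[i] * w[i];
    g_norm += g[i] * g[i];
  }
  w_norm = std::sqrt(w_norm);
  g_norm = std::sqrt(g_norm);

  // Step (3): local learning rate.
  const float lambda = eta * w_norm / (g_norm + beta * w_norm);

  // Step (4): momentum update with the regularized gradient.
  // Step (5): weight update.
  for (std::size_t i = 0; i < w.size(); ++i) {
    v[i] = m * v[i] + gamma_t * lambda * (g[i] + beta * w[i]);
    w[i] -= v[i];
  }
}
```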
But the flow of the "NON-SGDFusion" version is as follows:
g[t] ← ∇L(w[t]) // obtain a stochastic gradient for the current mini-batch (1)
γ[t] ← γ0 * (1 − t/T)^2 // compute the global learning rate (2)
// Call Normalize function
// Call Regularization function
g[t] ← ∇L(w[t]) + β * w[t]
// Call ComputeUpdateValue function
λ ← ||w[t]|| / ( ||∇L(w[t]) + β w[t]|| + β ||w[t]|| )
v[t+1] ← mv[t] + γ[t+1] { ||w[t]|| / ( ||∇L(w[t]) + β w[t]|| + β ||w[t]|| ) } ( ∇L(w[t]) + β w[t] )
w[t+1] ← w[t] - v[t+1] // update the weights (5)
In this flow, the v[t+1] value is changed:
// LARS original
v[t+1] ← mv[t] + γ[t+1] { ||w[t]|| / ( ||∇L(w[t])|| + β ||w[t]|| ) } ( ∇L(w[t]) + β w[t] )
// NON-SGDFusion
v[t+1] ← mv[t] + γ[t+1] { ||w[t]|| / ( ||∇L(w[t]) + β w[t]|| + β ||w[t]|| ) } ( ∇L(w[t]) + β * w[t] )
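To make the difference concrete, here is a small standalone sketch with made-up numbers (not values from the solver). It computes λ once from the raw gradient, as in the paper, and once from the diff after Regularize has already added β*w; the denominators differ, so the two λ values differ.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Made-up example: the local LR from the raw gradient (paper) vs. from a
// diff that already contains the weight-decay term (NON-SGDFusion order).
int main() {
  const float beta = 0.0005f;
  std::vector<float> w = {0.8f, -1.2f, 0.5f};    // weights
  std::vector<float> g = {0.02f, 0.01f, -0.03f};  // raw gradient

  float w_norm = 0.f, g_norm = 0.f, g_reg_norm = 0.f;
  for (std::size_t i = 0; i < w.size(); ++i) {
    w_norm += w[i] * w[i];
    g_norm += g[i] * g[i];
    const float g_reg = g[i] + beta * w[i];  // diff after Regularize
    g_reg_norm += g_reg * g_reg;
  }
  w_norm = std::sqrt(w_norm);
  g_norm = std::sqrt(g_norm);
  g_reg_norm = std::sqrt(g_reg_norm);

  const float lambda_paper     = w_norm / (g_norm + beta * w_norm);
  const float lambda_nonfusion = w_norm / (g_reg_norm + beta * w_norm);
  std::printf("lambda (LARS paper)    = %f\n", lambda_paper);
  std::printf("lambda (NON-SGDFusion) = %f\n", lambda_nonfusion);
  return 0;
}
```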
The flow of the "SGDFusion" version is as follows:
g[t] ← ∇L(w[t]) // obtain a stochastic gradient for the current mini-batch (1)
γ[t] ← γ0 * (1 − t/T)^2 // compute the global learning rate (2)
// Call SGDFusion function
λ ← ||w[t]|| / ( ||∇L(w[t])|| + β * ||w[t]|| )
// execute Normalize (it should be executed before GetLocalRate)
v[t+1] ← mv[t] + γ[t+1] { ||w[t]|| / ( ||∇L(w[t])|| + β ||w[t]|| ) } ( ∇L(w[t]) + β w[t] )
w[t+1] ← w[t] - v[t+1] // update the weights (5)
In this flow, the v[t+1] value is the same:
v[t+1] ← mv[t] + γ[t+1] { ||w[t]|| / ( ||∇L(w[t])|| + β ||w[t]|| ) } ( ∇L(w[t]) + β w[t] ) // LARS original
v[t+1] ← mv[t] + γ[t+1] { ||w[t]|| / ( ||∇L(w[t])|| + β ||w[t]|| ) } ( ∇L(w[t]) + β w[t] ) // SGDFusion
I think, "SGDFusion" version looks same as LARS algorithm in [1].
So i think we just have to change the flow of executing "GetLocalRate" after "normalization".
If i'm right, i'll change the "NON-SGDFusion".
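For clarity, the call order I have in mind looks roughly like the sketch below. These are stub names standing in for the solver methods discussed above, not the real sgd_solver.cpp signatures.

```cpp
// Rough sketch of the intended per-parameter update order for the
// no-fusion path (stubs only; the real SGDSolver methods differ).
struct SolverSketch {
  void Normalize(int /*param_id*/) {}                           // divide diff by iter_size
  float GetLocalRate(int /*param_id*/) { return 1.f; }          // LARS local LR from the raw diff
  void Regularize(int /*param_id*/) {}                          // add beta * w to the diff
  void ComputeUpdateValue(int /*param_id*/, float /*rate*/) {}  // momentum + weight update

  void ApplyUpdateSketch(float global_rate, int num_params) {
    for (int param_id = 0; param_id < num_params; ++param_id) {
      Normalize(param_id);                                             // 1) average over iter_size first
      const float local_rate = global_rate * GetLocalRate(param_id);   // 2) LARS on the normalized, raw diff
      Regularize(param_id);                                            // 3) weight decay added afterwards
      ComputeUpdateValue(param_id, local_rate);                        // 4) momentum + weight update
    }
  }
};
```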
@ftian1 Hi, may I change the flow of the no-fusion version?
Sorry for the late response due to Chinese New Year. Yes, I think your analysis is right; the non-fusion version should be updated.
Thank you for the reply. I'll submit a PR after changing the flow.
Hi, I'm an Intel Caffe user.
I think I found a wrong flow in the SGDFusion function (/sgd_solver.cpp).
When using the GCC compiler, or when not using "iter_size", there is no problem. But when using the Intel compiler together with "iter_size", LARS causes problems.
As far as I know, the SGD_FUSION option is turned on when using the Intel compiler.
In "SGD_FUSION" flow, it is executed in the order of "GetLocalRate(it includes LARS)", "normalize" , "regularization & update".
In this time, "normalize" divide "diff_data(mutable_cpu_diff or mutable_prv_diff)" by "iter_size". But, "LARS" is effected by sumsq_diff and sumsq_data.
So,i think "GetLocalRate" should be executed after "normalize".
After changing the SGD_FUSION flow to "Normalize" -> "GetLocalRate" -> "Regularize & update", LARS works fine.
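As a toy illustration of why the order matters with "iter_size" (made-up numbers, not taken from the solver): "Normalize" divides the diff by iter_size, so a local LR computed before "Normalize" sees the accumulated diff and comes out smaller than one computed after.

```cpp
#include <cstdio>

// Made-up numbers: the LARS local LR depends on ||diff||, so it changes
// depending on whether GetLocalRate runs before or after Normalize
// divides the diff by iter_size.
int main() {
  const float beta = 0.0005f;
  const float iter_size = 4.f;
  const float w_norm = 1.5f;         // ||w||
  const float accum_g_norm = 0.08f;  // ||diff|| accumulated over iter_size mini-batches

  // GetLocalRate before Normalize: sees the accumulated, unscaled diff.
  const float lambda_before = w_norm / (accum_g_norm + beta * w_norm);
  // GetLocalRate after Normalize: sees the averaged diff.
  const float lambda_after  = w_norm / (accum_g_norm / iter_size + beta * w_norm);

  std::printf("lambda with GetLocalRate before Normalize = %f\n", lambda_before);
  std::printf("lambda with GetLocalRate after  Normalize = %f\n", lambda_after);
  return 0;
}
```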
Would you please check the SGD_FUSION flow?