intel / caffe

This fork of BVLC/Caffe is dedicated to improving performance of this deep learning framework when running on CPU, in particular Intel® Xeon processors.

SGDFusion function issue #256

Open · minkkang opened this issue 5 years ago

minkkang commented 5 years ago

Hi, I'm an Intel Caffe user.

I think I found a wrong flow in the SGDFusion function (sgd_solver.cpp).

When building with GCC, or when not using "iter_size", there is no problem. But when building with the Intel compiler and using "iter_size", LARS misbehaves.

As far as I know, the SGD_FUSION option is turned on when building with the Intel compiler.

In "SGD_FUSION" flow, it is executed in the order of "GetLocalRate(it includes LARS)", "normalize" , "regularization & update".

Here, "normalize" divides diff_data (mutable_cpu_diff or mutable_prv_diff) by "iter_size", but LARS depends on sumsq_diff and sumsq_data.

So I think "GetLocalRate" should be executed after "normalize". For example, with iter_size = 4 the not-yet-normalized ||diff|| is 4x the mini-batch gradient norm, so a local rate computed before "normalize" comes out roughly 4x too small (when the weight-decay term is small).

After changing the SGD_FUSION flow to "normalize" -> "GetLocalRate" -> "regularization & update", LARS works fine.
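
For illustration, here is a minimal sketch of that reordering. The member names below (Normalize, GetLocalRate, Regularize, ComputeUpdateValue) follow the SGDSolver code discussed in this thread, but the function FusedUpdateSketch and its body are illustrative only, not the actual SGDFusion implementation:

template <typename Dtype>
void SGDSolver<Dtype>::FusedUpdateSketch() {  // hypothetical name, not the real entry point
  for (int param_id = 0;
       param_id < this->net_->learnable_params().size(); ++param_id) {
    // 1. Divide the accumulated diff by iter_size first, so the sumsq_diff
    //    seen by LARS reflects the true mini-batch gradient norm.
    Normalize(param_id);
    // 2. Only now compute the LARS local rate from ||data|| and ||diff||.
    //    Running this before Normalize leaves ||diff|| iter_size times too
    //    large, so the local rate comes out too small.
    Dtype local_rate = GetLocalRate(param_id);
    // 3. Apply weight decay and the momentum/SGD update as before.
    Regularize(param_id);
    ComputeUpdateValue(param_id, local_rate);
  }
}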

Would you check the SGD_FUSION flow?

ftian1 commented 5 years ago

I think you are right. Could you contribute a PR for us to merge?

minkkang commented 5 years ago

Yes, I'll do the PR. Thanks.

ftian1 commented 5 years ago

@minkkang hi, would you move GetLocalRate() behind regularization(), since the latter also changes the learning_param.diff? That would make it exactly the same as the no-fusion version.

minkkang commented 5 years ago

@ftian1

Hi,

When I analyzed the "SGDFusion" version (GetLocalRate -> Regularize & update) and the "NON-SGDFusion" version (Regularize -> GetLocalRate -> update), I'm not certain, but I think the "SGDFusion" version is the right one, provided GetLocalRate (LARS) is executed after "Normalize".

Reading the LARS paper ("Large Batch Training of Convolutional Networks with Layer-wise Adaptive Rate Scaling" [1]),

the flow of LARS is as follows:


Parameters: base LR γ0, momentum m, weight decay β, LARS coefficient η, number of steps T

g[t] ← ∇L(w[t])                                      // obtain a stochastic gradient for the current mini-batch (1)
γ[t] ← γ0 * (1 − t/T)^2                              // compute the global learning rate (2)
λ ← ||w[t]|| / (||g[t]|| + β * ||w[t]||)             // compute the local LR λ (3)
v[t+1] ← m * v[t] + γ[t+1] * λ * (g[t] + β * w[t])   // update the momentum (4)
       = m * v[t] + γ[t+1] * { ||w[t]|| / (||∇L(w[t])|| + β * ||w[t]||) } * (∇L(w[t]) + β * w[t])
w[t+1] ← w[t] − v[t+1]                               // update the weights (5)
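
Step (3) is just a ratio of two L2 norms. As a self-contained sketch of that computation (LarsLocalRate is a hypothetical helper; in the solver the sums of squares would come from the blob's sumsq_data()/sumsq_diff() reductions):

#include <cmath>

// λ = ||w|| / (||g|| + β * ||w||), LARS eq. (3).
// sumsq_data / sumsq_diff: sums of squares of the weights and of the
// (already normalized) gradient; beta: weight decay.
double LarsLocalRate(double sumsq_data, double sumsq_diff, double beta) {
  const double w_norm = std::sqrt(sumsq_data);
  const double g_norm = std::sqrt(sumsq_diff);
  const double denom = g_norm + beta * w_norm;
  return denom > 0.0 ? w_norm / denom : 1.0;  // fall back to the global LR alone
}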


But the flow of the "NON-SGDFusion" version is as follows:


g[t] ← ∇L(w[t])                                      // obtain a stochastic gradient (1)
γ[t] ← γ0 * (1 − t/T)^2                              // compute the global learning rate (2)

// Call Normalize
// Call Regularize
g[t] ← β * w[t] + g[t]                               // now g[t] = ∇L(w[t]) + β * w[t]

// Call ComputeUpdateValue
λ ← ||w[t]|| / (||g[t]|| + β * ||w[t]||)             // compute the local LR λ (3)
  = ||w[t]|| / (||∇L(w[t]) + β * w[t]|| + β * ||w[t]||)
v[t+1] ← m * v[t] + γ[t+1] * λ * g[t]                // update the momentum (4)
       = m * v[t] + γ[t+1] * { ||w[t]|| / (||∇L(w[t]) + β * w[t]|| + β * ||w[t]||) } * (∇L(w[t]) + β * w[t])
w[t+1] ← w[t] − v[t+1]                               // update the weights (5)


In this flow, the v[t+1] value differs from the paper:

// LARS original:  v[t+1] ← m * v[t] + γ[t+1] * { ||w[t]|| / (||∇L(w[t])|| + β * ||w[t]||) } * (∇L(w[t]) + β * w[t])
// NON-SGDFusion:  v[t+1] ← m * v[t] + γ[t+1] * { ||w[t]|| / (||∇L(w[t]) + β * w[t]|| + β * ||w[t]||) } * (∇L(w[t]) + β * w[t])
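
To get a feel for the size of that discrepancy (made-up numbers, ignoring momentum): with ||w[t]|| = 1, ||∇L(w[t])|| = 0.01 and β = 0.0005, the paper gives λ = 1 / (0.01 + 0.0005) ≈ 95.2, while the NON-SGDFusion formula can give λ = 1 / (0.0105 + 0.0005) ≈ 90.9 when ∇L(w[t]) and w[t] are aligned, i.e. a few percent smaller, and the bias grows with β.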

The flow of the "SGDFusion" version is as follows:


g[t] ← ∇L(w[t])                                      // obtain a stochastic gradient (1)
γ[t] ← γ0 * (1 − t/T)^2                              // compute the global learning rate (2)

// Call SGDFusion
λ ← ||w[t]|| / (||g[t]|| + β * ||w[t]||)             // compute the local LR λ (3)
  = ||w[t]|| / (||∇L(w[t])|| + β * ||w[t]||)

// execute Normalize (it should be executed before GetLocalRate)

v[t+1] ← m * v[t] + γ[t+1] * λ * (g[t] + β * w[t])   // update the momentum (4)
       = m * v[t] + γ[t+1] * { ||w[t]|| / (||∇L(w[t])|| + β * ||w[t]||) } * (∇L(w[t]) + β * w[t])
w[t+1] ← w[t] − v[t+1]                               // update the weights (5)


In this flow, the v[t+1] value is the same:

// LARS original:  v[t+1] ← m * v[t] + γ[t+1] * { ||w[t]|| / (||∇L(w[t])|| + β * ||w[t]||) } * (∇L(w[t]) + β * w[t])
// SGDFusion:      v[t+1] ← m * v[t] + γ[t+1] * { ||w[t]|| / (||∇L(w[t])|| + β * ||w[t]||) } * (∇L(w[t]) + β * w[t])

I think, "SGDFusion" version looks same as LARS algorithm in [1].

So I think we just have to change the flow so that "GetLocalRate" is executed after "Normalize".

If I'm right, I'll change the "NON-SGDFusion" version accordingly.
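
Concretely, the change to the no-fusion path could look roughly like this (a sketch with illustrative names, assuming the usual Normalize/Regularize/ComputeUpdateValue helpers; not the actual ApplyUpdate code):

template <typename Dtype>
void SGDSolver<Dtype>::ApplyUpdateSketch() {     // hypothetical name
  Dtype global_rate = GetLearningRate();         // γ[t]
  for (int param_id = 0;
       param_id < this->net_->learnable_params().size(); ++param_id) {
    Normalize(param_id);                         // diff /= iter_size
    Dtype local_rate = GetLocalRate(param_id);   // λ from the raw ∇L(w[t]), before weight decay
    Regularize(param_id);                        // diff += β * data
    ComputeUpdateValue(param_id, global_rate * local_rate);
  }
  this->net_->Update();
}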

minkkang commented 5 years ago

@ftian1 Hi, may I change the flow of the no-fusion version?

ftian1 commented 5 years ago

Sorry for the late response due to Chinese New Year. Yes, I think your analysis is right. The non-fusion version should be updated.

minkkang commented 5 years ago

Thank you for the reply. I'll open a PR after changing the flow.