torch / nn


slow down on CPU computation #707

Open sixin-zh opened 8 years ago

sixin-zh commented 8 years ago

I have observed a slowdown in CPU computation speed when using momentum SGD on a basic neural network. I am not sure what is happening, but below is a piece of code that reproduces the slowdown.

require 'torch'
require 'nn'
require 'sys'

torch.setdefaulttensortype('torch.FloatTensor')

local lr=1e-5
local mom=0.9

local im=torch.FloatTensor(1,1,32,32):fill(1)
local y=torch.FloatTensor(1):fill(1)

local H=800
local m=nn.Sequential()
m:add(nn.Reshape(1*32*32))
m:add(nn.Linear(32*32,H))
m:add(nn.Threshold())
m:add(nn.Linear(H,10))
m:add(nn.LogSoftMax())
local c=nn.ClassNLLCriterion()

local x,g = m:getParameters()
local v=x:clone():zero()

local T=100000000
for t=0,T do
   if t%100==0 then
      sys.tic() -- time a single iteration once every 100 iterations
   end

   -- momentum step: decay the velocity buffer and move the parameters along it
   if mom>0 then
      v:mul(mom)
      x:add(v)
   end
   g:zero()
   local preds = m:forward(im)
   local loss = c:forward(preds,y)
   local dldps = c:backward(preds,y)
   m:backward(im,dldps)
   -- accumulate the scaled gradient into the velocity and update the parameters
   if mom > 0 then
      v:add(-lr,g)
   end
   x:add(-lr,g)

   if t%100==0 then
      print(t,loss,sys.toc())
   end
end

The log is attached: test_mul.lua.txt

soumith commented 8 years ago

Some basic things to check are:

  1. which BLAS is being used
  2. whether the OMP_NUM_THREADS and MKL_NUM_THREADS environment variables are set to 4 or less (a quick way to check this from inside Torch is sketched below)

Do you have a specific time-frame where things were faster?
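
A minimal sketch of how one might check these settings from inside Torch; torch.getnumthreads, torch.setnumthreads, and os.getenv are standard calls, and the exact printout format is just illustrative:

require 'torch'

-- print the thread-related settings mentioned in point 2 above
print('OMP_NUM_THREADS = ' .. tostring(os.getenv('OMP_NUM_THREADS')))
print('MKL_NUM_THREADS = ' .. tostring(os.getenv('MKL_NUM_THREADS')))
print('torch.getnumthreads() = ' .. torch.getnumthreads())

-- optionally pin Torch itself to a single thread for a cleaner timing comparison
torch.setnumthreads(1)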

sixin-zh commented 8 years ago

The slowdown really means a slowdown over time: if I set mom=0, it disappears. I was expecting roughly constant speed, but with mom=0.9 the speed drops over time. I see this behavior with either MKL or LAPACK. Both OMP_NUM_THREADS and MKL_NUM_THREADS are set to 1.

Using MKL, with mom=0.9, I got (columns: iteration, loss, per-iteration time in seconds):

0 2.5100255012512 0.011016130447388
100 0.66264617443085 0.0032329559326172
200 0.22685422003269 0.0031478404998779
300 0.12335408478975 0.0031149387359619
400 0.082214340567589 0.0030910968780518
500 0.060776494443417 0.0034520626068115
600 0.04783970117569 0.0031988620758057
700 0.039239812642336 0.0031299591064453
800 0.033146567642689 0.0074870586395264
900 0.028616171330214 0.008720874786377
1000 0.02512283064425 0.0095441341400146
1100 0.022353429347277 0.010173082351685
1200 0.02010739594698 0.010791063308716
1300 0.01825238391757 0.010673046112061
1400 0.016695393249393 0.011647939682007
1500 0.015371473506093 0.011360883712769
1600 0.014233120717108 0.011528968811035
1700 0.013244273141026 0.012006044387817
1800 0.012377787381411 0.011404991149902
1900 0.01161306258291 0.011420965194702
2000 0.01093322224915 0.011347055435181
2100 0.010325570590794 0.011332035064697
2200 0.0097787566483021 0.011455059051514
2300 0.0092844618484378 0.011557102203369
2400 0.0088354954496026 0.011486053466797
2500 0.0084261745214462 0.011472940444946
2600 0.0080514112487435 0.011543035507202
2700 0.0077072316780686 0.011377096176147
2800 0.0073897331021726 0.011581897735596
2900 0.0070963171310723 0.011600017547607
3000 0.006824295502156 0.011609792709351
3100 0.0065714125521481 0.011682987213135
3200 0.0063358806073666 0.01166296005249

But with mom=0, I got:

0 2.1382942199707 0.0051581859588623
100 1.8901830911636 0.0013101100921631
200 1.6635146141052 0.0013139247894287
300 1.4575979709625 0.0013079643249512
400 1.2749325037003 0.0013079643249512
500 1.1151279211044 0.0013480186462402
600 0.97617137432098 0.0013389587402344
700 0.85673636198044 0.0013830661773682
800 0.75441092252731 0.0013320446014404
900 0.66665828227997 0.0013210773468018
1000 0.59175890684128 0.0013298988342285
1100 0.52779859304428 0.0013201236724854
1200 0.47321870923042 0.0013260841369629
1300 0.42660436034203 0.0013449192047119
1400 0.38649094104767 0.0013198852539062
1500 0.35185977816582 0.001323938369751
1600 0.32182741165161 0.0013229846954346
1700 0.29569667577744 0.001295804977417
1800 0.27281478047371 0.0013279914855957
1900 0.25272524356842 0.001309871673584
2000 0.23495030403137 0.0013370513916016
2100 0.21916681528091 0.0013291835784912
2200 0.20511558651924 0.0013201236724854
2300 0.19252477586269 0.001317024230957
2400 0.18118913471699 0.0013518333435059
2500 0.17094573378563 0.0013079643249512
2600 0.1616600304842 0.0013132095336914
2700 0.15321642160416 0.0013070106506348
2800 0.14550840854645 0.0013649463653564
2900 0.13845765590668 0.001399040222168
3000 0.13199420273304 0.001323938369751
3100 0.1260517090559 0.0013618469238281
3200 0.12056498974562 0.0013010501861572
3300 0.11548553407192 0.001317024230957
3400 0.1107726469636 0.0013158321380615
3500 0.1063905954361 0.0013489723205566

Also, curiously, if you comment out the forward and backward propagation, this slowdown seems to disappear as well.
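
To separate the two effects, here is a minimal sketch of a micro-benchmark that isolates just the momentum update; the vector size and iteration counts are assumptions that roughly mirror the repro above:

require 'torch'
require 'sys'

torch.setdefaulttensortype('torch.FloatTensor')

-- roughly the parameter count of the 32*32 -> 800 -> 10 network above
local n = (32*32)*800 + 800 + 800*10 + 10
local x = torch.rand(n)
local v = torch.zeros(n)
local mom = 0.9

for t = 0, 100000 do
   if t % 100 == 0 then sys.tic() end
   v:mul(mom)   -- the two lines under suspicion
   x:add(v)
   if t % 100 == 0 then print(t, sys.toc()) end
end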

soumith commented 8 years ago

Try mom = 0.9 and initialize the momentum vector from a random distribution. I suspect that on modern CPUs, an all-zeros or all-ones input is taken through a faster codepath.

sixin-zh commented 8 years ago

I tried, but it becomes worse.

Using mom=0.9 and initializing v with local v = torch.rand(x:size()), I got:

0 0 0.007498025894165
100 0 0.0031180381774902
200 0 0.0031039714813232
300 0 0.0032730102539062
400 0 0.0037717819213867
500 0 0.0032417774200439
600 0 0.0032210350036621
700 0 0.0031719207763672
800 0 0.014176845550537
900 0 0.099176168441772
1000 0 0.09962010383606
1100 0 0.099490880966187
1200 0 0.099082946777344

soumith commented 8 years ago

At this point, I think it's just some kind of cache spilling pattern. The Torch code involved in the momentum calculation is stateless and very simple.

Maybe when the momentum calculation is enabled, the cache somehow overflows and accesses spill into L2 or main memory. I verified that with a sufficiently small net (H=8), you do not see this slowdown effect with or without momentum.

I'd say, verify it with a sufficiently large network, where I think you also won't have this issue.

sixin-zh commented 8 years ago

I am trying to understand what happens when a float tensor is multiplied by a float number. Is there any memory copy going on? c.f. https://github.com/torch/torch7/blob/master/generic/TensorOperator.c#L111

soumith commented 8 years ago

No memory copy for x:mul(0.5). Here, x is a tensor. These lines: https://github.com/torch/torch7/blob/master/TensorMath.lua#L286-L290 ensure that if t2 is not given, it defaults to t1, which is x. Equivalent calls are x:mul(x, 0.5) and torch.mul(x, x, 0.5).

When it goes into the C side of things, both tensor1 and tensor2 are given, so it goes directly into the else branch: https://github.com/torch/torch7/blob/master/generic/TensorOperator.c#L114
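
Put concretely, a small sketch assuming stock Torch7 semantics — the three calls below all perform the same in-place multiply on x, with no extra tensor allocated:

local x = torch.FloatTensor(4):fill(2)
x:mul(0.5)             -- in place: x <- x * 0.5
x:mul(x, 0.5)          -- explicit form: result tensor and source tensor are both x
torch.mul(x, x, 0.5)   -- same call through the torch table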

soumith commented 8 years ago

Actually, you are looking at TensorOperator.c, which is also the wrong place. TensorOperator.c is for the operation y = x * 0.5. The TensorMath.lua file that I pointed out gives the correct place where the Lua side goes into C.
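
For contrast, a small sketch (again assuming stock Torch7 semantics) of the two paths:

local x = torch.FloatTensor(4):fill(2)
x:mul(0.5)          -- method form (TensorMath.lua path): modifies x in place, x is now all 1s
local y = x * 0.5   -- operator form (TensorOperator.c path): allocates a new tensor y, x is unchanged
print(x[1], y[1])   -- prints 1  0.5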

sixin-zh commented 8 years ago

I see, so these are the lines that run for x:mul(0.5)? https://github.com/torch/torch7/blob/master/lib/TH/generic/THTensorMath.c#L468-L475

soumith commented 8 years ago

yes

sixin-zh commented 8 years ago

Thanks! So the question is what could cause the cache spilling. In the for loop of the test code there should be no new memory being accessed. Why would we see the fast speed at the beginning?

soumith commented 8 years ago

I am definitely lost there; at this point I don't know. Modern processors are complex machinery. Maybe there is a turbo boost that can only happen in bursts, maybe cache spilling, maybe something else entirely.

soumith commented 8 years ago

Using Linux's perf tools should give much more insight. They will show cache misses and trace hot code. https://perf.wiki.kernel.org/index.php/Main_Page