torch / optim

A numeric optimization package for Torch.

ConfusionMatrix: Stochastic bug with batchAdd #141

Closed. akhilsbehl closed this issue 7 years ago.

akhilsbehl commented 8 years ago

I'm running into a bug that appears and disappears for no apparent reason when using ConfusionMatrix.batchAdd. See these two consecutive runs:

     COMMND  asb  ~  git  mnist  src  th train.lua --printstep 20 --skiplog --cuda                       [284/403825]
    72 of 45000 training records will be unused per epoch.
    24 of 15000 validation records will be unused per epoch.
    [2016-10-14 18:40:09] Finished epoch = 1, batch = 20, with loss = 1.1431102752686.
    [2016-10-14 18:40:10] Finished epoch = 1, batch = 40, with loss = 0.98379397392273.
    [2016-10-14 18:40:10] Finished epoch = 1, batch = 60, with loss = 0.69640064239502.
    [2016-10-14 18:40:11] Finished epoch = 1, batch = 80, with loss = 0.53388464450836.
    [2016-10-14 18:40:11] Finished epoch = 1, batch = 100, with loss = 0.42102938890457.
    [2016-10-14 18:40:12] Finished epoch = 1, batch = 120, with loss = 0.69019424915314.
    [2016-10-14 18:40:13] Finished epoch = 1, batch = 140, with loss = 0.28126338124275.
    [2016-10-14 18:40:13] Finished epoch = 1, batch = 160, with loss = 0.31771036982536.
    [2016-10-14 18:40:14] Finished epoch = 1, batch = 180, with loss = 0.36902123689651.
    [2016-10-14 18:40:15] Finished epoch = 1, batch = 200, with loss = 0.15535597503185.
    [2016-10-14 18:40:15] Finished epoch = 1, batch = 220, with loss = 0.26898837089539.
    [2016-10-14 18:40:16] Finished epoch = 1, batch = 240, with loss = 0.2337928712368.
    [2016-10-14 18:40:16] Finished epoch = 1, batch = 260, with loss = 0.19574552774429.
    [2016-10-14 18:40:17] Finished epoch = 1, batch = 280, with loss = 0.37691986560822.
    [2016-10-14 18:40:18] Finished epoch = 1, batch = 300, with loss = 0.27491936087608.
    [2016-10-14 18:40:18] Finished epoch = 1, batch = 320, with loss = 0.36371386051178.
    [2016-10-14 18:40:19] Finished epoch = 1, batch = 340, with loss = 0.15922805666924.
    /home/asb/torch/install/bin/luajit: ...sb/torch/install/share/lua/5.1/optim/ConfusionMatrix.lua:117: bad argument #1 to
    'indexAdd' (out of range at /home/asb/torch/pkg/torch/lib/TH/generic/THTensor.c:729)
    stack traceback:
            [C]: in function 'indexAdd'
            ...sb/torch/install/share/lua/5.1/optim/ConfusionMatrix.lua:117: in function 'batchAdd'
            train.lua:153: in main chunk
            [C]: in function 'dofile'
            .../asb/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
            [C]: at 0x00405b80
     COMMND  asb  ~  git  mnist  src  th train.lua --printstep 20 --skiplog --cuda                         master 
    72 of 45000 training records will be unused per epoch.
    24 of 15000 validation records will be unused per epoch.
    [2016-10-14 18:43:17] Finished epoch = 1, batch = 20, with loss = 2.0140626430511.
    [2016-10-14 18:43:18] Finished epoch = 1, batch = 40, with loss = 0.97827231884003.
    [2016-10-14 18:43:18] Finished epoch = 1, batch = 60, with loss = 0.62330090999603.
    [2016-10-14 18:43:19] Finished epoch = 1, batch = 80, with loss = 0.73870342969894.
    [2016-10-14 18:43:19] Finished epoch = 1, batch = 100, with loss = 0.61164426803589.
    [2016-10-14 18:43:20] Finished epoch = 1, batch = 120, with loss = 0.40717771649361.
    [2016-10-14 18:43:21] Finished epoch = 1, batch = 140, with loss = 0.46196541190147.
    [2016-10-14 18:43:21] Finished epoch = 1, batch = 160, with loss = 0.7626816034317.
    [2016-10-14 18:43:22] Finished epoch = 1, batch = 180, with loss = 0.42969378829002.
    [2016-10-14 18:43:23] Finished epoch = 1, batch = 200, with loss = 0.42102152109146.
    [2016-10-14 18:43:23] Finished epoch = 1, batch = 220, with loss = 0.34528177976608.
    [2016-10-14 18:43:24] Finished epoch = 1, batch = 240, with loss = 0.32393988966942.
    [2016-10-14 18:43:24] Finished epoch = 1, batch = 260, with loss = 0.25361078977585.
    [2016-10-14 18:43:25] Finished epoch = 1, batch = 280, with loss = 0.35111820697784.
    [2016-10-14 18:43:26] Finished epoch = 1, batch = 300, with loss = 0.35840207338333.
    [2016-10-14 18:43:26] Finished epoch = 1, batch = 320, with loss = 0.19336950778961.
    [2016-10-14 18:43:27] Finished epoch = 1, batch = 340, with loss = 0.23242954909801.
    Total accuracy of classifier at completion of epoch 1 = 92.062784433365.
    Mean accuracy across classes at completion of epoch 1 = 92.140758547009.
    [2016-10-14 18:43:29] Finished epoch = 2, batch = 20, with loss = 0.45858466625214.
    [2016-10-14 18:43:29] Finished epoch = 2, batch = 40, with loss = 0.22427660226822.
    [2016-10-14 18:43:30] Finished epoch = 2, batch = 60, with loss = 0.2953850030899.
    [2016-10-14 18:43:31] Finished epoch = 2, batch = 80, with loss = 0.2055009752512.

The first run fails while the second succeeds, with no changes whatsoever in between.

Moreover, the first argument to indexAdd, the one the stack trace reports as bad, is hard-coded to the value 1, so I'm not sure how user code could even affect it.
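
For what it's worth, the out-of-range check appears to fire on the values inside the index tensor that batchAdd passes to indexAdd (presumably derived from the predictions and targets), not on the dimension argument itself. A minimal sketch in plain Torch (not the training code above; the exact error wording may differ) that produces the same kind of failure:

    -- Minimal sketch, not from ConfusionMatrix.lua: an index value larger than
    -- the matrix size makes indexAdd fail, even though the dimension argument
    -- is 1 in both calls.
    require 'torch'

    local nclasses = 10
    local mat = torch.zeros(nclasses, nclasses)

    local good = torch.LongTensor{1, 3, 10}   -- all within 1..nclasses
    local bad  = torch.LongTensor{1, 3, 11}   -- 11 exceeds nclasses

    mat:indexAdd(1, good, torch.ones(3, nclasses))  -- works
    mat:indexAdd(1, bad,  torch.ones(3, nclasses))  -- raises an out-of-range error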

My code is available here for reference.

Any ideas to debug this?
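
For now, the only thing I can think of is to validate the labels right before the batchAdd call. A hypothetical helper along these lines (not part of the linked code) would at least show whether a target or predicted class ever falls outside 1..nclasses:

    -- Hypothetical sanity check, not part of the linked train.lua: assert that
    -- every label lies in 1..nclasses before the batch reaches batchAdd.
    local function checkLabels(labels, nclasses, name)
       local lo, hi = labels:min(), labels:max()
       assert(lo >= 1 and hi <= nclasses,
          string.format('%s out of range: min=%s max=%s (expected 1..%d)',
                        name, tostring(lo), tostring(hi), nclasses))
    end

    -- Possible usage in the training loop, before confusion:batchAdd(outputs, targets):
    --   local _, preds = outputs:float():max(2)
    --   checkLabels(preds, nclasses, 'predictions')
    --   checkLabels(targets, nclasses, 'targets')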

Thanks.

akhilsbehl commented 8 years ago

Sorry, this was a false positive.