uoguelph-mlrg / theano_alexnet

Theano-based Alexnet
BSD 3-Clause "New" or "Revised" License
229 stars 115 forks

error on Windows 10 #33

Open Magotraa opened 7 years ago

Magotraa commented 7 years ago

Hi, thank you for the repository. I installed the requirements and started the process as described. I was able to prepare the preprocessed data. However, when I execute train.py, I get this error:

TypeError: Cannot convert Type TensorType(int32, vector) (of Variable <TensorType(int32, vector)>) into Type TensorType(int64, vector). You can try to manually convert <TensorType(int32, vector)> into a TensorType(int64, vector).

hma02 commented 7 years ago

@aryanbhardwaj Could you provide the full traceback of the error? I just want to see which files produce it.

Magotraa commented 7 years ago

@hma02 Thank you for your reply. I was able to resolve the error by modifying line 26 of alex_net.py to y = T.ivector('y')
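For readers hitting the same TypeError: the conflict is between int32 labels produced by the preprocessing step and an int64 (lvector) label symbol in the graph. A minimal NumPy sketch of the same dtype mismatch and the explicit cast the error message suggests (Theano is not needed to see the idea; the array values are illustrative):

```python
import numpy as np

# The preprocessed labels arrive as int32 (Theano's "ivector" dtype).
labels = np.array([3, 17, 42], dtype=np.int32)

# If the graph declares the label variable as int64 ("lvector"), Theano
# refuses the implicit conversion. An explicit cast, as the error message
# suggests, makes the dtypes agree:
labels64 = labels.astype(np.int64)

# The fix adopted in this thread goes the other way: declare the symbol
# as int32 in alex_net.py, i.e. y = T.ivector('y'), so no cast is needed.
```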

hma02 commented 7 years ago

This problem was also mentioned in #32

Magotraa commented 7 years ago

@hma02 Yes, I did refer to it, thank you. However, I am still getting this issue. Can you please suggest a solution?

Error (every epoch reports the same):

epoch 56: validation loss nan
epoch 56: validation error nan %
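One way to avoid saving 60 epochs of useless weights while debugging this is to fail fast as soon as the validation loss goes non-finite. A minimal sketch (the names check_val_loss, val_loss, and learning_rate are illustrative helpers, not part of train.py):

```python
import math

def check_val_loss(val_loss, learning_rate):
    """Raise early instead of training on once the loss has diverged."""
    if math.isnan(val_loss) or math.isinf(val_loss):
        raise RuntimeError(
            "validation loss diverged at lr=%g; try a lower learning rate "
            "and verify the image mean / label files" % learning_rate)
    return val_loss
```

A loss that is already nan at epoch 1, as in the log below, usually points at bad input (wrong image mean file, mislabeled data) or a too-high learning rate rather than a crash-style bug.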

The complete output is here:

WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL: https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 1080 (CNMeM is enabled with initial size: 80.0% of memory, cuDNN 5110)

... building the model
conv (cudnn) layer with shape_in: (3, 227, 227, 256)
conv (cudnn) layer with shape_in: (96, 27, 27, 256)
conv (cudnn) layer with shape_in: (256, 13, 13, 256)
conv (cudnn) layer with shape_in: (384, 13, 13, 256)
conv (cudnn) layer with shape_in: (384, 13, 13, 256)
fc layer with num_in: 9216 num_out: 4096
dropout layer with P_drop: 0.5
fc layer with num_in: 4096 num_out: 4096
dropout layer with P_drop: 0.5
softmax layer with num_in: 4096 num_out: 1000
... training
epoch 1: validation loss nan
epoch 1: validation error nan %
weight saved: W_0_1
weight saved: b_0_1
weight saved: W0_1_1
weight saved: W1_1_1
weight saved: b0_1_1
weight saved: b1_1_1
weight saved: W_2_1
weight saved: b_2_1
weight saved: W0_3_1
weight saved: W1_3_1
weight saved: b0_3_1
weight saved: b1_3_1
weight saved: W0_4_1
weight saved: W1_4_1
weight saved: b0_4_1
weight saved: b1_4_1
weight saved: W_5_1
weight saved: b_5_1
weight saved: W_6_1
weight saved: b_6_1
weight saved: W_7_1
weight saved: b_7_1
[the same pattern repeats for every epoch through 60: "validation loss nan", "validation error nan %", then the same 22 "weight saved" lines with the epoch number as suffix; the only additional output appears at epochs 10 and 20:]
epoch 10: validation loss nan
epoch 10: validation error nan %
('Learning rate changed to:', array(0.0009999999310821295, dtype=float32))
[...]
epoch 20: validation loss nan
epoch 20: validation error nan %
('Learning rate changed to:', array(9.99999901978299e-05, dtype=float32))
[...]
epoch 60: validation loss nan
epoch 60: validation error nan %
Optimization complete.

PyCUDA ERROR: The context stack was not empty upon module cleanup.

A context was still active when the context stack was being cleaned up. At this point in our execution, CUDA may already have been deinitialized, so there is no way we can finish cleanly. The program will be aborted now. Use Context.pop() to avoid this problem.

This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.

Process finished with exit code 0

Also:

With `para_load: True`, I get this error: `LogicError: cuIpcGetMemHandle failed: OS call failed or operation not supported on this OS`

```
(C:\Users\arjun\Anaconda2) D:\xxxxxyyyy>python train.py
Process Process-2:
Traceback (most recent call last):
  File "C:\Users\arjun\Anaconda2\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Users\arjun\Anaconda2\lib\multiprocessing\process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "D:\Rough\random\xxx\xxxxxyyyy\proc_load.py", line 98, in fun_load
    sock.bind('tcp://*:{0}'.format(sock_data))
  File "zmq/backend/cython/socket.pyx", line 495, in zmq.backend.cython.socket.Socket.bind (zmq\backend\cython\socket.c:5653)
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc (zmq\backend\cython\socket.c:10014)
    raise ZMQError(errno)
ZMQError: Address in use
```

PyCUDA ERROR: The context stack was not empty upon module cleanup.

A context was still active when the context stack was being cleaned up. At this point in our execution, CUDA may already have been deinitialized, so there is no way we can finish cleanly. The program will be aborted now. Use Context.pop() to avoid this problem.

This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.

WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL: https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

```
Using gpu device 0: GeForce GTX 1080 (CNMeM is enabled with initial size: 80.0% of memory, cuDNN 5110)
... building the model
conv (cudnn) layer with shape_in: (3, 227, 227, 256)
conv (cudnn) layer with shape_in: (96, 27, 27, 256)
conv (cudnn) layer with shape_in: (256, 13, 13, 256)
conv (cudnn) layer with shape_in: (384, 13, 13, 256)
conv (cudnn) layer with shape_in: (384, 13, 13, 256)
fc layer with num_in: 9216 num_out: 4096
dropout layer with P_drop: 0.5
fc layer with num_in: 4096 num_out: 4096
dropout layer with P_drop: 0.5
softmax layer with num_in: 4096 num_out: 1000
... training
Process Process-1:
Traceback (most recent call last):
  File "C:\Users\arjun\Anaconda2\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Users\arjun\Anaconda2\lib\multiprocessing\process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "D:\Rough\random\xxx\xxxxxyyyy\train.py", line 69, in train_net
    h = drv.mem_get_ipc_handle(gpuarray_batch.ptr)
LogicError: cuIpcGetMemHandle failed: OS call failed or operation not supported on this OS
```

PyCUDA ERROR: The context stack was not empty upon module cleanup.

A context was still active when the context stack was being cleaned up. At this point in our execution, CUDA may already have been deinitialized, so there is no way we can finish cleanly. The program will be aborted now. Use Context.pop() to avoid this problem.

This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.

It would be really helpful if I could get some suggestions or an approximate solution. Thank you in advance.

hma02 commented 7 years ago

The "ZMQError: Address in use" error happens when a previous run failed and the socket port it opened was not closed properly, causing a port conflict in the next run. You can find the process holding the port with:

netstat -ltnp

and kill the corresponding process.
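Note that `netstat -ltnp` is a Linux command; on Windows 10 the rough equivalent is `netstat -ano` plus `taskkill /PID <pid> /F`. As a cross-platform sketch (my own helper, not part of this repo; the port number would be whatever `sock_data` is set to in your config), you can probe whether the port is still held before launching:

```python
import socket

def port_in_use(port, host='127.0.0.1'):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0

# demo: open a listener on an OS-assigned port and probe it
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(('127.0.0.1', 0))
srv.listen(1)
port = srv.getsockname()[1]
print(port_in_use(port))  # True while srv is listening
srv.close()
```

If this reports the port as busy before you start training, kill the stale process first.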

For the NaN issue: if it happens from the first epoch, it could be caused by the input batches not being fed or preprocessed correctly, or by too large a learning rate. See issue #27.
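A quick sanity check on a batch before it is fed to the network can catch both failure modes. This is a hedged sketch with a synthetic batch — `check_batch` and its thresholds are mine, not from this repo; the c01b layout (channels, height, width, batch) matches the conv shapes printed above, with a small batch for the demo:

```python
import numpy as np

def check_batch(imgs, labels, n_classes=1000):
    """Basic sanity checks on one training batch before feeding the network."""
    assert np.isfinite(imgs).all(), "NaN/inf pixels in batch"
    assert imgs.std() > 0, "constant batch -- data probably not loaded"
    assert labels.dtype.kind == 'i', "labels must be integers"
    assert labels.min() >= 0 and labels.max() < n_classes, "label out of range"
    return True

# synthetic batch in c01b layout: (channels, height, width, batch)
rng = np.random.RandomState(0)
imgs = rng.randn(3, 227, 227, 8).astype('float32')
labels = rng.randint(0, 1000, size=8).astype('int32')
print(check_batch(imgs, labels))  # True
```

Running this on a real hkl batch right after loading makes "data not read" failures obvious long before the cost turns NaN.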

Magotraa commented 7 years ago

@hma02 Thanks for sharing. I am trying the suggested solutions. Is there any solution on Windows 10 for

LogicError: cuIpcGetMemHandle failed: OS call failed or operation not supported on this OS

Magotraa commented 7 years ago

@hma02 Thank you for your suggestions on the NaN issue; the problem was that the training data was not being read. It is training now. Next, I want to know how to get good accuracy results.

Can you share after how many iterations I should expect reasonable accuracy? Also, could you share an optimized hyperparameter file (config.yaml)?

current status is:

```
('training error rate:', array(0.984375))
('training @ iter = ', 2765) ('training cost:', array(6.374232292175293, dtype=float32)) ('training error rate:', array(0.99609375))
('training @ iter = ', 2770) ('training cost:', array(6.3500189781188965, dtype=float32)) ('training error rate:', array(0.984375))
('training @ iter = ', 2775) ('training cost:', array(6.216220855712891, dtype=float32)) ('training error rate:', array(0.98828125))
('training @ iter = ', 2780) ('training cost:', array(6.231907844543457, dtype=float32)) ('training error rate:', array(0.98828125))
('training @ iter = ', 2785) ('training cost:', array(6.30079460144043, dtype=float32)) ('training error rate:', array(0.99609375))
```

Magotraa commented 7 years ago

@hma02 Hi, I have the experiment running with the current results below. Can you suggest any improvements to achieve better accuracy and lower training error?

```
('training cost:', array(4.295770168304443, dtype=float32)) ('training error rate:', array(0.8046875))
('training @ iter = ', 8165) ('training cost:', array(4.224380016326904, dtype=float32)) ('training error rate:', array(0.8125))
('training @ iter = ', 8170) ('training cost:', array(4.512507438659668, dtype=float32)) ('training error rate:', array(0.90234375))
('training @ iter = ', 8175) ('training cost:', array(4.5337233543396, dtype=float32)) ('training error rate:', array(0.8515625))
('training @ iter = ', 8180) ('training cost:', array(4.498597145080566, dtype=float32)) ('training error rate:', array(0.82421875))
('training @ iter = ', 8185) ('training cost:', array(4.465353012084961, dtype=float32)) ('training error rate:', array(0.84375))
('training @ iter = ', 8190) ('training cost:', array(4.593122482299805, dtype=float32)) ('training error rate:', array(0.82421875))
```

hma02 commented 7 years ago

@aryanbhardwaj ,

Your training cost looks okay so far. Are you training on ImageNet data? If you follow the preprocess steps in this project, you will see 5004 batch files of batch size 256 for single GPU training. That means one epoch will take 5004 iterations. The hyperparams in config.yaml are already the optimized values found so far. That means you need to train for 60 epochs or 60*5004 iterations in total.
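The arithmetic, for concreteness (numbers taken from the comment above):

```python
batches_per_epoch = 5004   # hkl batch files produced by preprocessing (batch size 256)
epochs = 60
print(epochs * batches_per_epoch)  # 300240 iterations in total
```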

Magotraa commented 7 years ago

@hma02 Thank you for the quick reply. Yes, you are correct, though my number of batch files is slightly different. However, why are there two training data folders, train_hkl_b256_b_128 and train_hkl_b256_b_256? Is there a specific reason for the size-128 folder?

hma02 commented 7 years ago

@aryanbhardwaj This preprocessing setup is for doing multi-GPU training. Specifically, single GPU trains with batch_size=256, two GPUs train with batch_size=128 on each GPU, and 4 GPUs will train with batch_size=64 on each GPU...etc. This is to preserve the effective batch size (n_GPUs*batch_size) when scaling to multiple GPUs.
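In other words, the per-GPU batch size is the effective batch size divided by the number of GPUs:

```python
effective_batch = 256
for n_gpus in (1, 2, 4):
    print(n_gpus, 'GPU(s) ->', effective_batch // n_gpus, 'images per GPU')
# 1 GPU(s) -> 256 images per GPU
# 2 GPU(s) -> 128 images per GPU
# 4 GPU(s) -> 64 images per GPU
```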

Magotraa commented 7 years ago

@hma02 Thank you for this insight. Just wondering how long the training should take to complete. Also, do you know a way to understand the weights better, i.e., to read the weight and bias values and make sense of them?

I mean visualizing hidden-layer weight and bias values, reading them with some tool, or perhaps a text or reference that explains hidden-layer weights and biases in detail.

Magotraa commented 7 years ago

@hma02 Is there a specific naming pattern used for the weights of the different layers of the network? Any pointers for my understanding?

Also, could you share some insight on the use of "group" in the convolution layers?

thank you in advance.

hma02 commented 7 years ago

@aryanbhardwaj

We benchmarked training speed on a GTX 1080 and a Tesla K80. The GTX 1080 takes 0.91 h per epoch; the Tesla K80 takes 1.96 h per epoch. With 60 epochs in total, that is around 54 h on the GTX 1080 and around 120 h on the Tesla K80.
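The totals follow directly from the per-epoch timings:

```python
hours_per_epoch = {'GTX 1080': 0.91, 'Tesla K80': 1.96}
epochs = 60
for gpu, h in hours_per_epoch.items():
    print('%s: %.1f h total' % (gpu, h * epochs))
# GTX 1080: 54.6 h total
# Tesla K80: 117.6 h total
```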

We didn't experiment on visualizing weights. You can simply read those weight files using numpy.load().

To visualize the activation like here, you can construct another theano function to output the self.output of each layer and plot them using imshow from matplotlib.

The naming pattern of the saved weights is defined in this function: basically "layer_index" + "epoch". Some weights have a number following W or b (W0/W1, b0/b1) because they come from AlexNet's grouped convolution layers; inside those layers there are two parallel sub-convolutions, each with its own weight.
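A hedged sketch of inspecting a saved weight file with `numpy.load()` — the filename and the conv1 shape here are illustrative only (I fabricate the file first so the snippet is self-contained; check your own weights directory for the real names and shapes):

```python
import numpy as np

# simulate one saved weight file; real files like 'W_0_60.npy' come from train.py
W = (np.random.RandomState(0).randn(96, 3, 11, 11) * 0.01).astype('float32')
np.save('W_0_60.npy', W)

W_loaded = np.load('W_0_60.npy')
print(W_loaded.shape, W_loaded.dtype)  # (96, 3, 11, 11) float32
print('mean %.4f  std %.4f' % (W_loaded.mean(), W_loaded.std()))
```

Looking at the mean/std of each layer's weights over epochs is a cheap way to spot exploding or dead layers.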

Magotraa commented 7 years ago

@hma02 I am able to train the alexnet now, thank you for all the suggestions.

Now I am trying to train on ImageNet using my own network, but the training and validation error do not improve at all.

Any suggestions?

```
('training @ iter = ', 61040) ('training cost:', array(6.920103549957275, dtype=float32)) ('training error rate:', array(0.9921875))
('training @ iter = ', 61045) ('training cost:', array(6.905889511108398, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61050) ('training cost:', array(6.9157304763793945, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61055) ('training cost:', array(6.915121078491211, dtype=float32)) ('training error rate:', array(0.9921875))
('training @ iter = ', 61060) ('training cost:', array(6.9073486328125, dtype=float32)) ('training error rate:', array(1.0))
...
('training @ iter = ', 61200) ('training cost:', array(6.90727424621582, dtype=float32)) ('training error rate:', array(0.9921875))
('training @ iter = ', 61205) ('training cost:', array(6.905116558074951, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61210) ('training cost:', array(6.899809837341309, dtype=float32)) ('training error rate:', array(1.0))
```

Magotraa commented 7 years ago

@hma02 If possible, please suggest something on the above-mentioned issue. Also, is there any relation between the depth of the network and the learning rate?

gwding commented 7 years ago

@aryanbhardwaj Usually you can try small learning rates until you see some training progress on the training data (if the training loss does not decrease at all, there is usually a bug, possibly in the data pipeline), and then try a larger learning rate to learn faster.

hma02 commented 7 years ago

@aryanbhardwaj Yes, the data pipeline would be the first thing to check: verify that your training data matches the training labels. The cost not decreasing could also be due to a bad network initialization. For example, try tweaking the mean and std of your Gaussian initializer, or follow one of the standard ways of initializing weights like here.

You can also monitor the gradient flow along training to see if the gradient is in a reasonable magnitude (e.g. 1e-1 to 1e-3). Try constructing a theano function that outputs self.grads.
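As a numpy stand-in for the Theano graph (toy random data, shapes matching the fc7/softmax layer above; only the order of magnitude of the result matters), the gradient of the softmax layer can be checked like this:

```python
import numpy as np

rng = np.random.RandomState(0)
batch, n_in, n_out = 256, 4096, 1000
x = rng.randn(batch, n_in).astype('float32')           # fc7 activations
W = (rng.randn(n_in, n_out) * 0.01).astype('float32')  # softmax weights

# stable softmax over the logits
logits = x.dot(W)
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)

# cross-entropy gradient w.r.t. logits, then w.r.t. W
y = rng.randint(0, n_out, size=batch)
p[np.arange(batch), y] -= 1.0
grad_W = x.T.dot(p) / batch

print('mean |grad|: %.1e' % np.abs(grad_W).mean())
```

If the magnitude printed for your real gradients is far outside roughly 1e-1 to 1e-3, revisit the learning rate or the initialization.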

Magotraa commented 7 years ago

@gwding and @hma02 Thank you, I will try to find the solution on these directions.

Magotraa commented 7 years ago

@hma02 and @gwding

I want to thank you for your suggestions; they were helpful. I am currently trying to test the results using actual images from Google, to see if the learned weights can label those images correctly. Is there an existing sample to refer to? Any ideas would be helpful.

Magotraa commented 7 years ago

@hma02 If possible please suggest something on the above-mentioned issue.

hma02 commented 7 years ago

@aryanbhardwaj

Interesting. I haven't tried that yet, but I imagine it would require the object to occupy roughly the same size ratio relative to the image as in the way ImageNet images are gathered.

Then you can do the same preprocessing as in the processing folder, e.g., resizing to 256 by 256 and saving into hkl files in int8.

Finally load those hkl files and crop 227 by 227 patches to feed the network.
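The cropping step above can be sketched as follows (a hypothetical helper of mine, not from the repo — the repo's own loader does its cropping internally):

```python
import numpy as np

def center_crop(img, size=227):
    """Crop a size x size patch from the center of an HxWxC image."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

img = np.zeros((256, 256, 3), dtype=np.uint8)  # one resized 256x256 image
print(center_crop(img).shape)  # (227, 227, 3)
```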
