Magotraa opened this issue 7 years ago (status: Open)
@aryanbhardwaj Could you provide the full Traceback of the error? I just want to see the files that produce this error.
@hma02 Thank you for your reply. I was able to resolve the error by modifying alex_net.py, changing line 26 to y = T.ivector('y').
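For reference, the mismatch is purely a dtype issue between the label array and the symbolic variable; a minimal numpy-only sketch of the same situation (no Theano required):

```python
import numpy as np

# Labels loaded from disk often default to int64 on 64-bit Linux, while a
# Theano T.ivector expects int32 -- that mismatch produces the
# "Cannot convert Type TensorType(int32, vector) into TensorType(int64, vector)" error.
labels = np.array([3, 1, 4, 1, 5])       # dtype is int64 on most 64-bit platforms
labels_i32 = labels.astype(np.int32)     # cast so it matches T.ivector('y')

print(labels_i32.dtype)                  # int32
```

The fix in alex_net.py works the other way around: it declares the symbolic variable with the dtype that the data already has.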
This problem was also mentioned in #32
@hma02 Yes, I did refer to it. Thank you. However, I am still getting this issue. Could you please suggest a solution?
Error:
epoch 56: validation loss nan
epoch 56: validation error nan %
Complete Output is here:
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL: https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
Using gpu device 0: GeForce GTX 1080 (CNMeM is enabled with initial size: 80.0% of memory, cuDNN 5110)
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL: https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.
Process finished with exit code 0
Also:
If para_load: True, then I get this error: LogicError: cuIpcGetMemHandle failed: OS call failed or operation not supported on this OS
This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL: https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.
It would be really helpful if I could get some suggestions or an approximate solution. Thank you in advance.
The "ZMQError: Address in use" error happens when a previous run failed and the socket port it opened was not closed properly, causing a port conflict in the next run. You can find the process holding the port with:
netstat -ltnp
and kill the corresponding process.
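This stale-port failure mode can be reproduced with plain Python sockets (a minimal sketch using the stdlib socket module rather than ZMQ; the OS picks the port):

```python
import socket

# Bind a listener, then try to bind a second socket to the same port:
# the second bind fails the same way a leftover process from a crashed
# run makes the next run fail with "Address in use".
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
s1.listen(1)
port = s1.getsockname()[1]

s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s2.bind(("127.0.0.1", port))   # conflicts with s1's listener
    conflict = False
except OSError:
    conflict = True                # EADDRINUSE, the same failure mode
finally:
    s2.close()
    s1.close()

print("port conflict:", conflict)
```

Killing the process that holds the port (as above) releases it for the next run.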
For the NaN issue, if it happens from the first epoch, it could be caused by the input batch not being fed or preprocessed correctly, or by using too large a learning rate. See issue #27.
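As a concrete check for the first cause, here is a numpy-only batch sanity test; the helper name and the (channels, height, width, batch) shape are hypothetical stand-ins for whatever your loader produces:

```python
import numpy as np

def check_batch(x, y, n_class):
    """Sanity-check one input batch before feeding it to the network.

    Catches the usual causes of an immediate NaN loss: bad values in the
    preprocessed images, an unread/constant batch, and out-of-range labels.
    """
    assert np.isfinite(x).all(), "batch contains NaN/Inf (bad preprocessing?)"
    assert x.std() > 0, "batch is constant -- data probably not loaded"
    assert y.min() >= 0 and y.max() < n_class, "labels out of range"

# Hypothetical batch shaped (channels, height, width, batch_size).
rng = np.random.RandomState(0)
x = rng.randn(3, 227, 227, 8).astype(np.float32)
y = rng.randint(0, 1000, size=8).astype(np.int32)
check_batch(x, y, n_class=1000)
print("batch ok")
```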
@hma02 Thanks for sharing. I am trying the suggested solutions. Is there any solution on Windows 10 for
LogicError: cuIpcGetMemHandle failed: OS call failed or operation not supported on this OS
@hma02 Thank you for your suggestions on the NaN issue; the problem was that the training data could not be read. It is training now. Next, I want to know how to get good accuracy results.
Can you share after how many iterations I should expect reasonable accuracy? Also, could you share an optimized hyperparameters file (config.yaml)?
current status is:
('training error rate:', array(0.984375))
('training @ iter = ', 2765) ('training cost:', array(6.374232292175293, dtype=float32)) ('training error rate:', array(0.99609375))
('training @ iter = ', 2770) ('training cost:', array(6.3500189781188965, dtype=float32)) ('training error rate:', array(0.984375))
('training @ iter = ', 2775) ('training cost:', array(6.216220855712891, dtype=float32)) ('training error rate:', array(0.98828125))
('training @ iter = ', 2780) ('training cost:', array(6.231907844543457, dtype=float32)) ('training error rate:', array(0.98828125))
('training @ iter = ', 2785) ('training cost:', array(6.30079460144043, dtype=float32)) ('training error rate:', array(0.99609375))
@hma02 Hi, I have this experiment running with the current results below. Can you suggest any improvements to achieve better accuracy and lower training error?
('training cost:', array(4.295770168304443, dtype=float32)) ('training error rate:', array(0.8046875))
('training @ iter = ', 8165) ('training cost:', array(4.224380016326904, dtype=float32)) ('training error rate:', array(0.8125))
('training @ iter = ', 8170) ('training cost:', array(4.512507438659668, dtype=float32)) ('training error rate:', array(0.90234375))
('training @ iter = ', 8175) ('training cost:', array(4.5337233543396, dtype=float32)) ('training error rate:', array(0.8515625))
('training @ iter = ', 8180) ('training cost:', array(4.498597145080566, dtype=float32)) ('training error rate:', array(0.82421875))
('training @ iter = ', 8185) ('training cost:', array(4.465353012084961, dtype=float32)) ('training error rate:', array(0.84375))
('training @ iter = ', 8190) ('training cost:', array(4.593122482299805, dtype=float32)) ('training error rate:', array(0.82421875))
@aryanbhardwaj ,
Your training cost looks okay so far. Are you training on ImageNet data? If you follow the preprocess steps in this project, you will see 5004 batch files of batch size 256 for single GPU training. That means one epoch will take 5004 iterations. The hyperparams in config.yaml are already the optimized values found so far. That means you need to train for 60 epochs or 60*5004 iterations in total.
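The arithmetic above, spelled out:

```python
# Back-of-envelope training length, using the numbers from the reply above:
# one epoch = 5004 batch files of batch size 256 (single-GPU), 60 epochs total.
n_batches_per_epoch = 5004
n_epochs = 60
total_iters = n_batches_per_epoch * n_epochs
print(total_iters)  # 300240
```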
@hma02 Thank you for the quick reply. Yes, you are correct, though the number of batch files may be slightly different. However, why do we have two training data folders, _hkl_b256_b_128 and train_hkl_b256_b_256? Is there a specific reason for the size-128 folder?
@aryanbhardwaj This preprocessing setup is for doing multi-GPU training. Specifically, single GPU trains with batch_size=256, two GPUs train with batch_size=128 on each GPU, and 4 GPUs will train with batch_size=64 on each GPU...etc. This is to preserve the effective batch size (n_GPUs*batch_size) when scaling to multiple GPUs.
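The folder naming follows directly from this rule; a tiny sketch of the per-GPU batch size, with the constant taken from the discussion above:

```python
# Effective batch size (n_GPUs * per-GPU batch) is held at 256 when scaling out.
TOTAL_BATCH = 256  # single-GPU batch size, as described above

def per_gpu_batch(n_gpus, total=TOTAL_BATCH):
    # Each GPU processes total / n_gpus samples per step.
    assert total % n_gpus == 0
    return total // n_gpus

for n in (1, 2, 4):
    print(n, "GPU(s):", per_gpu_batch(n), "samples per GPU")
# 1 -> 256, 2 -> 128 (the *_b_128 folder), 4 -> 64
```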
@hma02 Thank you for this insight. Just wondering how long it should take to complete the training. Also, do you know a way to understand the weights better, i.e., to read the weight and bias values and interpret them?
I mean visualizing hidden-layer weight and bias values, reading the values with some tool, or perhaps a text or reference to learn about hidden-layer weights and biases in detail.
@hma02 Is there a specific naming pattern used for the weights of the different layers of the network? Any suggestions for my understanding?
Also, could you share some insight on the use of "group" in the convolution layers?
Thank you in advance.
@aryanbhardwaj
We benchmarked training speed on GTX 1080 and Tesla K80: the GTX 1080 takes 0.91h per epoch and the Tesla K80 takes 1.96h per epoch. With 60 epochs in total, that is around 54h on the GTX 1080 and around 120h on the Tesla K80.
We didn't experiment on visualizing weights. You can simply read those weight files using numpy.load().
To visualize the activation like here, you can construct another theano function to output the self.output of each layer and plot them using imshow from matplotlib.
The naming pattern of saved weights is defined in this function, basically just "layer_index" + "epoch". Some weights have a number following W or b, like W0/b0 and W1/b1, because they come from AlexNet's grouped convolution layers. Inside those layers there are two parallel sub-convolutions, each with its own weights.
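To make this concrete, here is a numpy-only sketch that writes and reads back a W0/W1 pair; the file names and shapes below are hypothetical stand-ins for what the repo's save function actually produces:

```python
import os
import tempfile
import numpy as np

tmp = tempfile.mkdtemp()
rng = np.random.RandomState(0)

# A grouped convolution layer stores two parallel weight tensors, W0 and W1.
# Hypothetical names roughly following the "layer_index" + "epoch" pattern.
for name in ("W0_1_65", "W1_1_65"):
    np.save(os.path.join(tmp, name + ".npy"), rng.randn(48, 5, 5, 128).astype(np.float32))

# Inspect each saved weight file with numpy.load(), as suggested above.
for fname in sorted(os.listdir(tmp)):
    w = np.load(os.path.join(tmp, fname))
    print(fname, w.shape, float(w.mean()))
```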
@hma02 I am able to train the alexnet now, thank you for all the suggestions.
Now I am trying to train my own network on ImageNet, but the training error and validation error do not improve at all.
Any suggestions?
('training @ iter = ', 61040) ('training cost:', array(6.920103549957275, dtype=float32)) ('training error rate:', array(0.9921875))
('training @ iter = ', 61045) ('training cost:', array(6.905889511108398, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61050) ('training cost:', array(6.9157304763793945, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61055) ('training cost:', array(6.915121078491211, dtype=float32)) ('training error rate:', array(0.9921875))
('training @ iter = ', 61060) ('training cost:', array(6.9073486328125, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61065) ('training cost:', array(6.910022735595703, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61070) ('training cost:', array(6.898440361022949, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61075) ('training cost:', array(6.900564193725586, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61080) ('training cost:', array(6.9025468826293945, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61085) ('training cost:', array(6.906184196472168, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61090) ('training cost:', array(6.913963317871094, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61095) ('training cost:', array(6.90643310546875, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61100) ('training cost:', array(6.9034423828125, dtype=float32)) ('training error rate:', array(0.9921875))
('training @ iter = ', 61105) ('training cost:', array(6.9006123542785645, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61110) ('training cost:', array(6.908158302307129, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61115) ('training cost:', array(6.901939392089844, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61120) ('training cost:', array(6.902793884277344, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61125) ('training cost:', array(6.899314880371094, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61130) ('training cost:', array(6.9046478271484375, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61135) ('training cost:', array(6.907194137573242, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61140) ('training cost:', array(6.91206169128418, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61145) ('training cost:', array(6.901838302612305, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61150) ('training cost:', array(6.904903411865234, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61155) ('training cost:', array(6.90507698059082, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61160) ('training cost:', array(6.911441802978516, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61165) ('training cost:', array(6.907763957977295, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61170) ('training cost:', array(6.909838676452637, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61175) ('training cost:', array(6.905656814575195, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61180) ('training cost:', array(6.905083179473877, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61185) ('training cost:', array(6.907958984375, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61190) ('training cost:', array(6.904727935791016, dtype=float32)) ('training error rate:', array(0.9921875))
('training @ iter = ', 61195) ('training cost:', array(6.9050397872924805, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61200) ('training cost:', array(6.90727424621582, dtype=float32)) ('training error rate:', array(0.9921875))
('training @ iter = ', 61205) ('training cost:', array(6.905116558074951, dtype=float32)) ('training error rate:', array(1.0))
('training @ iter = ', 61210) ('training cost:', array(6.899809837341309, dtype=float32)) ('training error rate:', array(1.0))
@hma02 If possible, please suggest something on the above-mentioned issue. Also, is there any relation between the depth of the network and the learning rate?
@aryanbhardwaj Usually you can try small learning rates until you see some training progress on the training data (if you don't see the training loss decrease at all, there is usually a bug, perhaps in the data pipeline), and then try a larger learning rate to learn faster.
@aryanbhardwaj Yes, the data pipeline would be the first thing to check. Verify that your training data matches the training labels. The cost-not-decreasing issue could also be due to a bad network initialization. For example, try tweaking the mean and std of your Gaussian initializer. You can follow some of the standard ways of initializing weights, like here.
You can also monitor the gradient flow during training to see whether the gradients have a reasonable magnitude (e.g. 1e-1 to 1e-3). Try constructing a theano function that outputs self.grads.
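A numpy sketch of such gradient monitoring; the arrays below are hypothetical stand-ins for what a theano function returning self.grads would output:

```python
import numpy as np

def grad_report(grads):
    """Summarize gradient magnitudes as RMS per parameter; values far
    outside roughly 1e-3..1e-1 hint at vanishing or exploding gradients."""
    return {name: float(np.sqrt((g ** 2).mean())) for name, g in grads.items()}

# Hypothetical gradients for a first conv layer's weights and biases.
rng = np.random.RandomState(0)
grads = {"W0": rng.randn(96, 3, 11, 11) * 1e-2, "b0": rng.randn(96) * 1e-3}
for name, rms in grad_report(grads).items():
    print(name, "%.2e" % rms)
```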
@gwding and @hma02 Thank you, I will try to find a solution in these directions.
@hma02 and @gwding
I want to thank you for your suggestions; they were helpful. I am currently trying to test the results on real images from Google, to see whether the learned weights can label those images correctly. Is there any existing sample to refer to? Or please suggest any ideas that may be helpful.
@hma02 If possible please suggest something on the above-mentioned issue.
@aryanbhardwaj
Interesting. I haven't tried that yet. But I imagine it would require the object to occupy a certain size ratio with respect to the image, similar to the way the ImageNet images were gathered.
Then you can do the same preprocessing as in the processing folder, e.g., resizing to 256 by 256 and saving into hkl files in int8.
Finally load those hkl files and crop 227 by 227 patches to feed the network.
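The crop step can be sketched in numpy alone; a center crop is shown here (training typically takes random 227x227 patches instead), and the image is a hypothetical stand-in for one entry of an hkl batch:

```python
import numpy as np

def center_crop(img, size=227):
    """Crop a size x size patch from the image center: the 256 -> 227
    crop described above, applied to an (h, w, channels) array."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

# Hypothetical 256x256 uint8 image, like one preprocessed ImageNet sample.
rng = np.random.RandomState(0)
img = rng.randint(0, 256, size=(256, 256, 3)).astype(np.uint8)
patch = center_crop(img)
print(patch.shape)  # (227, 227, 3)
```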
Hi, thank you for the repository. I have installed the requirements and started the process as described, and I could prepare the preprocessed data. However, when I execute Train.py, I get the error: TypeError: Cannot convert Type TensorType(int32, vector) (of Variable <TensorType(int32, vector)>) into Type TensorType(int64, vector). You can try to manually convert <TensorType(int32, vector)> into a TensorType(int64, vector).