Open Zackory opened 10 years ago
Unfortunately, I won't be able to fix the CPU counterpart in the upcoming v0.7 release.
CPU counterpart? At the moment I am attempting to train using GPUs.
It is not running on the GPU if it is calling ccv_convnet_encode. Do you have the HAVE_CUDA compilation flag in your config.mk file?
Thanks for that, just needed to correct the path to cuda within the configure file.
I'm currently using the stable version, but I'm now hitting this error when training:
./ccv-stable/bin/image-net --train-list train257PNGLegendLibccv.txt --test-list cv257PNGLegendLibccv.txt --working-dir image-net.sqlite3
- compute mean activity 22500 / 22500
- compute covariance matrix for data augmentation (color gain) 22500 / 22500
image-net: cuda/cwc_convnet.cu:764: int _cwc_convnet_convolutional_forward_propagate_vary(ccv_convnet_layer_t*, int, int, int, float*, float*, CUstream_st* const&, int, int, int): Assertion `cudaGetLastError() == cudaSuccess' failed.
Aborted (core dumped)
What's your GPU card? From the log, it seems it crashed on the first forward pass, which is interesting to me.
I'm running with an NVIDIA Tesla C2075 GPU.
Thanks! Some explanations:
The forward_vary function is an auto-tuning function that finds the optimal kernels to perform convolution. The parameters are checked against the limits of the Kepler arch, and I haven't verified them on the Fermi arch.
There are two possible fixes: 1) I check against the Fermi arch to see what's going on there, but I don't have the machine / card; 2) I handle the kernel "cannot launch" error and ignore the configurations that fail.
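To illustrate option 2, here is a minimal sketch in plain C of an auto-tuning loop that skips configurations that fail to launch instead of asserting on them. This is not libccv's code: `trial_result_t`, `trial_fn_t`, `autotune_pick_best` and `table_trial` are hypothetical names made up for the example, and the CUDA launch and timing are replaced by a stand-in that reads canned results from a table.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Hypothetical result of trying one kernel configuration. In real CUDA code
 * this would come from launching the kernel, checking the launch error,
 * and timing the run. */
typedef struct {
	int launched;     /* 0 if the launch itself failed */
	float elapsed_ms; /* only meaningful when launched != 0 */
} trial_result_t;

typedef trial_result_t (*trial_fn_t)(int config, void* context);

/* Try every configuration, ignore the ones that cannot launch, and return
 * the index of the fastest one. Returns -1 if none could launch. */
int autotune_pick_best(int config_count, trial_fn_t try_config, void* context)
{
	assert(config_count >= 0 && try_config != NULL);
	int best = -1;
	float best_ms = FLT_MAX;
	for (int i = 0; i < config_count; i++) {
		trial_result_t r = try_config(i, context);
		if (!r.launched)
			continue; /* skip this configuration instead of aborting */
		if (r.elapsed_ms < best_ms) {
			best_ms = r.elapsed_ms;
			best = i;
		}
	}
	return best;
}

/* Stand-in trial for illustration only: looks up a canned result. */
trial_result_t table_trial(int config, void* context)
{
	const trial_result_t* table = (const trial_result_t*)context;
	return table[config];
}
```

With a table where configurations 0 and 3 fail to launch and 1 and 2 take 3.5 ms and 1.25 ms, `autotune_pick_best` returns 2. A return value of -1 means nothing could launch at all, which the caller would still have to treat as a hard error.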
Is there any more information I can provide that would be helpful?
I will put up a fix based on 2 to see if that is the issue (unfortunately, that will be in unstable though).
Can you backport this change to see if it fixes this?
Same error with the changes.
image-net: cuda/cwc_convnet.cu:768: int _cwc_convnet_convolutional_forward_propagate_vary(ccv_convnet_layer_t*, int, int, int, float*, float*, CUstream_st* const&, int, int, int): Assertion `error == cudaSuccess' failed.
Hey, @Zackory, sorry, the fact that I don't have a Fermi makes every fix very speculative. Can you add:
printf("cuda error %s\n", cudaGetErrorString(error));
before the assertion so that I know the exact failures that I should avoid? Thanks! It will be pretty hard for me to get a Fermi board to just fix this issue :(
Here is the error:
cuda error invalid device function
Thanks. I am not sure which device function is not available on the Fermi arch; I need to check.
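For context, the CUDA runtime documentation describes "invalid device function" as a device function that does not exist or is not compiled for the proper device architecture. Whether that is what happens here is not confirmed in this thread. As a rough sketch of the second case, the plain C snippet below checks whether any of the architectures a binary was built for can run on a given device, using the rule that a binary built for compute capability X.y only runs on devices of capability X.z with z >= y. The names `compute_cap_t`, `sass_runs_on` and `any_sass_runs_on` are hypothetical, and PTX just-in-time compilation is deliberately ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Compute capability as major.minor, e.g. {2, 0} for a Fermi card such as
 * the Tesla C2075, or {3, 0} for an early Kepler card. */
typedef struct {
	int major;
	int minor;
} compute_cap_t;

/* A binary built for X.y runs on a device X.z only when z >= y.
 * PTX JIT compilation is not modelled here. */
bool sass_runs_on(compute_cap_t built_for, compute_cap_t device)
{
	return built_for.major == device.major && built_for.minor <= device.minor;
}

/* True if at least one of the built architectures can run on the device. */
bool any_sass_runs_on(const compute_cap_t* built, size_t count, compute_cap_t device)
{
	assert(built != NULL || count == 0);
	for (size_t i = 0; i < count; i++)
		if (sass_runs_on(built[i], device))
			return true;
	return false;
}
```

Under these assumptions, a build that only targets Kepler (3.0 and 3.5) has nothing that runs on a 2.0 device, while a build that also targets 2.0 does.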
I've been attempting to train a new classifier using the steps provided in http://libccv.org/doc/doc-convnet/ but I seem to be running into a problem. Every time I run
./image-net
to train a new model I get an error:
I've tried training with both JPG and PNG images, and with partially replacing image-net.c with https://gist.github.com/liuliu/8906523; however, every attempt arrives at the same error.
What would be causing this, and how can it be fixed or worked around?