@janchk Oh sorry, this issue should have been reported here instead: https://github.com/naibaf7/caffe, since the standalone LibDNN is not the same code and isn't affected. But OK, now that it's here, let's leave it that way; just so you know for next time.
I see that you are using the VGG-19 model, which contains average pooling:
layer {
  bottom: "conv4_4"
  top: "pool4"
  name: "pool4"
  type: "Pooling"
  pooling_param {
    pool: AVE
    kernel_size: 2
    stride: 2
  }
}
I just fixed average pooling in OpenCL-Caffe's LibDNN yesterday, so if you update, recompile, and try again today, it should work fine :) You can confirm this by running "runtest" again; the average-pooling error should disappear. LibDNN is quite new, especially the pooling feature, so some bugs can occur. However, convolutions as well as average and max pooling are now strictly unit-tested against Caffe's engine (100 random configuration samples per runtest).
Related commit: https://github.com/naibaf7/caffe/commit/6ab003a2ae7fb53d1db777d7cafd6ff8fd3e9c93
This may also be related: https://github.com/naibaf7/caffe/commit/2f4c6b6bc426107dd2458fe1c16daa2329c012d7
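To illustrate what such a cross-check does, here is a minimal, self-contained C++ sketch in the spirit of the runtest (all names like ave_pool_2x2 are hypothetical; this is not the actual Caffe test code): a naive reference implementation of the 2x2/stride-2 average pooling from the layer above, fuzzed against a second, independently written variant on 100 random inputs.

#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Reference: 2x2 average pooling with stride 2, no padding, one channel.
// Input is H x W in row-major order; output is (H/2) x (W/2).
std::vector<float> ave_pool_2x2(const std::vector<float>& in,
                                std::size_t H, std::size_t W) {
  std::size_t oh = H / 2, ow = W / 2;
  std::vector<float> out(oh * ow);
  for (std::size_t y = 0; y < oh; ++y)
    for (std::size_t x = 0; x < ow; ++x)
      out[y * ow + x] = (in[(2 * y) * W + 2 * x] +
                         in[(2 * y) * W + 2 * x + 1] +
                         in[(2 * y + 1) * W + 2 * x] +
                         in[(2 * y + 1) * W + 2 * x + 1]) / 4.0f;
  return out;
}

// Independent variant: average each row pair first, then each column pair.
std::vector<float> ave_pool_separable(const std::vector<float>& in,
                                      std::size_t H, std::size_t W) {
  std::size_t oh = H / 2, ow = W / 2;
  std::vector<float> rows(oh * W), out(oh * ow);
  for (std::size_t y = 0; y < oh; ++y)
    for (std::size_t x = 0; x < W; ++x)
      rows[y * W + x] = (in[(2 * y) * W + x] + in[(2 * y + 1) * W + x]) / 2.0f;
  for (std::size_t y = 0; y < oh; ++y)
    for (std::size_t x = 0; x < ow; ++x)
      out[y * ow + x] = (rows[y * W + 2 * x] + rows[y * W + 2 * x + 1]) / 2.0f;
  return out;
}

int main() {
  std::mt19937 rng(42);
  std::uniform_int_distribution<std::size_t> dim(1, 64);
  std::uniform_real_distribution<float> val(-1.0f, 1.0f);
  for (int t = 0; t < 100; ++t) {  // 100 random samples, as in the runtest
    std::size_t H = dim(rng) * 2, W = dim(rng) * 2;
    std::vector<float> in(H * W);
    for (float& v : in) v = val(rng);
    std::vector<float> a = ave_pool_2x2(in, H, W);
    std::vector<float> b = ave_pool_separable(in, H, W);
    for (std::size_t i = 0; i < a.size(); ++i)
      assert(std::fabs(a[i] - b[i]) < 1e-5f);  // engines must agree
  }
  return 0;
}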
@bhack See, this is why those features are not yet in the standalone LibDNN: you will only get them after they've been thoroughly tested against bugs in the Caffe environment :) They should be ready soon!
@naibaf7 Having out-of-sync sources for objects with the same name (libdnn) IMHO doesn't help much. But I can't find any quick solution that would let Caffe use the standalone version as upstream, because the current separation of responsibilities requires code duplication for device setup and program handling.
@janchk OK, there seems to be one more issue not captured by the unit tests so far. I'll have to take another look.
@janchk OK, it seems the style transfer needs to reshape the network dimensions. This wasn't accounted for in the LibDNN layers yet, so I had to add functionality that recompiles the LibDNN kernel if the shapes really change during a call to Reshape: https://github.com/naibaf7/caffe/commit/c3a9f277fe526bc3bdb43c921946086049e72a35 This should fix it, but I haven't tested it myself yet. It would be good if you could report back the performance numbers (how long it took) and whether the output looks correct now.
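The mechanism in that commit, as a hedged sketch (the class and method names here are hypothetical, not the actual LibDNN code): the layer remembers which shape its kernel was compiled for and regenerates it only when Reshape sees a genuinely different shape.

#include <vector>

// Hypothetical sketch of shape-change detection in a LibDNN-backed layer.
class LibDNNPoolLayerSketch {
 public:
  void Reshape(const std::vector<int>& bottom_shape) {
    // Reshape runs on every forward pass, so recompiling unconditionally
    // would be far too expensive. Only rebuild the kernel when the shape
    // actually differs from the one it was compiled for.
    if (bottom_shape != compiled_shape_) {
      CompileKernel(bottom_shape);
      compiled_shape_ = bottom_shape;
    }
  }

 private:
  void CompileKernel(const std::vector<int>& shape) {
    // Here LibDNN would generate and build device source specialized
    // for this exact shape (omitted in this sketch).
    (void)shape;
  }
  std::vector<int> compiled_shape_;
};

int main() {
  LibDNNPoolLayerSketch layer;
  layer.Reshape({1, 512, 28, 28});  // VGG-19's native pool4 input: compile
  layer.Reshape({1, 512, 28, 28});  // same shape: no recompilation
  layer.Reshape({1, 512, 64, 48});  // larger style-transfer image: recompile
  return 0;
}

This also explains why the first iteration after a resize would be slow: the kernel build happens lazily inside Reshape.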
@naibaf7
It takes twice as long as the previous version (it hasn't finished yet, so I can't check whether the output is correct).
This is my make all log, in case it matters:
caffe_libdnn_mkall.txt
@janchk It makes sense that it takes longer, because the kernels are now actually compiled for the reshaped sizes. Before, it computed at the original VGG-19 size, which was smaller than the image. I tested it myself now and got 22 minutes on a W9100/R9 290X (LibDNN) and 12 minutes on a GTX 1080 (cuDNN). I see your GPU is clocked at only 400 MHz. Is it possible to set the clock to the full 825 MHz? Or is this fluctuating? Otherwise this seems about right, given that your GPU has 0.6 TFLOPs (W9100: about 5 TFLOPs, GTX 1080: about 10 TFLOPs).
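Those figures are just peak-throughput arithmetic: FP32 peak ≈ shader count × 2 FLOPs per cycle (fused multiply-add) × clock. A back-of-the-envelope sketch (the 384-shader count for the 8750M is my assumption from public spec sheets, and the runtime scaling deliberately ignores efficiency differences between cards):

#include <cstdio>

int main() {
  // Peak FP32 throughput = shaders * 2 FLOPs/cycle (FMA) * clock (GHz).
  const double shaders = 384.0;        // assumed for the HD 8750M
  const double full_ghz = 0.825;       // advertised clock
  const double throttled_ghz = 0.400;  // observed clock under load
  double peak_tflops = shaders * 2.0 * full_ghz / 1000.0;            // ~0.63
  double throttled_tflops = shaders * 2.0 * throttled_ghz / 1000.0;  // ~0.31
  std::printf("full clock: %.2f TFLOPs, throttled: %.2f TFLOPs\n",
              peak_tflops, throttled_tflops);
  // Naive scaling from the W9100's 22-minute run (~5 TFLOPs):
  double est_minutes = 22.0 * 5.0 / throttled_tflops;  // roughly 350 min
  std::printf("estimated runtime at 400 MHz: ~%.0f minutes\n", est_minutes);
  return 0;
}

At the throttled clock, that estimate lands in the range of hours, which would be consistent with the run above not having finished yet.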
@naibaf7 No fluctuation; it's stuck at 400 MHz under load. I have no experience with overclocking mobile GPUs (especially on Linux), and it seems to be unstable with a laptop's cooling. By the way, it's an 8750M. Anyway, after your commit it works as expected, and it's still better than the original engine. Thank you very much, and good luck with the upcoming development!
@janchk OK cool, so I'll close this case. Yes, the 8750M has 0.6 TFLOPs of performance. That's pretty decent for a notebook card, but obviously not the fastest if you're interested in deep learning. Still, it's probably 3-5 times faster than the laptop's CPU...
Hello again. I've run into a problem using LibDNN instead of the Caffe engine when trying to run https://github.com/fzliu/style-transfer.
With the Caffe engine I got this (as planned), and this make runtest log: log_runtest_caffeengn.txt
With LibDNN I got this (strange), and this make runtest log: log_runtest_libDNN.txt
I have no idea what the reason for this glitch is.