naibaf7 / caffe

Caffe: a fast open framework for deep learning. With OpenCL and CUDA support.
http://caffe.berkeleyvision.org/

Intel Beignet spatial convolution OpenCL compile failure #39

Open naibaf7 opened 8 years ago

naibaf7 commented 8 years ago

@gongzg Totally stuck here; I've spent hours trying to find the cause. Any ideas?

Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
(message repeated 5 times)
*** Aborted at 1469222493 (unix time) try "date -d @1469222493" if you are using GNU date ***
PC: @     0x7fb9004d78b6 llvm::BasicBlock::getTerminator()
*** SIGSEGV (@0x30000024c) received by PID 3242 (TID 0x7fb916044a40) from PID 588; stack trace: ***
    @     0x7fb90fcffc30 (unknown)
    @     0x7fb9004d78b6 llvm::BasicBlock::getTerminator()
    @     0x7fb900425b05 llvm::LoopBase<>::getExitBlocks()
    @     0x7fb900427c65 llvm::Loop::hasDedicatedExits()
    @     0x7fb900427e16 llvm::Loop::getLoopID()
    @     0x7fb8ffde530c gbe::CustomLoopUnroll::GetUnrollMetadataValue()
    @     0x7fb8ffde5e2a gbe::CustomLoopUnroll::runOnLoop()
    @     0x7fb90043100b llvm::LPPassManager::runOnFunction()
    @     0x7fb9005e2528 llvm::FPPassManager::runOnFunction()
    @     0x7fb9003be3d7 (anonymous namespace)::CGPassManager::runOnModule()
    @     0x7fb9005e2bdd llvm::legacy::PassManagerImpl::run()
    @     0x7fb8ffde02fb gbe::runModulePass()
    @     0x7fb8ffde088e gbe::llvmToGen()
    @     0x7fb8ffd38023 gbe::Program::buildFromLLVMFile()
    @     0x7fb8fff3f8b9 gbe::genProgramNewFromLLVM()
    @     0x7fb8ffd3c7b5 gbe::programNewFromSource()
    @     0x7fb900f76509 cl_program_build
    @     0x7fb900f69d98 clBuildProgram
    @     0x7fb915a8524a viennacl::ocl::context::add_program()
    @     0x7fb915a82830 caffe::submit_conv_spatial_program()
    @     0x7fb915b55f28 caffe::ConvolutionLayerSpatial<>::setup_IDLF()
    @     0x7fb915b56a75 caffe::ConvolutionLayerSpatial<>::setup_convolution()
    @     0x7fb915b58a01 caffe::ConvolutionLayerSpatial<>::Forward_gpu()
    @     0x7fb915a51e82 caffe::Net<>::ForwardFromTo()
    @     0x7fb915a51f77 caffe::Net<>::Forward()
    @           0x414d93 time()
    @           0x40ea4e main
    @     0x7fb90f94d731 __libc_start_main
    @           0x40f389 _start
Segmentation fault (core dumped)
naibaf7 commented 8 years ago

@gongzg It may be worth noting that ./build/test/test_all.testbin --gtest_filter=*Spatial* 1 passes without errors, but it fails on the actual AlexNet, which would be the interesting benchmark.

gongzg commented 8 years ago

@naibaf7 The error message indicates this is an LLVM-related issue. I would suggest switching to LLVM 3.6 to try. If you still have any issues, please let me know.

gongzg commented 8 years ago

@naibaf7 Another quick thing to try: open the file backend/src/llvm/llvm_to_gen.cpp, find the following code, and comment out the "MPM.add(createCustomLoopUnrollPass());" line. Then try again with your current LLVM version. This is not recommended, though; I doubt beignet has been tested with this LLVM version, and I don't know whether there are other issues. Anyway, for your reference.

#if !defined(ANDROID)

  MPM.add(createCustomLoopUnrollPass()); //1024, 32, 1024, 512)); //Unroll loops

#endif

naibaf7 commented 8 years ago

@gongzg OK, good to know. It's interesting, though, that some of the code (for example the tests) does work. I can't downgrade to llvm-3.6 or llvm-3.7 on Fedora 24 without breaking the graphics driver (X11 won't start anymore for some reason), so I'll try the "dirty fix" instead and hope beignet development catches up soon.

naibaf7 commented 8 years ago

@gongzg OK, nice, I finally got it working. But the performance doesn't quite make sense to me. Do you know what could have gone wrong?

Command: ./build/tools/caffe time -model models/bvlc_alexnet/benchmark64.prototxt -gpu=1 -iterations=5

Results (average forward pass time):

gongzg commented 8 years ago

@naibaf7 Did you check the per-layer performance breakdown? I used to see very bad GEMM performance with either the ISAAC or ViennaCL BLAS, with most of the time spent in the convolution backward pass and the FC layers. Also, I see you specified GPU device 1; do you have more than one OpenCL device in your system?

naibaf7 commented 8 years ago

@gongzg Yes, but apart from the forward convolution, all variants use the same kernels on all other layers. And yes, device 0 is the Intel CPU on my system.

I0723 16:36:23.906051 30903 common.cpp:373] Total devices: 2
I0723 16:36:23.906175 30903 common.cpp:374] CUDA devices: 0
I0723 16:36:23.906180 30903 common.cpp:375] OpenCL devices: 2
I0723 16:36:23.906185 30903 common.cpp:399] Device id:                     0
I0723 16:36:23.906189 30903 common.cpp:401] Device backend:                OpenCL
I0723 16:36:23.906213 30903 common.cpp:403] Backend details:               Intel(R) Corporation: OpenCL 1.2 LINUX
I0723 16:36:23.906244 30903 common.cpp:405] Device vendor:                 Intel(R) Corporation
I0723 16:36:23.906265 30903 common.cpp:407] Name:                          Intel(R) Core(TM) i7-6560U CPU @ 2.20GHz
I0723 16:36:23.906518 30903 common.cpp:409] Total global memory:           7917195264
I0723 16:36:23.906525 30903 common.cpp:399] Device id:                     1
I0723 16:36:23.906529 30903 common.cpp:401] Device backend:                OpenCL
I0723 16:36:23.906535 30903 common.cpp:403] Backend details:               Intel: OpenCL 1.2 beignet 1.2 (git-b55060c)
I0723 16:36:23.906539 30903 common.cpp:405] Device vendor:                 Intel
I0723 16:36:23.906543 30903 common.cpp:407] Name:                          Intel(R) HD Graphics Skylake ULT GT3
I0723 16:36:23.906565 30903 common.cpp:409] Total global memory:           3958374400
gongzg commented 8 years ago

@naibaf7 Could you share the average forward time and backward time? The backward time is really slow. For example, on my BDW GT2 machine, what I get from benchmark64.prototxt is: average forward pass: 832 ms, average backward pass: 3834 ms.

I believe libDNN engine should be much faster at backward pass.

gongzg commented 8 years ago

@naibaf7 I just ran a test on a BDW GT3e machine and got the following performance numbers with the spatial convolution engine: average forward time: 277.8 ms, average backward time: 1139 ms. That GPU should be very close to yours, but I'm using the OpenCL SDK. I will find a SKL GT3 machine next week and use beignet to run some tests.

naibaf7 commented 8 years ago

@gongzg That's interesting; so the BDW GT2 is faster than the Skylake? Which engine did you use for the numbers you posted? ViennaCL BLAS per layer:

I0723 16:53:57.989781   658 caffe.cpp:450] Average time per layer: 
I0723 16:53:57.989796   658 caffe.cpp:453]       data   forward: 0.100127 ms.
I0723 16:53:57.989814   658 caffe.cpp:456]       data   backward: 0.097255 ms.
I0723 16:53:57.989830   658 caffe.cpp:453]      label   forward: 0.0986646 ms.
I0723 16:53:57.989845   658 caffe.cpp:456]      label   backward: 0.116754 ms.
I0723 16:53:57.989859   658 caffe.cpp:453]      conv1   forward: 197.391 ms.
I0723 16:53:57.989874   658 caffe.cpp:456]      conv1   backward: 339.864 ms.
I0723 16:53:57.989889   658 caffe.cpp:453]      relu1   forward: 7.03235 ms.
I0723 16:53:57.989902   658 caffe.cpp:456]      relu1   backward: 13.2872 ms.
I0723 16:53:57.989917   658 caffe.cpp:453]      norm1   forward: 28.9102 ms.
I0723 16:53:57.989931   658 caffe.cpp:456]      norm1   backward: 40.849 ms.
I0723 16:53:57.989946   658 caffe.cpp:453]      pool1   forward: 6.65354 ms.
I0723 16:53:57.989959   658 caffe.cpp:456]      pool1   backward: 16.7646 ms.
I0723 16:53:57.989974   658 caffe.cpp:453]      conv2   forward: 447.242 ms.
I0723 16:53:57.989987   658 caffe.cpp:456]      conv2   backward: 833.264 ms.
I0723 16:53:57.990001   658 caffe.cpp:453]      relu2   forward: 3.9086 ms.
I0723 16:53:57.990015   658 caffe.cpp:456]      relu2   backward: 7.75015 ms.
I0723 16:53:57.990030   658 caffe.cpp:453]      norm2   forward: 17.3098 ms.
I0723 16:53:57.990042   658 caffe.cpp:456]      norm2   backward: 23.2308 ms.
I0723 16:53:57.990057   658 caffe.cpp:453]      pool2   forward: 3.84905 ms.
I0723 16:53:57.990070   658 caffe.cpp:456]      pool2   backward: 10.9517 ms.
I0723 16:53:57.990084   658 caffe.cpp:453]      conv3   forward: 314.533 ms.
I0723 16:53:57.990099   658 caffe.cpp:456]      conv3   backward: 488.007 ms.
I0723 16:53:57.990113   658 caffe.cpp:453]      relu3   forward: 1.30731 ms.
I0723 16:53:57.990128   658 caffe.cpp:456]      relu3   backward: 2.15554 ms.
I0723 16:53:57.990159   658 caffe.cpp:453]      conv4   forward: 317.564 ms.
I0723 16:53:57.990175   658 caffe.cpp:456]      conv4   backward: 412.875 ms.
I0723 16:53:57.990190   658 caffe.cpp:453]      relu4   forward: 1.27252 ms.
I0723 16:53:57.990203   658 caffe.cpp:456]      relu4   backward: 1.99165 ms.
I0723 16:53:57.990216   658 caffe.cpp:453]      conv5   forward: 291.408 ms.
I0723 16:53:57.990231   658 caffe.cpp:456]      conv5   backward: 303.204 ms.
I0723 16:53:57.990242   658 caffe.cpp:453]      relu5   forward: 0.952221 ms.
I0723 16:53:57.990252   658 caffe.cpp:456]      relu5   backward: 1.61467 ms.
I0723 16:53:57.990262   658 caffe.cpp:453]      pool5   forward: 1.00883 ms.
I0723 16:53:57.990278   658 caffe.cpp:456]      pool5   backward: 2.92136 ms.
I0723 16:53:57.990291   658 caffe.cpp:453]        fc6   forward: 45.956 ms.
I0723 16:53:57.990309   658 caffe.cpp:456]        fc6   backward: 115.107 ms.
I0723 16:53:57.990329   658 caffe.cpp:453]      relu6   forward: 0.345521 ms.
I0723 16:53:57.990345   658 caffe.cpp:456]      relu6   backward: 0.37321 ms.
I0723 16:53:57.990361   658 caffe.cpp:453]      drop6   forward: 5.01316 ms.
I0723 16:53:57.990378   658 caffe.cpp:456]      drop6   backward: 0.438744 ms.
I0723 16:53:57.990397   658 caffe.cpp:453]        fc7   forward: 20.8769 ms.
I0723 16:53:57.990417   658 caffe.cpp:456]        fc7   backward: 52.161 ms.
I0723 16:53:57.990435   658 caffe.cpp:453]      relu7   forward: 0.306402 ms.
I0723 16:53:57.990453   658 caffe.cpp:456]      relu7   backward: 0.348346 ms.
I0723 16:53:57.990468   658 caffe.cpp:453]      drop7   forward: 3.98771 ms.
I0723 16:53:57.990483   658 caffe.cpp:456]      drop7   backward: 0.354259 ms.
I0723 16:53:57.990494   658 caffe.cpp:453]        fc8   forward: 9.74994 ms.
I0723 16:53:57.990505   658 caffe.cpp:456]        fc8   backward: 13.3348 ms.
I0723 16:53:57.990514   658 caffe.cpp:453]       loss   forward: 1.94023 ms.
I0723 16:53:57.990525   658 caffe.cpp:456]       loss   backward: 0.461751 ms.
I0723 16:53:57.990648   658 caffe.cpp:461] Average Forward pass: 1738.39 ms.
I0723 16:53:57.990667   658 caffe.cpp:463] Average Backward pass: 2691.41 ms.
I0723 16:53:57.990722   658 caffe.cpp:465] Average Forward-Backward: 4431.83 ms.
I0723 16:53:57.990741   658 caffe.cpp:467] Total Time: 22159.2 ms.

Intel spatial:

I0723 17:02:54.607259  1730 caffe.cpp:450] Average time per layer: 
I0723 17:02:54.607276  1730 caffe.cpp:453]       data   forward: 0.116164 ms.
I0723 17:02:54.607296  1730 caffe.cpp:456]       data   backward: 0.123356 ms.
I0723 17:02:54.607314  1730 caffe.cpp:453]      label   forward: 0.107694 ms.
I0723 17:02:54.607331  1730 caffe.cpp:456]      label   backward: 0.188441 ms.
I0723 17:02:54.607347  1730 caffe.cpp:453]      conv1   forward: 443.612 ms.
I0723 17:02:54.607363  1730 caffe.cpp:456]      conv1   backward: 427.156 ms.
I0723 17:02:54.607379  1730 caffe.cpp:453]      relu1   forward: 8.7127 ms.
I0723 17:02:54.607419  1730 caffe.cpp:456]      relu1   backward: 15.2398 ms.
I0723 17:02:54.607455  1730 caffe.cpp:453]      norm1   forward: 41.9368 ms.
I0723 17:02:54.607475  1730 caffe.cpp:456]      norm1   backward: 62.7724 ms.
I0723 17:02:54.607496  1730 caffe.cpp:453]      pool1   forward: 9.26116 ms.
I0723 17:02:54.607522  1730 caffe.cpp:456]      pool1   backward: 28.762 ms.
I0723 17:02:54.607568  1730 caffe.cpp:453]      conv2   forward: 1657.64 ms.
I0723 17:02:54.607631  1730 caffe.cpp:456]      conv2   backward: 1108.42 ms.
I0723 17:02:54.607692  1730 caffe.cpp:453]      relu2   forward: 7.24185 ms.
I0723 17:02:54.607743  1730 caffe.cpp:456]      relu2   backward: 10.7396 ms.
I0723 17:02:54.607791  1730 caffe.cpp:453]      norm2   forward: 28.8983 ms.
I0723 17:02:54.607834  1730 caffe.cpp:456]      norm2   backward: 36.666 ms.
I0723 17:02:54.607883  1730 caffe.cpp:453]      pool2   forward: 4.96558 ms.
I0723 17:02:54.607934  1730 caffe.cpp:456]      pool2   backward: 17.8944 ms.
I0723 17:02:54.608018  1730 caffe.cpp:453]      conv3   forward: 835.374 ms.
I0723 17:02:54.608065  1730 caffe.cpp:456]      conv3   backward: 658.829 ms.
I0723 17:02:54.608108  1730 caffe.cpp:453]      relu3   forward: 1.92744 ms.
I0723 17:02:54.608126  1730 caffe.cpp:456]      relu3   backward: 4.70381 ms.
I0723 17:02:54.608196  1730 caffe.cpp:453]      conv4   forward: 807.812 ms.
I0723 17:02:54.608238  1730 caffe.cpp:456]      conv4   backward: 568.898 ms.
I0723 17:02:54.608268  1730 caffe.cpp:453]      relu4   forward: 3.42795 ms.
I0723 17:02:54.608296  1730 caffe.cpp:456]      relu4   backward: 3.19371 ms.
I0723 17:02:54.608325  1730 caffe.cpp:453]      conv5   forward: 625.251 ms.
I0723 17:02:54.608355  1730 caffe.cpp:456]      conv5   backward: 432.836 ms.
I0723 17:02:54.608387  1730 caffe.cpp:453]      relu5   forward: 1.42619 ms.
I0723 17:02:54.608417  1730 caffe.cpp:456]      relu5   backward: 3.28224 ms.
I0723 17:02:54.608445  1730 caffe.cpp:453]      pool5   forward: 1.41549 ms.
I0723 17:02:54.608475  1730 caffe.cpp:456]      pool5   backward: 3.6956 ms.
I0723 17:02:54.608507  1730 caffe.cpp:453]        fc6   forward: 67.7576 ms.
I0723 17:02:54.608623  1730 caffe.cpp:456]        fc6   backward: 158.356 ms.
I0723 17:02:54.608661  1730 caffe.cpp:453]      relu6   forward: 0.383532 ms.
I0723 17:02:54.608695  1730 caffe.cpp:456]      relu6   backward: 0.39943 ms.
I0723 17:02:54.608728  1730 caffe.cpp:453]      drop6   forward: 5.45477 ms.
I0723 17:02:54.608758  1730 caffe.cpp:456]      drop6   backward: 0.501933 ms.
I0723 17:02:54.608789  1730 caffe.cpp:453]        fc7   forward: 36.5435 ms.
I0723 17:02:54.608824  1730 caffe.cpp:456]        fc7   backward: 73.022 ms.
I0723 17:02:54.608857  1730 caffe.cpp:453]      relu7   forward: 0.376915 ms.
I0723 17:02:54.608889  1730 caffe.cpp:456]      relu7   backward: 0.325761 ms.
I0723 17:02:54.608927  1730 caffe.cpp:453]      drop7   forward: 4.93873 ms.
I0723 17:02:54.608959  1730 caffe.cpp:456]      drop7   backward: 0.372964 ms.
I0723 17:02:54.609012  1730 caffe.cpp:453]        fc8   forward: 14.0754 ms.
I0723 17:02:54.609050  1730 caffe.cpp:456]        fc8   backward: 22.107 ms.
I0723 17:02:54.609099  1730 caffe.cpp:453]       loss   forward: 2.38256 ms.
I0723 17:02:54.609174  1730 caffe.cpp:456]       loss   backward: 0.419859 ms.
I0723 17:02:54.609408  1730 caffe.cpp:461] Average Forward pass: 4624.77 ms.
I0723 17:02:54.609439  1730 caffe.cpp:463] Average Backward pass: 3652.86 ms.
I0723 17:02:54.609529  1730 caffe.cpp:465] Average Forward-Backward: 8282.62 ms.
I0723 17:02:54.609557  1730 caffe.cpp:467] Total Time: 41413.1 ms.

The numbers are vastly different from yours, so I believe something must be wrong.

gongzg commented 8 years ago

@naibaf7 Oh, definitely not. Your SKL machine should be much faster than my GT2 machine, and comparable with the GT3e machine or even faster. From the log you pasted above:

I0723 17:02:54.607347  1730 caffe.cpp:453]      conv1   forward: 443.612 ms.
I0723 17:02:54.607363  1730 caffe.cpp:456]      conv1   backward: 427.156 ms.

I highly doubt that you were really using the spatial engine. You can easily uncomment the following line in the spatial convolution source code:

// #define dbg

Then, please remove .spatialkernels/* and re-run the benchmark. It will show the tuning process and print GFLOPS for each tuned kernel and for the final winner kernel.
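For reference, a GFLOPS figure like the one the tuner prints can be sanity-checked by hand. A hedged sketch, assuming the conventional 2 × MACs count for a direct convolution (conv_gflops is a hypothetical helper for illustration, not part of the Caffe source):

```cpp
#include <cstdint>

// MACs for a direct convolution:
//   N * M * H_out * W_out * K * K * C
// (batch, output channels, output spatial size, kernel size, input channels).
// GFLOPS = 2 * MACs / seconds / 1e9.
double conv_gflops(int64_t n, int64_t m, int64_t h_out, int64_t w_out,
                   int64_t k, int64_t c, double time_ms) {
    const double macs =
        static_cast<double>(n) * m * h_out * w_out * k * k * c;
    return (2.0 * macs) / (time_ms * 1e-3) / 1e9;
}
```

Plugging in AlexNet conv1 at batch 64 (96 output channels, 55×55 output, 11×11 kernel, 3 input channels) with the 443.612 ms from the log above gives roughly 30 GFLOPS, well below what the GPU's peak throughput would suggest, consistent with the suspicion that the tuned kernels were not actually being used.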

naibaf7 commented 8 years ago

@gongzg It should be using the spatial kernels, as it spent a really long time tuning on the first run. But here we go; the output does not look good:

Verification was not successful, fallback to basic kernel
Bechmarking kernel: U5_5_96_2_1_1_1_31_31_64_2_128_1_1_1_1_BASIC
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
(message repeated 46 more times)
    Estimated Gflops:28.6535
    Estimated GFLOPS/S: 21.8862
Convolution Time:1309.2
gongzg commented 8 years ago

@naibaf7 Thanks for the log. And now I know the reason:

Verification was not successful, fallback to basic kernel Bechmarking kernel: U5_5_96_2_1_1_1_31_31_64_2_128_1_1_1_1_BASIC

Beignet is broken on your system: it can't produce correct results with the optimized spatial kernel, so it falls back to the naive basic kernel. That's why you get bad performance numbers. We may need the beignet team's support again to find out why your beignet is broken.

naibaf7 commented 8 years ago

@gongzg Yeah, it's much more difficult to get working than I thought it would be... Do you know what the status of beignet on Skylake (Iris Pro) is? Basically all Caffe tests pass fine and LibDNN verifies correctly, no issues there (with both the beignet-1.1.1 that comes with Fedora 24 and the beignet-1.2 that I compiled from the current beignet master). But the Intel spatial convolution does not pass verification with this setup so far. From what I can tell, the difference is that the spatial convolution uses Intel-specific extensions, and these do not seem to work?

bhack commented 8 years ago

@naibaf7 See the devices list https://cgit.freedesktop.org/beignet/tree/src/cl_device_id.c

bhack commented 8 years ago

Intel extensions in beignet are in https://cgit.freedesktop.org/beignet/tree/include/CL/cl_intel.h

naibaf7 commented 8 years ago

@bhack Yeah, it's "Intel(R) HD Graphics Skylake ULT GT3", which is in the list, so that should be fine. Thanks.

bhack commented 8 years ago

I think the problem could be with the libdrm and kernel versions. What versions of both are you using?

naibaf7 commented 8 years ago

Kernel: 4.6.4-301.fc24.x86_64
Libdrm:

bhack commented 8 years ago

Mhh... Can you add a print of fixed_local_sz[i] inside the loop, before the modulo, at https://cgit.freedesktop.org/beignet/tree/src/cl_api.c#n3031?

naibaf7 commented 8 years ago

@gongzg I now tested this with 3 different versions of LLVM and Clang as you suggested:

Cleaned out the .spatialkernels folder for every test, but same result.

Driver is xorg-x11-drv-intel-2.99.917-23.20160512.fc24.x86_64 by the way. Could that be the issue?

bhack commented 8 years ago

Have you tried to debug/print that loop?

bhack commented 8 years ago

I don't know if this Beignet Workgroup guide is still valid.

naibaf7 commented 8 years ago

@bhack Oh sorry I missed your comment on printing the loop, I will follow up on that.

bhack commented 8 years ago

It is important to check whether `realGroupSize *= fixed_local_sz[i];` accumulates correctly. If you have compiled with debug symbols, you can also check with gdb breakpoints.

gongzg commented 8 years ago

@bhack For AlexNet, the Intel spatial convolution kernel always uses a 1,1,16 group size, which is valid for beignet. @naibaf7 I tested benchmark64 on a SKL GT2 machine with the latest git master beignet and LLVM 3.6, and got the following result:

I0725 04:35:59.524305 32761 caffe.cpp:448] Average time per layer:
I0725 04:35:59.524317 32761 caffe.cpp:451] data forward: 0.088519 ms.
I0725 04:35:59.524334 32761 caffe.cpp:454] data backward: 0.0850156 ms.
I0725 04:35:59.524350 32761 caffe.cpp:451] label forward: 0.0851441 ms.
I0725 04:35:59.524363 32761 caffe.cpp:454] label backward: 0.118466 ms.
I0725 04:35:59.524376 32761 caffe.cpp:451] conv1 forward: 58.4319 ms.
I0725 04:35:59.524392 32761 caffe.cpp:454] conv1 backward: 329.87 ms.
I0725 04:35:59.524408 32761 caffe.cpp:451] relu1 forward: 6.13321 ms.
I0725 04:35:59.524421 32761 caffe.cpp:454] relu1 backward: 9.00438 ms.
I0725 04:35:59.524435 32761 caffe.cpp:451] norm1 forward: 31.4477 ms.
I0725 04:35:59.524451 32761 caffe.cpp:454] norm1 backward: 38.0522 ms.
I0725 04:35:59.524463 32761 caffe.cpp:451] pool1 forward: 7.27156 ms.
I0725 04:35:59.524477 32761 caffe.cpp:454] pool1 backward: 25.028 ms.
I0725 04:35:59.524490 32761 caffe.cpp:451] conv2 forward: 186.484 ms.
I0725 04:35:59.524507 32761 caffe.cpp:454] conv2 backward: 1686.54 ms.
I0725 04:35:59.524520 32761 caffe.cpp:451] relu2 forward: 3.97442 ms.
I0725 04:35:59.524533 32761 caffe.cpp:454] relu2 backward: 5.8414 ms.
I0725 04:35:59.524545 32761 caffe.cpp:451] norm2 forward: 19.9107 ms.
I0725 04:35:59.524560 32761 caffe.cpp:454] norm2 backward: 23.4814 ms.
I0725 04:35:59.524574 32761 caffe.cpp:451] pool2 forward: 4.57914 ms.
I0725 04:35:59.524586 32761 caffe.cpp:454] pool2 backward: 16.3685 ms.
I0725 04:35:59.524600 32761 caffe.cpp:451] conv3 forward: 68.6992 ms.
I0725 04:35:59.524616 32761 caffe.cpp:454] conv3 backward: 628.469 ms.
I0725 04:35:59.524629 32761 caffe.cpp:451] relu3 forward: 1.4288 ms.
I0725 04:35:59.524641 32761 caffe.cpp:454] relu3 backward: 2.28515 ms.
I0725 04:35:59.524654 32761 caffe.cpp:451] conv4 forward: 55.6638 ms.
I0725 04:35:59.524669 32761 caffe.cpp:454] conv4 backward: 512.247 ms.
I0725 04:35:59.524683 32761 caffe.cpp:451] relu4 forward: 1.46054 ms.
I0725 04:35:59.524695 32761 caffe.cpp:454] relu4 backward: 2.3425 ms.
I0725 04:35:59.524708 32761 caffe.cpp:451] conv5 forward: 38.6343 ms.
I0725 04:35:59.524724 32761 caffe.cpp:454] conv5 backward: 365.608 ms.
I0725 04:35:59.524739 32761 caffe.cpp:451] relu5 forward: 0.998164 ms.
I0725 04:35:59.524751 32761 caffe.cpp:454] relu5 backward: 1.74181 ms.
I0725 04:35:59.524765 32761 caffe.cpp:451] pool5 forward: 1.24395 ms.
I0725 04:35:59.524777 32761 caffe.cpp:454] pool5 backward: 3.99459 ms.
I0725 04:35:59.524790 32761 caffe.cpp:451] fc6 forward: 68.0091 ms.
I0725 04:35:59.524806 32761 caffe.cpp:454] fc6 backward: 153.708 ms.
I0725 04:35:59.524821 32761 caffe.cpp:451] relu6 forward: 0.352468 ms.
I0725 04:35:59.524834 32761 caffe.cpp:454] relu6 backward: 0.365035 ms.
I0725 04:35:59.524847 32761 caffe.cpp:451] drop6 forward: 4.93038 ms.
I0725 04:35:59.524860 32761 caffe.cpp:454] drop6 backward: 0.368603 ms.
I0725 04:35:59.524879 32761 caffe.cpp:451] fc7 forward: 29.1046 ms.
I0725 04:35:59.524902 32761 caffe.cpp:454] fc7 backward: 69.5503 ms.
I0725 04:35:59.524927 32761 caffe.cpp:451] relu7 forward: 0.271047 ms.
I0725 04:35:59.524941 32761 caffe.cpp:454] relu7 backward: 0.311824 ms.
I0725 04:35:59.524955 32761 caffe.cpp:451] drop7 forward: 3.00953 ms.
I0725 04:35:59.524968 32761 caffe.cpp:454] drop7 backward: 0.337674 ms.
I0725 04:35:59.524981 32761 caffe.cpp:451] fc8 forward: 9.80128 ms.
I0725 04:35:59.524994 32761 caffe.cpp:454] fc8 backward: 17.2192 ms.
I0725 04:35:59.525054 32761 caffe.cpp:451] loss forward: 1.44299 ms.
I0725 04:35:59.525071 32761 caffe.cpp:454] loss backward: 0.307409 ms.
I0725 04:35:59.525177 32761 caffe.cpp:459] Average Forward pass: 606.389 ms.
I0725 04:35:59.525205 32761 caffe.cpp:461] Average Backward pass: 3901.76 ms.
I0725 04:35:59.525254 32761 caffe.cpp:463] Average Forward-Backward: 4509.35 ms.
I0725 04:35:59.525277 32761 caffe.cpp:465] Total Time: 45093.5 ms.
I0725 04:35:59.525291 32761 caffe.cpp:466] * Benchmark ends *

The clinfo:

Number of platforms                  1
Platform Name                        Intel Gen OCL Driver
Platform Vendor                      Intel
Platform Version                     OpenCL 1.2 beignet 1.2 (git-b55060c)
Platform Profile                     FULL_PROFILE
Platform Extensions                  cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_motion_estimation cl_intel_subgroups
Platform Extensions function suffix  Intel

Platform Name                        Intel Gen OCL Driver
Number of devices                    1
Device Name                          Intel(R) HD Graphics Skylake Desktop GT2
Device Vendor                        Intel
Device Vendor ID                     0x8086
Device Version                       OpenCL 1.2 beignet 1.2 (git-b55060c)
Driver Version                       1.2
Device OpenCL C Version              OpenCL C 1.2 beignet 1.2 (git-b55060c)
Device Type                          GPU
Device Profile                       FULL_PROFILE
Max compute units                    24
Max clock frequency                  1000MHz
Device Partition                     (core)
  Max number of sub-devices          1
  Supported partition types          None, None, None
Max work item dimensions             3
Max work item sizes                  512x512x512
Max work group size                  512
Preferred work group size multiple   16

Kernel information: Linux gongzg-skl 4.6.2-040602-generic #201606100516 SMP Fri Jun 10 09:18:34 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

So it seems that beignet works fine on some SKL platforms under the above configuration. I will work with the beignet team to try to reproduce your environment and issues.

gongzg commented 8 years ago

@naibaf7 Could you share the latest clinfo of your machine here? In the clinfo (clinfo_after) you sent me last week, there is one clover device and one Intel CPU device.

bhack commented 8 years ago

@gongzg How can it enter https://cgit.freedesktop.org/beignet/tree/src/cl_api.c#n3036 if local_work_size is not NULL?

gongzg commented 8 years ago

@bhack Those output messages should not come from the spatial convolution kernels; they must come from some other kernels. The spatial convolution kernels don't use a NULL local work size.

bhack commented 8 years ago

OK, so this message was probably generated by the autotuning code. Where is "Verification was not successful, fallback to basic kernel" in the code?

gongzg commented 8 years ago

@bhack This warning message is in Caffe's spatial convolution file, in the function void ConvolutionLayerSpatial::setup_convolution().

naibaf7 commented 8 years ago

@gongzg I removed the "mesa clover" ICD on that system; it was causing issues with device initialization. The clinfo is now:

Number of platforms                               2
  Platform Name                                   Intel(R) OpenCL
  Platform Vendor                                 Intel(R) Corporation
  Platform Version                                OpenCL 1.2 LINUX
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_spir cl_intel_exec_by_local_thread cl_khr_depth_images cl_khr_3d_image_writes cl_khr_fp64 
  Platform Extensions function suffix             INTEL

  Platform Name                                   Intel Gen OCL Driver
  Platform Vendor                                 Intel
  Platform Version                                OpenCL 1.2 beignet 1.2 (git-b55060c)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_motion_estimation cl_intel_subgroups
  Platform Extensions function suffix             Intel

  Platform Name                                   Intel(R) OpenCL
Number of devices                                 1
  Device Name                                     Intel(R) Core(TM) i7-6560U CPU @ 2.20GHz
  Device Vendor                                   Intel(R) Corporation
  Device Vendor ID                                0x8086
  Device Version                                  OpenCL 1.2 (Build 8)
  Driver Version                                  1.2.0.8
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     CPU
  Device Profile                                  FULL_PROFILE
  Max compute units                               4
  Max clock frequency                             2200MHz
  Device Partition                                (core)
    Max number of sub-devices                     4
    Supported partition types                     by counts, equally, by names (Intel)
  Max work item dimensions                        3
  Max work item sizes                             8192x8192x8192
  Max work group size                             8192
  Preferred work group size multiple              128
  Preferred / native vector sizes                 
    char                                                 1 / 32      
    short                                                1 / 16      
    int                                                  1 / 8       
    long                                                 1 / 4       
    half                                                 0 / 0        (n/a)
    float                                                1 / 8       
    double                                               1 / 4        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Address bits                                    64, Little-Endian
  Global memory size                              7917195264 (7.373GiB)
  Error Correction support                        No
  Max memory allocation                           1979298816 (1.843GiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        262144
  Global Memory cache line                        64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             480
    Max size for 1D images from buffer            123706176 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             2048x2048x2048 pixels
    Max number of read image args                 480
    Max number of write image args                480
  Local memory type                               Global
  Local memory size                               32768 (32KiB)
  Max constant buffer size                        131072 (128KiB)
  Max number of constant args                     480
  Max size of kernel argument                     3840 (3.75KiB)
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Local thread execution (Intel)                Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            Yes
    SPIR versions                                 1.2
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Device Extensions                               cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_spir cl_intel_exec_by_local_thread cl_khr_depth_images cl_khr_3d_image_writes cl_khr_fp64 

  Platform Name                                   Intel Gen OCL Driver
Number of devices                                 1
  Device Name                                     Intel(R) HD Graphics Skylake ULT GT3
  Device Vendor                                   Intel
  Device Vendor ID                                0x8086
  Device Version                                  OpenCL 1.2 beignet 1.2 (git-b55060c)
  Driver Version                                  1.2
  Device OpenCL C Version                         OpenCL C 1.2 beignet 1.2 (git-b55060c)
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Max compute units                               48
  Max clock frequency                             1000MHz
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None, None, None
  Max work item dimensions                        3
  Max work item sizes                             512x512x512
  Max work group size                             512
  Preferred work group size multiple              16
  Preferred / native vector sizes                 
    char                                                16 / 8       
    short                                                8 / 8       
    int                                                  4 / 4       
    long                                                 2 / 2       
    half                                                 0 / 8        (cl_khr_fp16)
    float                                                4 / 4       
    double                                               0 / 2        (n/a)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (n/a)
  Address bits                                    32, Little-Endian
  Global memory size                              3958374400 (3.687GiB)
  Error Correction support                        No
  Max memory allocation                           2968518656 (2.765GiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        8192
  Global Memory cache line                        64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   4096 bytes
    Pitch alignment for 2D image buffers          1 bytes
    Max 2D image size                             8192x8192 pixels
    Max 3D image size                             8192x8192x2048 pixels
    Max number of read image args                 128
    Max number of write image args                8
  Local memory type                               Global
  Local memory size                               65536 (64KiB)
  Max constant buffer size                        134217728 (128MiB)
  Max number of constant args                     8
  Max size of kernel argument                     1024
  Queue properties                                
    Out-of-order execution                        No
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      80ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            Yes
    SPIR versions                                 1.2
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                __cl_copy_region_align4;__cl_copy_region_align16;__cl_cpy_region_unalign_same_offset;__cl_copy_region_unalign_dst_offset;__cl_copy_region_unalign_src_offset;__cl_copy_buffer_rect;__cl_copy_image_1d_to_1d;__cl_copy_image_2d_to_2d;__cl_copy_image_3d_to_2d;__cl_copy_image_2d_to_3d;__cl_copy_image_3d_to_3d;__cl_copy_image_2d_to_buffer;__cl_copy_image_3d_to_buffer;__cl_copy_buffer_to_image_2d;__cl_copy_buffer_to_image_3d;__cl_fill_region_unalign;__cl_fill_region_align2;__cl_fill_region_align4;__cl_fill_region_align8_2;__cl_fill_region_align8_4;__cl_fill_region_align8_8;__cl_fill_region_align8_16;__cl_fill_region_align128;__cl_fill_image_1d;__cl_fill_image_1d_array;__cl_fill_image_2d;__cl_fill_image_2d_array;__cl_fill_image_3d;
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_motion_estimation cl_intel_subgroups cl_khr_fp16

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              Success [INTEL]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No platform
naibaf7 commented 7 years ago

@gongzg Currently trying to get this working again, but still no luck so far. For one, when compiling beignet 1.3, there is an issue finding Clang and LLVM in FindLLVM.cmake, namely:

macro(add_one_lib name)
  FIND_LIBRARY(CLANG_LIB
    NAMES ${name}
    PATHS ${LLVM_LIBRARY_DIR} NO_DEFAULT_PATH)
  set(CLANG_LIBRARIES ${CLANG_LIBRARIES} ${CLANG_LIB})
  unset(CLANG_LIB CACHE)
endmacro()

assumes that the Clang libraries will be found in the same path as LLVM, which is often not the case, so I wonder why it doesn't also search the default library folders as a secondary search path (/usr/lib or /usr/lib64, depending on whether the system uses /usr/lib32, and possibly also /usr/lib64/clang).
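One possible fix, sketched here as an untested suggestion for beignet's FindLLVM.cmake: keep the LLVM-directory lookup first, but fall back to CMake's default search paths (/usr/lib, /usr/lib64, ...) when nothing is found there.

```cmake
macro(add_one_lib name)
  # First look next to LLVM, as the original macro does.
  FIND_LIBRARY(CLANG_LIB
    NAMES ${name}
    PATHS ${LLVM_LIBRARY_DIR} NO_DEFAULT_PATH)
  # Hypothetical fallback: retry with the default system search paths
  # if the library was not found next to LLVM.
  if(NOT CLANG_LIB)
    FIND_LIBRARY(CLANG_LIB NAMES ${name})
  endif()
  set(CLANG_LIBRARIES ${CLANG_LIBRARIES} ${CLANG_LIB})
  unset(CLANG_LIB CACHE)
endmacro()
```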

naibaf7 commented 7 years ago

@gongzg Now that I've got at least beignet-1.3 and/or beignet-1.2 working*, I get the following error:

Log: error: unknown argument: '-cl-no-subgroup-ifp'
stringInput.cl:833:16: warning: unknown attribute 'intel_reqd_sub_group_size' ignored

I only found these flags in the OpenCL 2.1 specifications. Is that the new requirement?

* installed via: sudo dnf install beignet --releasever=25
gongzg commented 7 years ago

@naibaf7 I just fixed some compatibility issues with beignet and sent out the PR https://github.com/BVLC/caffe/pull/4734 to you for review. I tested it with the beignet git master version; it works fine now.

naibaf7 commented 7 years ago

@gongzg The forward pass works very well with the Intel closed-source driver. Here are the performance numbers: AlexNet, batch size = 128, forward pass time = 995 ms, so that's roughly 128 images/second. This is in line with the numbers you posted above: you used batch size 64 and thus had half the time, but the same throughput of about 128 inferences/second, since batch sizes of 32 to 64 are already sufficient to fill up these GPUs.

But as I see it, your kernels will fall back to the default method for backward passes, is that right? Do you intend to also develop backward kernels, or should we use one of the libDNN kernels for backward passes when we integrate the spatial and GEMM-like kernels into libDNN? They're not (yet) optimized for Intel chips, but a lot faster than the clBLAS/ViennaCL/CLBlast serialized-batch fallbacks (which currently take 11201 ms).

Another point I want to discuss with you is the parameter space of these implementations. We need to figure out and implement metadata describing which kernels can handle which parameters (such as 2D, 3D, ..., ND, dilation, stride).

gongzg commented 7 years ago

@naibaf7 I don't have a short-term plan to further optimize the backward path. But I have been optimizing GEMM/GEMV performance for a while, and since the backward path depends on the GEMM/GEMV implementation, it could benefit from that. I have a much better internal implementation, and with these internal GEMM/GEMV kernels my performance numbers are as below:

I0920 03:38:40.866559 26513 caffe.cpp:453] data forward: 0.0476741 ms.
I0920 03:38:40.866562 26513 caffe.cpp:456] data backward: 0.0590157 ms.
I0920 03:38:40.866564 26513 caffe.cpp:453] label forward: 0.0455981 ms.
I0920 03:38:40.866566 26513 caffe.cpp:456] label backward: 0.0479456 ms.
I0920 03:38:40.866569 26513 caffe.cpp:453] conv1 forward: 82.3457 ms.
I0920 03:38:40.866571 26513 caffe.cpp:456] conv1 backward: 153.344 ms.
I0920 03:38:40.866575 26513 caffe.cpp:453] relu1 forward: 12.0988 ms.
I0920 03:38:40.866576 26513 caffe.cpp:456] relu1 backward: 17.9105 ms.
I0920 03:38:40.866578 26513 caffe.cpp:453] norm1 forward: 38.6664 ms.
I0920 03:38:40.866581 26513 caffe.cpp:456] norm1 backward: 43.2145 ms.
I0920 03:38:40.866583 26513 caffe.cpp:453] pool1 forward: 12.1539 ms.
I0920 03:38:40.866585 26513 caffe.cpp:456] pool1 backward: 51.9181 ms.
I0920 03:38:40.866590 26513 caffe.cpp:453] conv2 forward: 180.182 ms.
I0920 03:38:40.866595 26513 caffe.cpp:456] conv2 backward: 567.339 ms.
I0920 03:38:40.866600 26513 caffe.cpp:453] relu2 forward: 7.82132 ms.
I0920 03:38:40.866602 26513 caffe.cpp:456] relu2 backward: 11.53 ms.
I0920 03:38:40.866606 26513 caffe.cpp:453] norm2 forward: 24.545 ms.
I0920 03:38:40.866611 26513 caffe.cpp:456] norm2 backward: 26.7601 ms.
I0920 03:38:40.866614 26513 caffe.cpp:453] pool2 forward: 7.68509 ms.
I0920 03:38:40.866617 26513 caffe.cpp:456] pool2 backward: 33.371 ms.
I0920 03:38:40.866621 26513 caffe.cpp:453] conv3 forward: 116.501 ms.
I0920 03:38:40.866626 26513 caffe.cpp:456] conv3 backward: 363.753 ms.
I0920 03:38:40.866629 26513 caffe.cpp:453] relu3 forward: 2.69272 ms.
I0920 03:38:40.866632 26513 caffe.cpp:456] relu3 backward: 4.20844 ms.
I0920 03:38:40.866636 26513 caffe.cpp:453] conv4 forward: 95.4557 ms.
I0920 03:38:40.866641 26513 caffe.cpp:456] conv4 backward: 321.725 ms.
I0920 03:38:40.866646 26513 caffe.cpp:453] relu4 forward: 2.7257 ms.
I0920 03:38:40.866649 26513 caffe.cpp:456] relu4 backward: 4.21281 ms.
I0920 03:38:40.866653 26513 caffe.cpp:453] conv5 forward: 67.2799 ms.
I0920 03:38:40.866658 26513 caffe.cpp:456] conv5 backward: 249.627 ms.
I0920 03:38:40.866662 26513 caffe.cpp:453] relu5 forward: 1.80247 ms.
I0920 03:38:40.866667 26513 caffe.cpp:456] relu5 backward: 2.91217 ms.
I0920 03:38:40.866672 26513 caffe.cpp:453] pool5 forward: 1.93871 ms.
I0920 03:38:40.866675 26513 caffe.cpp:456] pool5 backward: 7.94494 ms.
I0920 03:38:40.866680 26513 caffe.cpp:453] fc6 forward: 38.2532 ms.
I0920 03:38:40.866684 26513 caffe.cpp:456] fc6 backward: 85.3805 ms.
I0920 03:38:40.866689 26513 caffe.cpp:453] relu6 forward: 0.257588 ms.
I0920 03:38:40.866694 26513 caffe.cpp:456] relu6 backward: 0.330823 ms.
I0920 03:38:40.866699 26513 caffe.cpp:453] drop6 forward: 1.9385 ms.
I0920 03:38:40.866703 26513 caffe.cpp:456] drop6 backward: 1.01975 ms.
I0920 03:38:40.866708 26513 caffe.cpp:453] fc7 forward: 19.4333 ms.
I0920 03:38:40.866713 26513 caffe.cpp:456] fc7 backward: 39.0352 ms.
I0920 03:38:40.866717 26513 caffe.cpp:453] relu7 forward: 0.258505 ms.
I0920 03:38:40.866721 26513 caffe.cpp:456] relu7 backward: 0.314898 ms.
I0920 03:38:40.866725 26513 caffe.cpp:453] drop7 forward: 1.94144 ms.
I0920 03:38:40.866730 26513 caffe.cpp:456] drop7 backward: 0.316922 ms.
I0920 03:38:40.866758 26513 caffe.cpp:453] fc8 forward: 5.75826 ms.
I0920 03:38:40.866763 26513 caffe.cpp:456] fc8 backward: 10.1147 ms.
I0920 03:38:40.866768 26513 caffe.cpp:453] loss forward: 1.47946 ms.
I0920 03:38:40.866773 26513 caffe.cpp:456] loss backward: 0.252916 ms.
I0920 03:38:40.866837 26513 caffe.cpp:461] Average Forward pass: 727.879 ms.
I0920 03:38:40.866845 26513 caffe.cpp:463] Average Backward pass: 2002.14 ms.
I0920 03:38:40.866852 26513 caffe.cpp:465] Average Forward-Backward: 2730.27 ms.

I am using your benchmark128.prototxt. And please note that my test machine is just an SKL GT2 machine with 24 EUs. I have already started the process of contributing the internal GEMM/GEMV implementation to ISAAC. Before that, I think you may try using libdnn to handle the convolution backward path and compare against my current performance numbers, to see whether it is still worth switching to libdnn for the backward path.
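For comparing such runs, the per-layer timings can be aggregated directly from the `caffe time` log. A minimal sketch (the exact log line format is assumed from the output above; this helper is illustrative, not part of Caffe):

```python
import re

def summarize(log_text):
    """Sum per-layer forward/backward timings from a `caffe time` log."""
    totals = {"forward": 0.0, "backward": 0.0}
    # Per-layer lines end with e.g. "conv1 forward: 82.3457 ms."
    for layer, direction, ms in re.findall(
            r"(\w+) (forward|backward): ([\d.]+) ms\.", log_text):
        totals[direction] += float(ms)
    return totals

log = "conv1 forward: 82.3457 ms. conv1 backward: 153.344 ms."
print(summarize(log))  # {'forward': 82.3457, 'backward': 153.344}
```

The regex is case-sensitive on purpose, so the "Average Forward pass" summary lines are not double-counted.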

As to the metadata part, I will add some comments to the .cl files later to specify which parameters are supported by these implementations.

gongzg commented 7 years ago

@naibaf7 do you still have issues with beignet? If so, could you let me know the details?

naibaf7 commented 7 years ago

@gongzg Seems fine with the closed-source drivers, though I will have to do some more tests with the GPU clock (battery vs. powered) and different convolutions. As I want to do some code optimizations for the Intel GPU, I will just stick with that for the moment.

But with beignet, the issue above persists:

Beignet: "unable to find good values for local_work_size[i], please provide local_work_size[] explicitly, you can find good values with trial-and-error method."
gongzg commented 7 years ago

@naibaf7 that's a warning message and should not cause a fatal error. You can disable these warning messages by building a release version of beignet.
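For reference, the warning concerns the `local_work_size` argument of `clEnqueueNDRangeKernel`: when it is `NULL`, the driver must guess a work-group shape. One way to pick one explicitly is a simple search over candidates that evenly divide the global size, as sketched below (the power-of-two candidate list and the 256 work-group budget are assumptions for illustration, not Beignet's actual limits; real code should query `CL_KERNEL_WORK_GROUP_SIZE`):

```python
def pick_local_size(global_size, max_group_size=256):
    """Pick a local work size per dimension: the largest power-of-two
    candidate that divides the global size, keeping the total
    work-group size within max_group_size."""
    local = []
    budget = max_group_size
    for g in global_size:
        # Try larger candidates first (a crude trial-and-error order).
        for cand in (64, 32, 16, 8, 4, 2, 1):
            if g % cand == 0 and cand <= budget:
                local.append(cand)
                budget //= cand
                break
    return tuple(local)

print(pick_local_size((128, 96, 3)))
```

The resulting tuple would then be passed as `local_work_size` in the `clEnqueueNDRangeKernel` call instead of `NULL`.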