naibaf7 / caffe

Caffe: a fast open framework for deep learning. With OpenCL and CUDA support.
http://caffe.berkeleyvision.org/

Intel i5-5300U HD Graphics doesn't finish running #36

Open rajgott opened 8 years ago

rajgott commented 8 years ago

I have an i5-5300U and want to do inference on the integrated GPU. I can detect the GPU using clDeviceQuery. I compiled and installed Greentea with Intel OpenCL 1.2. clDeviceQuery.txt

I can test my model on the CPU in ~1 second per image. When I switch to GPU mode the inference doesn't finish; I have waited 6+ hours. The CPU runs at close to 100% during this run.

Is this normal? Has anyone gotten it to work on Intel integrated graphics?

naibaf7 commented 8 years ago

@rajgott Can you show me your Makefile.config and your network prototxt? In some instances you may be running out of GPU memory, or the convolution engine may not be suitable for your GPU. On the 5300U, you should test whether enabling either INTEL_SPATIAL or LIBDNN in the Makefile.config works instead.
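The relevant switches look roughly like this (just a sketch; the exact flag names are assumed from the Makefile.config.example in this branch, so double-check against your copy):

# in Makefile.config -- enable one of these, then do a clean rebuild
USE_INTEL_SPATIAL := 1
# USE_LIBDNN := 1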

Also try to run:

./build/tools/caffe device_query

and post the result here. It is the "built-in clinfo" for OpenCL Caffe.

rajgott commented 8 years ago

Here is my Makefile.config Makefile.config.txt

This happens with standard BVLC models, for example GoogLeNet. INTEL_SPATIAL was enabled; now I have enabled LIBDNN as well. Even with this, the problem remains.

Output of ./build/tools/caffe device_query:

I0708 07:01:31.117813 6125 common.cpp:373] Total devices: 2
I0708 07:01:31.118115 6125 common.cpp:374] CUDA devices: 0
I0708 07:01:31.118127 6125 common.cpp:375] OpenCL devices: 2
I0708 07:01:31.118134 6125 common.cpp:399] Device id: 0
I0708 07:01:31.118144 6125 common.cpp:401] Device backend: OpenCL
I0708 07:01:31.118168 6125 common.cpp:403] Backend details: Intel(R) Corporation: OpenCL 1.2
I0708 07:01:31.118181 6125 common.cpp:405] Device vendor: Intel(R) Corporation
I0708 07:01:31.118191 6125 common.cpp:407] Name: Intel(R) HD Graphics
I0708 07:01:31.118198 6125 common.cpp:409] Total global memory: 3427585229
I0708 07:01:31.118208 6125 common.cpp:399] Device id: 1
I0708 07:01:31.118216 6125 common.cpp:401] Device backend: OpenCL
I0708 07:01:31.118224 6125 common.cpp:403] Backend details: Intel(R) Corporation: OpenCL 1.2
I0708 07:01:31.118240 6125 common.cpp:405] Device vendor: Intel(R) Corporation
I0708 07:01:31.118249 6125 common.cpp:407] Name: Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz
I0708 07:01:31.118435 6125 common.cpp:409] Total global memory: 7245664256

Thanks

naibaf7 commented 8 years ago

Can you try with LIBDNN enabled but INTEL_SPATIAL disabled, and also with both disabled, just to be sure? From your information I can't think of another problem than a convolution that is stalling. Otherwise, can you run the network on --gpu1 instead of --gpu0 and see if at least OpenCL on the CPU works? Make sure to do clean builds (make clean, make all) before testing. You can also run the following to see if a certain layer gets stalled:

./build/test/test_all.testbin 0
./build/test/test_all.testbin 1
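Putting that together, the whole sequence would look roughly like this (just a sketch; the -j flag is only a convenience, not required):

make clean
make all -j"$(nproc)"
./build/test/test_all.testbin 0   # test suite on OpenCL device 0
./build/test/test_all.testbin 1   # test suite on OpenCL device 1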

rajgott commented 8 years ago

I tried all combinations with LIBDNN and INTEL_SPATIAL, but inference still stalls.

./build/test/test_all.testbin 1 — for example, BatchNormLayerTest/2:

[----------] 3 tests from BatchNormLayerTest/2, where TypeParam = caffe::GPUDevice
[ RUN ] BatchNormLayerTest/2.TestForward
[ OK ] BatchNormLayerTest/2.TestForward (2 ms)
[ RUN ] BatchNormLayerTest/2.TestForwardInplace
[ OK ] BatchNormLayerTest/2.TestForwardInplace (1 ms)
[ RUN ] BatchNormLayerTest/2.TestGradient
[ OK ] BatchNormLayerTest/2.TestGradient (15257 ms)
[----------] 3 tests from BatchNormLayerTest/2 (15260 ms total)

This segfaults after passing several tests. The output: test1.txt

./build/test/test_all.testbin 0 is much slower. For example, BatchNormLayerTest/2:

[----------] 3 tests from BatchNormLayerTest/2, where TypeParam = caffe::GPUDevice
[ RUN ] BatchNormLayerTest/2.TestForward
[ OK ] BatchNormLayerTest/2.TestForward (970 ms)
[ RUN ] BatchNormLayerTest/2.TestForwardInplace
[ OK ] BatchNormLayerTest/2.TestForwardInplace (68 ms)
[ RUN ] BatchNormLayerTest/2.TestGradient
[ OK ] BatchNormLayerTest/2.TestGradient (369108 ms)
[----------] 3 tests from BatchNormLayerTest/2 (370146 ms total)

It may take a while for the remaining tests to finish.

Thanks

naibaf7 commented 8 years ago

@rajgott Hmm, it seems INTEL_SPATIAL fails on your GPU (it appears not to have the expected Skylake OpenCL features).

I can't really tell what's wrong; it looks like you ran the tests with INTEL_SPATIAL compiled in. You should compile in the default configuration (LIBDNN and INTEL_SPATIAL off) to find out more...

rajgott commented 8 years ago

I have INTEL_SPATIAL and LIBDNN commented out as in the default Makefile.config

CMakeCache.txt shows this:

//No help, variable specified on the command line.
USE_INTEL_SPATIAL:UNINITIALIZED=OFF
//Build Caffe with OpenCL libdnn
USE_LIBDNN:BOOL=OFF
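(For reference, I pulled those lines out with a plain grep against the cache file — nothing Caffe-specific, and the path is wherever your CMake build directory is:)

grep -E 'USE_INTEL_SPATIAL|USE_LIBDNN' CMakeCache.txt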

How else can I confirm whether these two are disabled or enabled?

mattg-sp commented 8 years ago

I have the same CPU, and I experience the exact same failure on device 1. However, that's the CPU cores, which we don't plan to use. If I run that test case on device 0 (the real HD Graphics GPU), it passes. I am using the default setting of USE_INTEL_SPATIAL.

The problem exposed by the tests is that the GPU backend (device 0) runs many of them somewhere between 60x and 270x slower than device 1. During this time, the test uses about 6% of a CPU core, and the rest of its time seems to be io_wait.

naibaf7 commented 8 years ago

@mattg-sp OK. I got an Intel test platform now and will investigate this further. As a benchmark, can you please run this:

./build/tools/caffe time -model models/bvlc_alexnet/benchmark64.prototxt -gpu=0 -iterations=5

mattg-sp commented 8 years ago

First, thanks for all your help!

Second, that command has been running for 75 minutes, using 99% of a CPU core (virtually all user time). I've attached a profile and the callstacks of all the threads.

caffe_profile-time_gpu0_bvlc_alexnet-benchmark64.txt caffe_callstacks-time_gpu0_bvlc_alexnet-benchmark64.txt

naibaf7 commented 8 years ago

@mattg-sp Uh, that's not supposed to happen, especially not if you run it on the iGPU. Are you both using Macbooks by any chance? Can I have these details please:

It does not stall like that on either the i7-3632QM or i7-6560U integrated GPUs I use for testing (both on beignet-1.1 and Fedora 23/24).

@gongzg ideas?

mattg-sp commented 8 years ago

sumac:~/caffe # uname -a

Linux sumac 4.1.20-11-default #1 SMP PREEMPT Fri Mar 18 14:42:07 UTC 2016 (0a392b2) x86_64 x86_64 x86_64 GNU/Linux

sumac:~/caffe # cat /etc/os-release

NAME="SLES"
VERSION="12-SP1"
VERSION_ID="12.1"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP1"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp1"

I'm having issues attaching files to this post, so I'll attach the rest of the requested details in a separate post.

gongzg commented 8 years ago

@naibaf7 @mattg-sp @rajgott, beignet seems to have a CPU-side performance issue with the gradient test case. It runs slower and slower during the iterations and appears to stall. The Beignet team is investigating it. But you will not have that issue if you run the convnet-benchmark or the case Fabian mentioned above with INTEL_SPATIAL enabled. For BDW, the recommended kernel version is 4.4 or newer; for SKL, it is 4.6 or newer.

mattg-sp commented 8 years ago

caffe-device_query.txt clinfo.txt

The hardware is actually an Intel NUC (model: NUC5i5MYHE) with 16 GB of RAM.

Actually, here's the top of /proc/meminfo:

MemTotal:       16300032 kB
MemFree:         6359192 kB
MemAvailable:   13040116 kB
mattg-sp commented 8 years ago

@gongzg thanks, but would that explain the behavior of caffe time -model models/bvlc_alexnet/benchmark64.prototxt?

Also, how do I know whether we're using beignet? The runtime I'm using is intel-linux-media-ocl_generic_16.4.4-47109_64bit.tar.gz, which I downloaded from Intel's website. Is that built on beignet?

mattg-sp commented 8 years ago

clinfo_dos.txt Here's the same clinfo, with DOS line endings.

naibaf7 commented 8 years ago

@mattg-sp Hmm, haven't tried with that package yet. Usually it's easiest to use https://www.freedesktop.org/wiki/Software/Beignet/ provided by the operating system (Ubuntu and Fedora have the "beignet" package in the repositories) on a recent kernel (4.4 or 4.6 as @gongzg pointed out).

Alternatively, try to compile the most recent beignet from source: https://cgit.freedesktop.org/beignet/ Instructions are here: https://gist.github.com/spiralray/cae0bc235509e495fec1

The installation is successful if you can find the "beignet-1.x" string in clinfo.
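A rough sketch of that source build (steps assumed from the linked gist; the clone URL and -j flag are only illustrative), plus the clinfo check:

git clone https://anongit.freedesktop.org/git/beignet.git   # URL assumed; see the cgit link above
cd beignet && mkdir build && cd build
cmake ..
make -j"$(nproc)"
sudo make install
clinfo | grep -i beignet   # installation is OK if a beignet-1.x string shows up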

gongzg commented 8 years ago

@mattg-sp let's focus on one configuration at a time. All of my comments are for USE_INTEL_SPATIAL=ON. I just saw your clinfo and confirmed that you are using the closed-source OpenCL compiler, but the version is a little out of date. Please change to the latest published version at https://software.intel.com/en-us/articles/opencl-drivers#latest_linux_driver. The clinfo should be:

Number of devices                 1
Device Name                       Intel(R) HD Graphics
Device Vendor                     Intel(R) Corporation
Device Vendor ID                  0x8086
Device Version                    OpenCL 1.2
Driver Version                    1.0.47971
Device OpenCL C Version           OpenCL C 1.2 ( using USC )
Device Type                       GPU
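To check which driver you currently have, a simple filter over the clinfo output works (field names as shown above):

clinfo | grep -E 'Device Version|Driver Version'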

naibaf7 commented 8 years ago

@gongzg What's the difference between beignet and the closed-source compiler? Can you elaborate on why it even exists?

mattg-sp commented 8 years ago

I believe the closed source SDK came first. It's understandable why people want open source, though.

The reason we're using closed source is that we're also using Intel's Media SDK. I'll investigate whether beignet can be used in conjunction with that.

gongzg commented 8 years ago

@naibaf7 that's a little bit complicated. One of the reasons is that the open-source version is for Linux only, while the closed-source version is derived from the Windows OCL driver. @mattg-sp Thanks for your explanation, and yes, the closed-source SDK for Windows came first, then we had the open source for Linux, and then the OpenCL SDK began to support Linux. I would stop this discussion here. Let's focus on the issue itself :).

If you want to use beignet, the recommended beignet version is git master, and the recommended LLVM version is LLVM 3.6. LLVM evolves very quickly, and sometimes newer versions bring compatibility issues with beignet. You can check the link https://www.freedesktop.org/wiki/Software/Beignet/, which recommends using LLVM 3.5 or 3.6.
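A quick way to see which LLVM version is on your path before building beignet (assuming llvm-config is installed):

llvm-config --version   # ideally reports 3.5.x or 3.6.x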

If you have a BDW or HSW machine and want to use the OpenCL SDK, I would suggest version "1.0.47971", which is what I am using on my BDW machine right now; it should not have any issue running those benchmarks.

If you have a SKL machine, only beignet is supported so far.

mattg-sp commented 8 years ago

By upgrading my i915 device driver, I was able to resolve the issue of slow tests. Now, all unit tests pass on the GPU except for these:

Im2colLayerTest/2.TestDilatedGradient
Im2colLayerTest/2.TestDilatedGradientForceND
ConvolutionLayerTest/2.TestDilatedGradient

And those pass on the CPU device.
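(In case it helps to reproduce, those three cases can be re-run in isolation with a standard gtest filter — a sketch, using the device argument as earlier in this thread:)

./build/test/test_all.testbin 0 --gtest_filter='*TestDilatedGradient*'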

The benchmark64 still hangs on the GPU device, however. I will now investigate using the OpenCL 2.0 runtime and Beignet.

naibaf7 commented 8 years ago

@mattg-sp You can also use more lightweight versions of the benchmark: start at benchmark1 and, if that passes, go up in batch size until you find the fastest-performing batch size (which is the smallest batch size that fully exhausts the GPU cores; check the scaling by dividing the time by the batch size). A rough sweep is sketched below.
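Something like this (sketch only; it assumes benchmarkN.prototxt files exist for these batch sizes and that caffe time logs a "Total Time" line):

for n in 1 32 64 128; do
  ./build/tools/caffe time -model models/bvlc_alexnet/benchmark${n}.prototxt -gpu=0 -iterations=5 2>&1 \
    | grep "Total Time" | sed "s/^/batch ${n}: /"
done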

mattg-sp commented 8 years ago

Thanks. I got up to benchmark32 to work on GPU 0 (Total Time: 12673.4 ms). Incidentally, it's about twice as fast as GPU 1 (Total Time: 25053.4 ms - on a CPU with 2 cores / 4 threads).

Wait... now even benchmark64 works. But I can still scroll back to last night and see the run that didn't work. Nothing changed since then: no reboots, and I didn't run or install anything until I started with benchmark1 this morning. I'm definitely not mistaken; I've checked over the parameters, and I can clearly see that I canceled the failed run after 7m2.418s.

Update: even benchmark128 passed. Three out of three times, so far.

Maybe I'll reboot and see if I can get it to hang, again.

mattg-sp commented 8 years ago

Oh, I was also going to ask whether any benchmark data from different platforms is collected anywhere.

And are the unit test failures I mentioned a few posts ago anything to be concerned about? Are they likely to compromise the integrity of my results?

naibaf7 commented 8 years ago

@mattg-sp Yes, those failures basically mean you can't train correctly (the gradients are wrong) with the Caffe convolution engine. You could check whether LibDNN verification passes, but Intel spatial convolution uses the default engine for the backward pass, and since that verification fails, it's not usable. What you could do is check how far off the values are: if they are only slightly off and fail the kappa-test only marginally, you might be fine.

Which device is GPU and which one is CPU on your system (0 and 1)? If 0 is the GPU, then that's not too bad :)