hughperkins / DeepCL

OpenCL library to train deep convolutional neural networks
Mozilla Public License 2.0

Unit tests fail #30

Closed pzawal closed 8 years ago

pzawal commented 8 years ago

Hi, I ran unit tests and ran into some errors. Compiled with Visual C++ 2015 x64 on Windows 7 with Radeon 7970.

Different errors: "unknown file: error: C++ exception with description "memallocsize too small to use this kernel on this device. Need: 0MB, but only have: -1984MB max alloc size" thrown in the test body. [ FAILED ] testforward.compare_1_n_biased_nopad (2230 ms)"

"error: Expected: (0.1f) >= (loss), actual: 0.1 vs 2.72727 clblas teardown [ FAILED ] testsimpleconvolvenet.imagesize_5_3_2layers_filtersize_3_3_biased_n18 (11310 ms)"

"ForwardAuto: kernel 5: this instance cant be used: For ForwardFc, filtersize and inputimagesize must be identical"

"Something went wrong, code -55" thrown in the test body. [ FAILED ] testbackward.compare_1_n_kgsgo_32c5 (843 ms)"

"ForwardAuto: kernel 2: this instance cant be used: cannot use forward2, since outputimagesize * outputimagesize > maxworkgroupsize"

Full log http://pastebin.com/v9d7ZMFu

Gpuinfo output:

platform index: 0: platform id: 000007FEDD2AF180 platform vendor: Advanced Micro Devices, Inc. platform name: AMD Accelerated Parallel Processing platform num devices: 2

device index: 0 device id: 0000000000379A20 device type: 4 global memory size: 3072MB local memory size: 32KB global cache size: 16KB global cacheline size: 64 max memory alloc size: 2112MB max compute units: 32 max workgroup size: 256 max workitem dimensions: 3 max workitem sizes: 256 256 256 device name: Tahiti opencl c version: OpenCL C 1.2 opencl device version: OpenCL 1.2 AMD-APP (1800.8) frequency MHz: 925

hughperkins commented 8 years ago

Thanks! Ok, there are a few different test failures there. I don't have a plan for how to fix them yet, but I'll go through the probable cause of each one:

Summary:

hughperkins commented 8 years ago

Hi. It looks like I was using a long to hold the device info values, which is a 64-bit integer on Linux but only 32 bits on Windows. I've changed it to int64_t, which is unambiguously 64-bit on any platform, and committed it to master as 9760689. If you pull down the changes, this will hopefully fix the issue with '-1984MB'.
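As a rough illustration of where '-1984MB' can come from (a sketch, not DeepCL's actual code): the device reports a max alloc size of 2112MB, which is more than a signed 32-bit integer can hold. The helper names below are hypothetical, and the truncation is simulated explicitly so that the example behaves the same on every platform.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: keep only the low 32 bits of a value and reinterpret
// them as a signed 32-bit integer, which is what storing the value in a
// 32-bit long effectively does.
int32_t truncateToSigned32(uint64_t value) {
    uint32_t low = static_cast<uint32_t>(value);  // well-defined modular conversion
    int32_t result;
    std::memcpy(&result, &low, sizeof(result));   // reinterpret the bit pattern
    return result;
}

// Size in MB as seen through a 32-bit signed variable (assumed failure mode).
int64_t megabytesVia32Bit(uint64_t bytes) {
    return truncateToSigned32(bytes) / (1024 * 1024);
}

// Size in MB when the value is kept in a 64-bit integer such as int64_t.
int64_t megabytesVia64Bit(uint64_t bytes) {
    return static_cast<int64_t>(bytes) / (1024 * 1024);
}
```

Calling `megabytesVia32Bit(2112ULL * 1024 * 1024)` gives -1984, matching the message in the report, while `megabytesVia64Bit` gives 2112 for the same input.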

hughperkins commented 8 years ago

Committed change f3f8829, which skips kernel 2 in testforward.compare_1_n_biased_pad if 19*19 > maxworkgroupsize.
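The condition from the "cannot use forward2" message can be sketched as a small predicate. This is an assumption about the shape of the check, not the code from the commit, and the function name is hypothetical.

```cpp
#include <cassert>

// Hypothetical predicate: forward2 is assumed to need one work-item per
// output pixel inside a single workgroup, so the output image must fit
// within the device's max workgroup size.
bool forward2FitsWorkgroup(int outputImageSize, int maxWorkgroupSize) {
    assert(outputImageSize > 0 && maxWorkgroupSize > 0);
    return outputImageSize * outputImageSize <= maxWorkgroupSize;
}
```

On the Tahiti device reported above (max workgroup size 256), `forward2FitsWorkgroup(19, 256)` is false because 19*19 = 361, so a test using that geometry would be skipped for kernel 2. An output image of 16x16 would still fit.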

hughperkins commented 8 years ago

Ah, it looks like the error code -55 is actually a bug: a guard is missing, and so the kernel crashes. Thank you for pointing this error out!

Committed e52d1a2, which adds a guard to one of the backprop kernels. I think the test will likely still fail, but the error message might be a bit more compact now.

hughperkins commented 8 years ago

I think that covers all the test failures you highlighted above. Can you pull down the updates, and retry please?

pzawal commented 8 years ago

The -1984MB issue is fixed. Code -55 still shows though, not sure if that's an issue.

http://pastebin.com/22Ap2Np7

hughperkins commented 8 years ago

Hi pzawal, I'm not seeing an error -55 test failure any more. I'm seeing the following failures:

I will look at making the tests skip these two kernels, for these specific geometries, on AMD. I don't think these test failures indicate any fundamental problem with DeepCL on your hardware/OS configuration. At runtime, these kernels would simply be skipped, and other kernels used instead.
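The "skip and use another kernel" behaviour can be pictured with the following sketch. It is not ForwardAuto's actual implementation; the `Candidate` struct and `pickFirstUsable` function are hypothetical names used only for illustration.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical description of one kernel implementation: a name plus a
// check that says whether it can run for the current geometry and device.
struct Candidate {
    std::string name;
    std::function<bool()> canRun;
};

// Walk the candidates in order and return the first one that is usable.
// Unusable candidates are skipped rather than treated as fatal errors.
std::string pickFirstUsable(const std::vector<Candidate>& candidates) {
    for (const auto& candidate : candidates) {
        if (candidate.canRun()) {
            return candidate.name;
        }
    }
    throw std::runtime_error("no usable kernel for this geometry");
}
```

With a list in which the first entry reports that it cannot run and the second reports that it can, `pickFirstUsable` returns the second entry's name. An error is raised only if every candidate is unusable.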

hughperkins commented 8 years ago

Ok, I've changed the tests to use a slightly different geometry for the kernels that are failing. I think it's good for the tests to run on these kernels anyway, to check correctness, but we can't test them on incompatible geometries, hence the slightly modified geometries. Commit 7634e54

pzawal commented 8 years ago

There were no unit test failures due to code -55; I meant that it still shows up in the output of some subtests. I guess it's all ok then :) thanks

http://pastebin.com/eE9v811x

hughperkins commented 8 years ago

No, you're right actually, let's fix those. Committed 719e82b, which adds some guards and hopefully removes the remaining error -55s from the output. Can you pull this down and check whether there are still any error -55s in the output?

pzawal commented 8 years ago

They disappeared, all nice now.

http://pastebin.com/zV9ASVqq

Btw, you should update your clBLAS fork to include clMathLibraries/clBLAS@78770945fdc2a772adee95152981ed28144e0836, which fixes a minor VS 2015 compilation issue. I fixed it manually, and only now noticed that the linked clBLAS is a fork.

hughperkins commented 8 years ago

"They disappeared, all nice now."

Cool :-)

"Btw, you should update your clBLAS fork to include clMathLibraries/clBLAS@7877094, fixes a minor vs 2015 compilation issue."

Ah, good info. Thanks! :-)