lukeiwanski / tensorflow

OpenCL support for TensorFlow via SYCL
Apache License 2.0

FAIL: //tensorflow/core:util_stat_summarizer_test #127

Closed lukeiwanski closed 7 years ago

lukeiwanski commented 7 years ago

System Info

  Name:                      Hawaii
  Vendor:                    Advanced Micro Devices, Inc.
  Device OpenCL C version:           OpenCL C 2.0 
  Driver version:                1912.5 (VM)
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 2.0 AMD-APP (1912.5)
  Extensions:                    cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_khr_gl_depth_images cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes

ComputeCpp 0.2.0

To reproduce

bazel test --config=sycl --local_test_jobs=4 -k --test_lang_filters=cc,py --action_env=LD_PRELOAD=/usr/lib/libOpenCL.so.1 --test_timeout 300,750,1200,3600 //tensorflow/core:util_stat_summarizer_test

Error

[  PASSED  ] 1 test.
external/bazel_tools/tools/test/test-setup.sh: line 168: 19794 Segmentation fault      (core dumped) "${TEST_PATH}" "$@"

jwlawson commented 7 years ago

This only happens occasionally, in roughly 5 out of 50 runs of the test. Add --runs_per_test=50 to the bazel test command to reproduce. The test runs very quickly, so repeating it this many times is not a problem.

jwlawson commented 7 years ago

This looks like a segfault inside the GSYCLInterface destructor, when it destroys the Eigen::QueueInterface objects.

I can't get the crash to reproduce when the tests are run serially, only when there are at least 3 tests running at once.

jwlawson commented 7 years ago

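The segfault comes from the tensorflow::Buffer destructor. A tensorflow::Buffer is essentially a wrapper around an array and a pointer to its allocator:

// Simplified sketch, not the actual TensorFlow definition.
template <typename T, typename Alloc>
struct Buffer {
  T* data;       // the underlying array
  Alloc* alloc;  // non-owning pointer to the allocator that created data
};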

The destructor uses the allocator to free the array by calling alloc->deallocate(data). The problem is that some Tensors are left on the SYCL device, so their underlying buffers are only deleted at program exit. There is then a race between deleting those buffers and deleting the SYCL allocators stored in GSYCLInterface. When an allocator is deleted first, the buffer is left holding a dangling pointer to it and segfaults when it tries to deallocate its array.
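
To illustrate the ordering hazard described above, here is a minimal sketch of one possible mitigation: the buffer shares ownership of its allocator, so the allocator cannot be destroyed while a buffer still needs it. This is not necessarily what the actual fix does. LoggingAllocator, OwningBuffer and run_demo are hypothetical names made up for this example and are not part of TensorFlow, Eigen or ComputeCpp.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a SYCL allocator. It records every event in a
// log so that the order of operations can be inspected afterwards.
class LoggingAllocator {
 public:
  explicit LoggingAllocator(std::vector<std::string>* log) : log_(log) {}
  ~LoggingAllocator() { log_->push_back("allocator destroyed"); }

  int* allocate(std::size_t n) {
    log_->push_back("allocate");
    return new int[n];
  }

  void deallocate(int* p) {
    log_->push_back("deallocate");
    delete[] p;
  }

 private:
  std::vector<std::string>* log_;  // not owned; must outlive the allocator
};

// Hypothetical buffer that shares ownership of its allocator instead of
// holding a raw pointer, so the allocator stays alive until the buffer's
// destructor has finished deallocating.
class OwningBuffer {
 public:
  OwningBuffer(std::shared_ptr<LoggingAllocator> alloc, std::size_t n)
      : alloc_(std::move(alloc)), data_(alloc_->allocate(n)) {
    assert(data_ != nullptr);
  }
  ~OwningBuffer() { alloc_->deallocate(data_); }

  OwningBuffer(const OwningBuffer&) = delete;
  OwningBuffer& operator=(const OwningBuffer&) = delete;

 private:
  std::shared_ptr<LoggingAllocator> alloc_;  // declared first: initialised first
  int* data_;
};

// Reproduces the problematic order: the original owner of the allocator lets
// go of it before the buffer is destroyed. Returns the recorded event log.
std::vector<std::string> run_demo() {
  std::vector<std::string> log;
  auto alloc = std::make_shared<LoggingAllocator>(&log);
  auto buffer = std::make_unique<OwningBuffer>(alloc, 16);
  alloc.reset();   // owner releases the allocator first
  buffer.reset();  // buffer is destroyed later and can still deallocate
  return log;
}
```

Calling run_demo() from a main function and printing the returned log shows the deallocation happening before the allocator is destroyed, even though the original owner released the allocator first. With a raw Alloc* as in the struct above, the same sequence would call deallocate on an already destroyed allocator.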

jwlawson commented 7 years ago

Fixed by #136

lukeiwanski commented 7 years ago

Closing