Closed (mirh closed this issue 5 years ago)
Weird, I cannot reproduce the issue. The kernel that it is trying to compile is really simple, so I see no reason why it would fail. Having the full backtrace might help a bit more. I don't think it would solve this, but as a rule of thumb you should always run the Python scripts from outside the TF repository when you want to use the installed whl. On a side note, ComputeCpp 0.9.0 was released today. We haven't tested it much with TF yet, but it could solve your issue.
Ehrm.. nothing new under the sun. (and beginning and end of bt are kinda always the same - do you want the extended python information perhaps?)
Well, well, well, we have updates here:
#18 0x00007fffe733982f in clBuildProgram () from /usr/lib/libOpenCL.so.1
#19 0x00007fffe786ac0f in cl::sycl::detail::program::build_current_program(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool) () from /opt/ComputeCpp-CE-1.0.1-Ubuntu-16.04-x86_64/lib/libComputeCpp.so
#20 0x00007fffe786aeb2 in cl::sycl::detail::program::build(unsigned char const*, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool) () from /opt/ComputeCpp-CE-1.0.1-Ubuntu-16.04-x86_64/lib/libComputeCpp.so
#21 0x00007fffe787e1c2 in cl::sycl::detail::context::create_program_for_binary(std::shared_ptr<cl::sycl::detail::context> const&, unsigned char const*, int, bool) () from /opt/ComputeCpp-CE-1.0.1-Ubuntu-16.04-x86_64/lib/libComputeCpp.so
#22 0x00007fffe788504f in cl::sycl::program::create_program_for_kernel_impl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned char const*, int, char const* const*, std::shared_ptr<cl::sycl::detail::context>, bool) ()
from /opt/ComputeCpp-CE-1.0.1-Ubuntu-16.04-x86_64/lib/libComputeCpp.so
#23 0x00007fffee915e2b in cl::sycl::program cl::sycl::program::create_program_for_kernel<tensorflow::functor::FillRandomKernel<tensorflow::random::TruncatedNormalDistribution<tensorflow::random::SingleSampleAdapter<tensorflow::random::PhiloxRandom>, float> > >(cl::sycl::context) () from /usr/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#24 0x00007fffee915700 in void cl::sycl::handler::parallel_for_impl<tensorflow::functor::FillRandomKernel<tensorflow::random::TruncatedNormalDistribution<tensorflow::random::SingleSampleAdapter<tensorflow::random::PhiloxRandom>, float> >, tensorflow::functor::FillPhiloxRandomKernel<tensorflow::random::TruncatedNormalDistribution<tensorflow::random::SingleSampleAdapter<tensorflow::random::PhiloxRandom>, float>, true> >(cl::sycl::detail::nd_range_base const&, tensorflow::functor::FillPhiloxRandomKernel<tensorflow::random::TruncatedNormalDistribution<tensorflow::random::SingleSampleAdapter<tensorflow::random::PhiloxRandom>, float>, true> const&) ()
from /usr/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#25 0x00007fffee9154db in tensorflow::functor::FillPhiloxRandom<Eigen::SyclDevice, tensorflow::random::TruncatedNormalDistribution<tensorflow::random::SingleSampleAdapter<tensorflow::random::PhiloxRandom>, float> >::operator()(tensorflow::OpKernelContext*, Eigen::SyclDevice const&, tensorflow::random::PhiloxRandom, float*, long long, tensorflow::random::TruncatedNormalDistribution<tensorflow::random::SingleSampleAdapter<tensorflow::random::PhiloxRandom>, float>)::{lambda(cl::sycl::handler&)#1}::operator()(cl::sycl::handler&) const () from /usr/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#26 0x00007fffee9152cf in cl::sycl::event cl::sycl::detail::command_group::submit_handler<tensorflow::functor::FillPhiloxRandom<Eigen::SyclDevice, tensorflow::random::TruncatedNormalDistribution<tensorflow::random::SingleSampleAdapter<tensorflow::random::PhiloxRandom>, float> >::operator()(tensorflow::OpKernelContext*, Eigen::SyclDevice const&, tensorflow::random::PhiloxRandom, float*, long long, tensorflow::random::TruncatedNormalDistribution<tensorflow::random::SingleSampleAdapter<tensorflow::random::PhiloxRandom>, float>)::{lambda(cl::sycl::handler&)#1}>(tensorflow::functor::FillPhiloxRandom<Eigen::SyclDevice, tensorflow::random::TruncatedNormalDistribution<tensorflow::random::SingleSampleAdapter<tensorflow::random::PhiloxRandom>, float> >::operator()(tensorflow::OpKernelContext*, Eigen::SyclDevice const&, tensorflow::random::PhiloxRandom, float*, long long, tensorflow::random::TruncatedNormalDistribution<tensorflow::random::SingleSampleAdapter<tensorflow::random::PhiloxRandom>, float>)::{lambda(cl::sycl::handler&)#1}, std::shared_ptr<cl::sycl::detail::queue> const&, cl::sycl::detail::standard_handler_tag) ()
from /usr/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#27 0x00007fffee905513 in cl::sycl::event cl::sycl::queue::submit<tensorflow::functor::FillPhiloxRandom<Eigen::SyclDevice, tensorflow::random::TruncatedNormalDistribution<tensorflow::random::SingleSampleAdapter<tensorflow::random::PhiloxRandom>, float> >::operator()(tensorflow::OpKernelContext*, Eigen::SyclDevice const&, tensorflow::random::PhiloxRandom, float*, long long, tensorflow::random::TruncatedNormalDistribution<tensorflow::random::SingleSampleAdapter<tensorflow::random::PhiloxRandom>, float>)::{lambda(cl::sycl::handler&)#1}>(tensorflow::functor::FillPhiloxRandom<Eigen::SyclDevice, tensorflow::random::TruncatedNormalDistribution<tensorflow::random::SingleSampleAdapter<tensorflow::random::PhiloxRandom>, float> >::operator()(tensorflow::OpKernelContext*, Eigen::SyclDevice const&, tensorflow::random::PhiloxRandom, float*, long long, tensorflow::random::TruncatedNormalDistribution<tensorflow::random::SingleSampleAdapter<tensorflow::random::PhiloxRandom>, float>)::{lambda(cl::sycl::handler&)#1}) ()
from /usr/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#28 0x00007fffee905352 in tensorflow::functor::FillPhiloxRandom<Eigen::SyclDevice, tensorflow::random::TruncatedNormalDistribution<tensorflow::random::SingleSampleAdapter<tensorflow::random::PhiloxRandom>, float> >::operator()(tensorflow::OpKernelContext*, Eigen::SyclDevice const&, tensorflow::random::PhiloxRandom, float*, long long, tensorflow::random::TruncatedNormalDistribution<tensorflow::random::SingleSampleAdapter<tensorflow::random::PhiloxRandom>, float>) ()
from /usr/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#29 0x00007fffee90c1db in tensorflow::(anonymous namespace)::PhiloxRandomOp<Eigen::SyclDevice, tensorflow::random::TruncatedNormalDistribution<tensorflow::random::SingleSampleAdapter<tensorflow::random::PhiloxRandom>, float> >::Compute(tensorflow::OpKernelContext*)
() from /usr/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#30 0x00007fffe8646b91 in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) () from /usr/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#31 0x00007fffe8647938 in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(tensorflow::gtl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8> const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::$_1>::_M_invoke(std::_Any_data const&) ()
from /usr/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#32 0x00007fffe86a2c7c in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
from /usr/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#33 0x00007fffe86a25cd in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /usr/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so
#34 0x00007fffe760d063 in std::execute_native_thread_routine (__p=0x555556730d30)
at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:80
#35 0x00007ffff7c00a9d in start_thread () from /usr/lib/libpthread.so.0
#36 0x00007ffff7b30a43 in clone () from /usr/lib/libc.so.6
Thanks for the report. It is weird that you are once again hitting an issue with building a kernel, since I haven't seen that for a long time on the different devices we test. This kind of issue typically happens when some invalid code is compiled for the device, but then the issue would appear on all devices. Do you know of a previous version that doesn't have this particular issue? We still don't have a good way in TensorFlow to print the log coming from clGetProgramBuildInfo, but I'll ping you when we have something!
Mhh nope, I only started to try this pretty late in my venturing... (and I'll have to see about downgrading, given how much I'm already hanging over my hacky downgrade of python)
Anyway, it's not like I have such high pretensions for this. It's more of an "if it can run here, then all the more so it should run for others".
OK, don't bother too much with downgrading; the chances that it fixes the issue are really low. This old commit using TF 1.6 would probably work: https://github.com/Rbiessy/tensorflow/commit/2142bdbcbe5fa53ffe13411a654d3e854861b68a It uses Eigen for the random distributions, which is faster, but the properties of the distributions are really bad, so we had to revert the changes. If this is a blocking issue for you and you think a not-so-good random distribution would be acceptable, I could make a branch with these changes on TF 1.9 (until we have a way to fix the issue).
Edit: When compiling do you see something like this?
warning: [Computecpp:CC0034]: Function ... is undefined but referenced on the device and the associated kernels may fail to build or execute at run time [-Wsycl-undef-func]
That would be the same as #249 then, I guess.. Anyway no, I don't recall seeing that warning
Yes I think the 2 issues are the same. Ok, interesting.
Hi @mirh, I have some updates on your issues with clBuildProgram failing.
There is a way to get more information on the failure with the latest version of eigen_sycl and ComputeCpp 1.0.2 but it also requires a small change in the code.
The call to submit here should be wrapped inside a try-catch like so:
    try {
      device.sycl_queue().submit( ... );
    } catch (const cl::sycl::exception& e) {
      std::cerr << e.what() << std::endl;
    }
Would you be able to try that and see what your output is?
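For readers without a SYCL toolchain at hand, the try-catch suggested above can be sketched in plain C++. This is only an illustration of the pattern: `submit_and_report` and `fake_failing_submit` are hypothetical names that do not exist in TensorFlow, Eigen or ComputeCpp, and `std::runtime_error` stands in for `cl::sycl::exception`. The simulated error text is made up and is not a real ComputeCpp message.

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of TensorFlow, Eigen or ComputeCpp).
// Runs `submit`; if it throws ExceptionT, prints e.what() to stderr and
// returns the message. Returns an empty string when nothing is thrown.
template <typename ExceptionT, typename Submit>
std::string submit_and_report(Submit&& submit) {
  try {
    submit();
  } catch (const ExceptionT& e) {
    std::cerr << "kernel submission failed: " << e.what() << std::endl;
    return e.what();
  }
  return {};
}

// Stand-in for a device.sycl_queue().submit( ... ) call that fails.
// In the real code the exception type would be cl::sycl::exception.
inline void fake_failing_submit() {
  throw std::runtime_error("clBuildProgram failed (simulated)");
}
```

Calling `submit_and_report<std::runtime_error>(fake_failing_submit)` prints the simulated message and returns it, while a callable that does not throw yields an empty string. In the actual TensorFlow code the body of the `try` would be the existing `submit` call, unchanged.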
Unfortunately my humble laptop is pretty much fried atm... And I'm not sure when (if ever) I'll get it working again.
We found the issue: it was caused by a double being used in a function assumed to be for floats. The fix is on my dev branch for now: https://github.com/Rbiessy/tensorflow/commit/419a6947e1aa3b30e54567c9419f5ca1d6a5df27
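As a rough illustration of how a double can slip into float-only code, here is a minimal sketch. It is an assumption about the general class of bug, not a reproduction of the linked commit: the function names are hypothetical, and the unsuffixed literal is just one common way this happens. Kernels that use double may fail to build on OpenCL devices that lack double-precision support, which would be consistent with the failure showing up only on some devices.

```cpp
#include <type_traits>

// Hypothetical float-only helper standing in for a device function.
// The unsuffixed literal 0.5 has type double, so x is promoted and the
// whole expression is evaluated (and returned) as double.
inline auto scale_with_double_literal(float x) -> decltype(x * 0.5) {
  return x * 0.5;
}

// Same computation with a float literal: the expression stays in float.
inline auto scale_with_float_literal(float x) -> decltype(x * 0.5f) {
  return x * 0.5f;
}

// Compile-time checks of the resulting types.
static_assert(
    std::is_same<decltype(scale_with_double_literal(1.0f)), double>::value,
    "an unsuffixed literal promotes the expression to double");
static_assert(
    std::is_same<decltype(scale_with_float_literal(1.0f)), float>::value,
    "an f-suffixed literal keeps the expression in float");
```

On a host compiler both versions build and give the same value for simple inputs, which is part of why this kind of slip is easy to miss until a device compiler rejects it.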
I had spotted the commit.. I'll get back to testing once my replacement motherboard arrives from HK
Getting the very same stack as #249. Should I still attempt that try-catch?
(i.e. python -m tensorflow.python.debug.examples.debug_mnist)