pbelevich opened this issue 2 years ago
This is reproducible only on 4+ GCP a2-megagpu-16g nodes. The code works on 1, 2, and 3 nodes, but crashes on 4 and 5. It works fine on other GCP instance types and on AWS instances.
Code: https://gist.github.com/pbelevich/1049831dfe0de8ae3bfe047a1cd67ad4

Core dump with symbols:
```
#0  tensorpipe::(anonymous namespace)::loadFdsFromArray<0, 1, 2, 3, tensorpipe::Fd, tensorpipe::Fd, tensorpipe::Fd, tensorpipe::Fd> (array=0x10) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/common/socket.h:58
#1  tensorpipe::recvFromSocket<unsigned int, tensorpipe::Fd, tensorpipe::Fd, tensorpipe::Fd, tensorpipe::Fd> (socketFd=3676, t1=@0x2b3b9d7e78d8: 24, t2=@0x2b3b9d7e78dc: 25, fds#0=..., fds#1=..., fds#2=..., fds#3=...) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/common/socket.h:171
#2  0x00002b3afecf6769 in tensorpipe::Socket::recvPayloadAndFds<unsigned int, tensorpipe::Fd, tensorpipe::Fd, tensorpipe::Fd, tensorpipe::Fd, false> (this=<optimized out>, t2=@0x2b3b9d7e78dc: 25, t1=@0x2b3b9d7e78d8: 24) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/common/socket.h:253
#3  tensorpipe::transport::shm::ConnectionImpl::handleEventInFromLoop (this=this@entry=0x2b3bf423ca90) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/transport/shm/connection_impl.cc:225
#4  0x00002b3afecf7368 in tensorpipe::transport::shm::ConnectionImpl::handleEventsFromLoop (this=0x2b3bf423ca90, events=1) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/transport/shm/connection_impl.cc:192
#5  0x00002b3afecf3105 in tensorpipe::EpollLoop::handleEpollEventsFromLoop (this=0x55e03d6f8258, epollEvents=std::vector of length 47, capacity 64 = {...}) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/common/epoll_loop.cc:193
#6  0x00002b3afecf326f in tensorpipe::EpollLoop::<lambda()>::operator() (__closure=0x2b3ba0000be0) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/common/epoll_loop.cc:168
#7  tensorpipe::DeferredExecutor::<lambda()>::operator() (__closure=0x2b3ba0000bd0) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/common/deferred_executor.h:67
#8  std::_Function_handler<void(), tensorpipe::DeferredExecutor::runInLoop(F&&) [with F = tensorpipe::EpollLoop::loop()::<lambda()>]::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_function.h:316
#9  0x00002b3afed08689 in std::function<void ()>::operator()() const (this=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_function.h:706
#10 tensorpipe::EventLoopDeferredExecutor::runDeferredFunctionsFromEventLoop (this=0x55e03d6f8098) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/common/deferred_executor.h:220
#11 tensorpipe::BusyPollingLoop::eventLoop (this=0x55e03d6f8098) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/common/busy_polling_loop.h:37
#12 0x00002b3afecf050f in tensorpipe::EventLoopDeferredExecutor::loop (this=0x55e03d6f8098, threadName=...) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/common/deferred_executor.h:230
#13 0x00002b3afecf02e1 in std::__invoke_impl<void, void (tensorpipe::EventLoopDeferredExecutor::*)(std::string), tensorpipe::EventLoopDeferredExecutor*, std::string> (__t=<optimized out>, __f=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/invoke.h:73
#14 std::__invoke<void (tensorpipe::EventLoopDeferredExecutor::*)(std::string), tensorpipe::EventLoopDeferredExecutor*, std::string> (__fn=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/invoke.h:95
#15 std::thread::_Invoker<std::tuple<void (tensorpipe::EventLoopDeferredExecutor::*)(std::string), tensorpipe::EventLoopDeferredExecutor*, std::string> >::_M_invoke<0ul, 1ul, 2ul> (this=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/thread:234
#16 std::thread::_Invoker<std::tuple<void (tensorpipe::EventLoopDeferredExecutor::*)(std::string), tensorpipe::EventLoopDeferredExecutor*, std::string> >::operator() (this=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/thread:243
#17 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tensorpipe::EventLoopDeferredExecutor::*)(std::string), tensorpipe::EventLoopDeferredExecutor*, std::string> > >::_M_run (this=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/thread:186
#18 0x00002b3af9d0c73f in execute_native_thread_routine () from /home/jamesreed_fb_com/pytorch/torch/lib/libtorch.so
#19 0x00002b3abfcadea5 in start_thread () from /lib64/libpthread.so.0
#20 0x00002b3abffc09fd in clone () from /lib64/libc.so.6
```
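For context on the crashing frame: `loadFdsFromArray` is called with `array=0x10`, i.e. an essentially-null pointer, while the shm transport is receiving file descriptors over a Unix domain socket. Below is a minimal sketch of the SCM_RIGHTS receive pattern involved; it is not TensorPipe's actual implementation (the function name, `numFds`, and buffer sizes are illustrative), but it shows where a missing or truncated control message could yield exactly this kind of invalid FD-array pointer if it were dereferenced unchecked.

```cpp
// Sketch of receiving file descriptors over a Unix domain socket via
// SCM_RIGHTS. Illustrative only; not TensorPipe's code.
#include <sys/socket.h>
#include <sys/uio.h>
#include <cstring>
#include <vector>

// Receive `numFds` descriptors plus a small payload from `sockFd`.
// Returns false if the ancillary data is missing or truncated -- the
// situation that would leave an invalid cmsg data pointer (cf. array=0x10
// in frame #0) if it were read without checking.
bool recvPayloadAndFdsSketch(int sockFd, void* payload, size_t payloadLen,
                             std::vector<int>& fds, size_t numFds) {
  struct iovec iov;
  iov.iov_base = payload;
  iov.iov_len = payloadLen;

  // Control buffer sized for `numFds` descriptors worth of ancillary data.
  std::vector<char> ctrl(CMSG_SPACE(sizeof(int) * numFds));

  struct msghdr msg;
  std::memset(&msg, 0, sizeof(msg));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl.data();
  msg.msg_controllen = ctrl.size();

  ssize_t rv = ::recvmsg(sockFd, &msg, 0);
  if (rv < 0 || (msg.msg_flags & MSG_CTRUNC)) {
    return false;  // error or truncated ancillary data
  }

  struct cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  // If the sender did not attach the expected SCM_RIGHTS message, cmsg may
  // be null or the wrong size; bail out instead of reading descriptors
  // from an invalid pointer.
  if (cmsg == nullptr || cmsg->cmsg_level != SOL_SOCKET ||
      cmsg->cmsg_type != SCM_RIGHTS ||
      cmsg->cmsg_len != CMSG_LEN(sizeof(int) * numFds)) {
    return false;
  }

  const int* fdArray = reinterpret_cast<const int*>(CMSG_DATA(cmsg));
  fds.assign(fdArray, fdArray + numFds);
  return true;
}
```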
@jamesr66a @kwen2501