pbelevich opened this issue 2 years ago
This is reproducible only on 4+ GCP a2-megagpu-16g nodes. The code works on 1, 2, and 3 nodes, but crashes on 4 and 5. It works fine on other GCP instance types and on AWS instances.
Code: https://gist.github.com/pbelevich/1049831dfe0de8ae3bfe047a1cd67ad4

Core dump with symbols:
```
#0  tensorpipe::(anonymous namespace)::loadFdsFromArray<0, 1, 2, 3, tensorpipe::Fd, tensorpipe::Fd, tensorpipe::Fd, tensorpipe::Fd> (array=0x10) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/common/socket.h:58
#1  tensorpipe::recvFromSocket<unsigned int, tensorpipe::Fd, tensorpipe::Fd, tensorpipe::Fd, tensorpipe::Fd> (socketFd=3676, t1=@0x2b3b9d7e78d8: 24, t2=@0x2b3b9d7e78dc: 25, fds#0=..., fds#1=..., fds#2=..., fds#3=...) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/common/socket.h:171
#2  0x00002b3afecf6769 in tensorpipe::Socket::recvPayloadAndFds<unsigned int, tensorpipe::Fd, tensorpipe::Fd, tensorpipe::Fd, tensorpipe::Fd, false> (this=<optimized out>, t2=@0x2b3b9d7e78dc: 25, t1=@0x2b3b9d7e78d8: 24) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/common/socket.h:253
#3  tensorpipe::transport::shm::ConnectionImpl::handleEventInFromLoop (this=this@entry=0x2b3bf423ca90) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/transport/shm/connection_impl.cc:225
#4  0x00002b3afecf7368 in tensorpipe::transport::shm::ConnectionImpl::handleEventsFromLoop (this=0x2b3bf423ca90, events=1) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/transport/shm/connection_impl.cc:192
#5  0x00002b3afecf3105 in tensorpipe::EpollLoop::handleEpollEventsFromLoop (this=0x55e03d6f8258, epollEvents=std::vector of length 47, capacity 64 = {...}) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/common/epoll_loop.cc:193
#6  0x00002b3afecf326f in tensorpipe::EpollLoop::<lambda()>::operator() (__closure=0x2b3ba0000be0) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/common/epoll_loop.cc:168
#7  tensorpipe::DeferredExecutor::<lambda()>::operator() (__closure=0x2b3ba0000bd0) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/common/deferred_executor.h:67
#8  std::_Function_handler<void(), tensorpipe::DeferredExecutor::runInLoop(F&&) [with F = tensorpipe::EpollLoop::loop()::<lambda()>]::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_function.h:316
#9  0x00002b3afed08689 in std::function<void ()>::operator()() const (this=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_function.h:706
#10 tensorpipe::EventLoopDeferredExecutor::runDeferredFunctionsFromEventLoop (this=0x55e03d6f8098) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/common/deferred_executor.h:220
#11 tensorpipe::BusyPollingLoop::eventLoop (this=0x55e03d6f8098) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/common/busy_polling_loop.h:37
#12 0x00002b3afecf050f in tensorpipe::EventLoopDeferredExecutor::loop (this=0x55e03d6f8098, threadName=...) at /home/jamesreed_fb_com/pytorch/third_party/tensorpipe/tensorpipe/common/deferred_executor.h:230
#13 0x00002b3afecf02e1 in std::__invoke_impl<void, void (tensorpipe::EventLoopDeferredExecutor::*)(std::string), tensorpipe::EventLoopDeferredExecutor*, std::string> (__t=<optimized out>, __f=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/invoke.h:73
#14 std::__invoke<void (tensorpipe::EventLoopDeferredExecutor::*)(std::string), tensorpipe::EventLoopDeferredExecutor*, std::string> (__fn=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/invoke.h:95
#15 std::thread::_Invoker<std::tuple<void (tensorpipe::EventLoopDeferredExecutor::*)(std::string), tensorpipe::EventLoopDeferredExecutor*, std::string> >::_M_invoke<0ul, 1ul, 2ul> (this=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/thread:234
#16 std::thread::_Invoker<std::tuple<void (tensorpipe::EventLoopDeferredExecutor::*)(std::string), tensorpipe::EventLoopDeferredExecutor*, std::string> >::operator() (this=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/thread:243
#17 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tensorpipe::EventLoopDeferredExecutor::*)(std::string), tensorpipe::EventLoopDeferredExecutor*, std::string> > >::_M_run (this=<optimized out>) at /opt/rh/devtoolset-7/root/usr/include/c++/7/thread:186
#18 0x00002b3af9d0c73f in execute_native_thread_routine () from /home/jamesreed_fb_com/pytorch/torch/lib/libtorch.so
#19 0x00002b3abfcadea5 in start_thread () from /lib64/libpthread.so.0
#20 0x00002b3abffc09fd in clone () from /lib64/libc.so.6
```
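For context on the crashing frame: `loadFdsFromArray` is called with `array=0x10`, i.e. an essentially-null pointer, while the shm transport is receiving file descriptors over a Unix domain socket. Below is a minimal sketch of the SCM_RIGHTS receive pattern involved; it is not TensorPipe's actual implementation (the function name, `numFds`, and buffer sizes are illustrative), but it shows where a missing or truncated control message could yield exactly this kind of invalid FD-array pointer if it were dereferenced unchecked.

```cpp
// Sketch of receiving file descriptors over a Unix domain socket via
// SCM_RIGHTS. Illustrative only; not TensorPipe's code.
#include <sys/socket.h>
#include <sys/uio.h>
#include <cstring>
#include <vector>

// Receive `numFds` descriptors plus a small payload from `sockFd`.
// Returns false if the ancillary data is missing or truncated -- the
// situation that would leave an invalid cmsg data pointer (cf. array=0x10
// in frame #0) if it were read without checking.
bool recvPayloadAndFdsSketch(int sockFd, void* payload, size_t payloadLen,
                             std::vector<int>& fds, size_t numFds) {
  struct iovec iov;
  iov.iov_base = payload;
  iov.iov_len = payloadLen;

  // Control buffer sized for `numFds` descriptors worth of ancillary data.
  std::vector<char> ctrl(CMSG_SPACE(sizeof(int) * numFds));

  struct msghdr msg;
  std::memset(&msg, 0, sizeof(msg));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl.data();
  msg.msg_controllen = ctrl.size();

  ssize_t rv = ::recvmsg(sockFd, &msg, 0);
  if (rv < 0 || (msg.msg_flags & MSG_CTRUNC)) {
    return false;  // error or truncated ancillary data
  }

  struct cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  // If the sender did not attach the expected SCM_RIGHTS message, cmsg may
  // be null or the wrong size; bail out instead of reading descriptors
  // from an invalid pointer.
  if (cmsg == nullptr || cmsg->cmsg_level != SOL_SOCKET ||
      cmsg->cmsg_type != SCM_RIGHTS ||
      cmsg->cmsg_len != CMSG_LEN(sizeof(int) * numFds)) {
    return false;
  }

  const int* fdArray = reinterpret_cast<const int*>(CMSG_DATA(cmsg));
  fds.assign(fdArray, fdArray + numFds);
  return true;
}
```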
@jamesr66a @kwen2501