pytorch / tensorpipe

A tensor-aware point-to-point communication primitive for machine learning

Sending an empty (numel==0) CUDA tensor results in a SIGABRT crash #447

Open pbelevich opened 2 years ago

pbelevich commented 2 years ago
terminate called after throwing an instance of 'std::runtime_error'
  what():  In cudaDeviceForPointer at tensorpipe/common/cuda.h:162 "cudaLib.pointerGetAttribute( &deviceIdx, CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL, reinterpret_cast<CUdeviceptr>(ptr))(1) CUDA_ERROR_INVALID_VALUE (invalid argument)"
SIGABRT(6), PID: 4036392, Thread 4036392:
frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x6d (0x7f656941be9d in /home/pbelevich/local/anaconda3/envs/pippy_pt_dev/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::FatalSignalHandler::fatalSignalHandler(int) + 0x15a (0x7f656941c28a in /home/pbelevich/local/anaconda3/envs/pippy_pt_dev/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x12c20 (0x7f65bd868c20 in /lib64/libpthread.so.0)
frame #3: gsignal + 0x10f (0x7f65bd4dfa4f in /lib64/libc.so.6)
frame #4: abort + 0x127 (0x7f65bd4b2db5 in /lib64/libc.so.6)
frame #5: <unknown function> + 0x9009b (0x7f658fde409b in /lib64/libstdc++.so.6)
frame #6: <unknown function> + 0x9653c (0x7f658fdea53c in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x95559 (0x7f658fde9559 in /lib64/libstdc++.so.6)
frame #8: __gxx_personality_v0 + 0x2a8 (0x7f658fde9ed8 in /lib64/libstdc++.so.6)
frame #9: <unknown function> + 0x10b03 (0x7f65900f9b03 in /lib64/libgcc_s.so.1)
frame #10: _Unwind_Resume + 0x12d (0x7f65900fa41d in /lib64/libgcc_s.so.1)
frame #11: <unknown function> + 0xde56d5 (0x7f6574caa6d5 in /home/pbelevich/local/anaconda3/envs/pippy_pt_dev/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::distributed::rpc::TensorPipeAgent::pipeWrite(std::shared_ptr<tensorpipe::Pipe> const&, c10::intrusive_ptr<torch::distributed::rpc::Message, c10::detail::intrusive_target_default_null_type<torch::distributed::rpc::Message> >, std::vector<c10::Device, std::allocator<c10::Device> >&&, std::vector<c10::Stream, std::allocator<c10::Stream> >, std::function<void (tensorpipe::Error const&)>) + 0x954 (0x7f657781ed04 in /home/pbelevich/local/anaconda3/envs/pippy_pt_dev/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::distributed::rpc::TensorPipeAgent::send(torch::distributed::rpc::WorkerInfo const&, c10::intrusive_ptr<torch::distributed::rpc::Message, c10::detail::intrusive_target_default_null_type<torch::distributed::rpc::Message> >, float, std::unordered_map<c10::Device, c10::Device, std::hash<c10::Device>, std::equal_to<c10::Device>, std::allocator<std::pair<c10::Device const, c10::Device> > > const&) + 0x161d (0x7f657782a25d in /home/pbelevich/local/anaconda3/envs/pippy_pt_dev/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
lw commented 2 years ago

Could you give a bit more context? Where was this issue encountered? Which channel was being used? (Unfortunately, the stack trace shows where the exception was caught, not where it was thrown.)

My guess is that PyTorch is passing an invalid pointer (likely nullptr) in this case. It's debatable whether TensorPipe should accept and handle a nullptr, or whether it's PyTorch's responsibility to only pass valid pointers.