🐛 Bug
This problem happens in both PyTorch 1.6.0 and PyTorch 1.9.0.
[W shape_type_inference.cpp:419] Warning: Constant folding in symbolic shape inference fails: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking arugment for argument index in method wrapper_index_select)
Exception raised from common_device_check_failure at /pytorch/aten/src/ATen/core/adaption.cpp:10 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fdc94ae8a22 in anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fdc94ae53db in anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: c10::impl::common_device_check_failure(c10::optional<c10::Device>&, at::Tensor const&, char const*, char const*) + 0x37e (0x7fdb9f722e9e in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0xf0e14b (0x7fdb5d21d14b in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xf0e1d2 (0x7fdb5d21d1d2 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::redispatch::index_select(c10::DispatchKeySet, at::Tensor const&, long, at::Tensor const&) + 0xb4 (0x7fdba0095714 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x3010461 (0x7fdba1823461 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x30108b5 (0x7fdba18238b5 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #8: at::index_select(at::Tensor const&, long, at::Tensor const&) + 0x14e (0x7fdb9feb450e in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::jit::onnx_constant_fold::runTorchBackendForOnnx(torch::jit::Node const*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, int) + 0x1b50 (0x7fdbb2666930 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0xafcb51 (0x7fdbb26a3b51 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #11: torch::jit::ONNXShapeTypeInference(torch::jit::Node*, std::map<std::string, c10::IValue, std::less<std::string>, std::allocator<std::pair<std::string const, c10::IValue> > > const&, int) + 0x906 (0x7fdbb26a89c6 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0xb04674 (0x7fdbb26ab674 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0xa7b320 (0x7fdbb2622320 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #14: <unknown function> + 0x500c98 (0x7fdbb20a7c98 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
(function ComputeConstantFolding)

[ERROR] 2021-07-06T09:34:47z src/nnfusion/util/errors.hpp 169 Check failed: '_op->get_reduction_axes_count() == 1' at src/nnfusion/core/operators/generic_op/generic_op_define/Dot.cpp:50: (no explanation given)
terminate called after throwing an instance of 'nnfusion::errors::CheckError'
  what():  Check failed: '_op->get_reduction_axes_count() == 1' at src/nnfusion/core/operators/generic_op/generic_op_define/Dot.cpp:50: (no explanation given)
Aborted (core dumped)

Traceback (most recent call last):
  File "example/bert.py", line 251, in <module>
    train_bert()
  File "example/bert.py", line 205, in train_bert
    nnf_loss = trainer(input_ids, attention_mask, labels)
  File "/src/python/nnfusion/trainer.py", line 75, in __call__
    return self.run_by_nnf(*args)
  File "/src/python/nnfusion/trainer.py", line 88, in run_by_nnf
    outs = self.runner(*args)
  File "/src/python/nnfusion/runner.py", line 46, in __call__
    return self.run_by_nnf(*args, **kwargs)
  File "/src/python/nnfusion/runner.py", line 71, in run_by_nnf
    return self._retrieve_by_desc(descs, device)(feeds)
  File "/src/python/nnfusion/runner.py", line 42, in _retrieve_by_desc
    **self._session_kwargs)
  File "/src/python/nnfusion/session.py", line 174, in __init__
    self._create_executor()
  File "/src/python/nnfusion/session.py", line 192, in _create_executor
    codegen(self._onnx_model_path, flags_str, self._workdir)
  File "/src/python/nnfusion/session.py", line 75, in codegen
    execute(command)
  File "/src/python/nnfusion/utils.py", line 33, in execute
    raise e
  File "/src/python/nnfusion/utils.py", line 30, in execute
    **kwargs)
  File "anaconda3/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "anaconda3/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'nnfusion /tmp/nnf_djnq8yot/nnf.onnx -f onnx -fextern_result_memory=True -fautodiff=True -ftraining_mode=True -ftraining_optimizer='{"optimizer": "SGD", "learning_rate": 0.0001}' -fblockfusion_level=0 -fenable_all_bert_fusion=True' returned non-zero exit status 134.
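For context, the first warning is raised by ONNX constant folding during torch.onnx.export. Below is a minimal, hypothetical sketch (not the report's example/bert.py; the toy model, tensor sizes, and opset are my assumptions) of the export pattern involved: a CUDA-resident model whose graph contains an index_select-style lookup, exported with do_constant_folding=True. It may or may not reproduce the warning on a given PyTorch build; it is only meant to show where in the export path the failure above occurs.

    # Hypothetical repro sketch -- NOT the original example/bert.py from this report.
    # An embedding lookup on a CUDA model lowers to index_select in the traced graph;
    # constant folding inside torch.onnx.export then evaluates folded ops, which is
    # the code path shown in the stack trace above (runTorchBackendForOnnx).
    import io
    import torch

    class TinyLookup(torch.nn.Module):
        """Toy stand-in for a BERT embedding layer (sizes are assumptions)."""
        def __init__(self):
            super().__init__()
            self.emb = torch.nn.Embedding(30522, 768)

        def forward(self, input_ids):
            return self.emb(input_ids)

    model = TinyLookup().cuda().eval()                    # requires a CUDA device
    input_ids = torch.randint(0, 30522, (1, 128), device="cuda")

    buf = io.BytesIO()
    torch.onnx.export(
        model,
        (input_ids,),
        buf,
        opset_version=12,
        do_constant_folding=True,                         # folding path from the trace
        input_names=["input_ids"],
        output_names=["hidden_states"],
    )

The second failure (the '_op->get_reduction_axes_count() == 1' check in NNFusion's Dot op) happens later, inside the nnfusion codegen step launched by the subprocess command at the end of the traceback, so it appears to be a separate issue from the export-time warning.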
Thanks for the report @xysmlx! I will look into it ASAP! (I'm a bot).