🐛 Bug
This problem happens in both PyTorch 1.6.0 and PyTorch 1.9.0.
[W shape_type_inference.cpp:419] Warning: Constant folding in symbolic shape inference fails: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking arugment for argument index in method wrapper_index_select)
Exception raised from common_device_check_failure at /pytorch/aten/src/ATen/core/adaption.cpp:10 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fdc94ae8a22 in anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fdc94ae53db in anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: c10::impl::common_device_check_failure(c10::optional<c10::Device>&, at::Tensor const&, char const*, char const*) + 0x37e (0x7fdb9f722e9e in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0xf0e14b (0x7fdb5d21d14b in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xf0e1d2 (0x7fdb5d21d1d2 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::redispatch::index_select(c10::DispatchKeySet, at::Tensor const&, long, at::Tensor const&) + 0xb4 (0x7fdba0095714 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x3010461 (0x7fdba1823461 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x30108b5 (0x7fdba18238b5 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #8: at::index_select(at::Tensor const&, long, at::Tensor const&) + 0x14e (0x7fdb9feb450e in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::jit::onnx_constant_fold::runTorchBackendForOnnx(torch::jit::Node const*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, int) + 0x1b50 (0x7fdbb2666930 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0xafcb51 (0x7fdbb26a3b51 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #11: torch::jit::ONNXShapeTypeInference(torch::jit::Node*, std::map<std::string, c10::IValue, std::less<std::string>, std::allocator<std::pair<std::string const, c10::IValue> > > const&, int) + 0x906 (0x7fdbb26a89c6 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0xb04674 (0x7fdbb26ab674 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0xa7b320 (0x7fdbb2622320 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #14: <unknown function> + 0x500c98 (0x7fdbb20a7c98 in anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
(function ComputeConstantFolding)

[ERROR] 2021-07-06T09:34:47z src/nnfusion/util/errors.hpp 169 Check failed: '_op->get_reduction_axes_count() == 1' at src/nnfusion/core/operators/generic_op/generic_op_define/Dot.cpp:50: (no explanation given)
terminate called after throwing an instance of 'nnfusion::errors::CheckError'
  what():  Check failed: '_op->get_reduction_axes_count() == 1' at src/nnfusion/core/operators/generic_op/generic_op_define/Dot.cpp:50: (no explanation given)
Aborted (core dumped)

Traceback (most recent call last):
  File "example/bert.py", line 251, in <module>
    train_bert()
  File "example/bert.py", line 205, in train_bert
    nnf_loss = trainer(input_ids, attention_mask, labels)
  File "/src/python/nnfusion/trainer.py", line 75, in __call__
    return self.run_by_nnf(*args)
  File "/src/python/nnfusion/trainer.py", line 88, in run_by_nnf
    outs = self.runner(*args)
  File "/src/python/nnfusion/runner.py", line 46, in __call__
    return self.run_by_nnf(*args, **kwargs)
  File "/src/python/nnfusion/runner.py", line 71, in run_by_nnf
    return self._retrieve_by_desc(descs, device)(feeds)
  File "/src/python/nnfusion/runner.py", line 42, in _retrieve_by_desc
    **self._session_kwargs)
  File "/src/python/nnfusion/session.py", line 174, in __init__
    self._create_executor()
  File "/src/python/nnfusion/session.py", line 192, in _create_executor
    codegen(self._onnx_model_path, flags_str, self._workdir)
  File "/src/python/nnfusion/session.py", line 75, in codegen
    execute(command)
  File "/src/python/nnfusion/utils.py", line 33, in execute
    raise e
  File "/src/python/nnfusion/utils.py", line 30, in execute
    **kwargs)
  File "anaconda3/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "anaconda3/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'nnfusion /tmp/nnf_djnq8yot/nnf.onnx -f onnx -fextern_result_memory=True -fautodiff=True -ftraining_mode=True -ftraining_optimizer='{"optimizer": "SGD", "learning_rate": 0.0001}' -fblockfusion_level=0 -fenable_all_bert_fusion=True' returned non-zero exit status 134.
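For context, the first warning is raised by ONNX constant folding during torch.onnx.export. Below is a minimal, hypothetical sketch (not the report's example/bert.py; the toy model, tensor sizes, and opset are my assumptions) of the export pattern involved: a CUDA-resident model whose graph contains an index_select-style lookup, exported with do_constant_folding=True. It may or may not reproduce the warning on a given PyTorch build; it is only meant to show where in the export path the failure above occurs.

    # Hypothetical repro sketch -- NOT the original example/bert.py from this report.
    # An embedding lookup on a CUDA model lowers to index_select in the traced graph;
    # constant folding inside torch.onnx.export then evaluates folded ops, which is
    # the code path shown in the stack trace above (runTorchBackendForOnnx).
    import io
    import torch

    class TinyLookup(torch.nn.Module):
        """Toy stand-in for a BERT embedding layer (sizes are assumptions)."""
        def __init__(self):
            super().__init__()
            self.emb = torch.nn.Embedding(30522, 768)

        def forward(self, input_ids):
            return self.emb(input_ids)

    model = TinyLookup().cuda().eval()                    # requires a CUDA device
    input_ids = torch.randint(0, 30522, (1, 128), device="cuda")

    buf = io.BytesIO()
    torch.onnx.export(
        model,
        (input_ids,),
        buf,
        opset_version=12,
        do_constant_folding=True,                         # folding path from the trace
        input_names=["input_ids"],
        output_names=["hidden_states"],
    )

The second failure (the '_op->get_reduction_axes_count() == 1' check in NNFusion's Dot op) happens later, inside the nnfusion codegen step launched by the subprocess command at the end of the traceback, so it appears to be a separate issue from the export-time warning.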
Thanks for the report @xysmlx! I will look into it ASAP! (I'm a bot).