pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Pytorch C++ API with cuda : Expected object of backend CPU but got backend CUDA for sequence element 1 in sequence argument at position #1 'tensors' #14270

Open IcewineChen opened 5 years ago

IcewineChen commented 5 years ago

πŸ› Bug

Thanks for your team's great work! While using the PyTorch C++ API on GPU, I hit a confusing bug: when I try to load a .pt file as a module and then run a forward pass, I get an exception.

To Reproduce

Here are my code and the exception. The .pt file is generated by torch.jit.trace(model, example).cuda()

my code:

std::shared_ptr<torch::jit::script::Module> module = torch::jit::load("my_model_path.pt"); 
module->to(torch::kCUDA);
std::vector<torch::jit::IValue> inputs;
inputs.push_back(torch::ones({model_input_size}).cuda());
auto output = module->forward(inputs).toTensor();

The state of the variables: I have checked that the tensor pushed into the inputs vector is Variable[CUDAFloatType], and that model.pt was generated on CUDA.

Exception: 
terminate called after throwing an instance of 'c10::Error'
  what():  Expected object of backend CPU but got backend CUDA for sequence element 1 in sequence argument at position #1 'tensors' (checked_tensor_list_unwrap at /pytorch/aten/src/ATen/Utils.h:87)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fd638cc1ba1 in /home/chr/action-sdk/libs/libtorch-latest-gpu/libtorch-gpu/libtorch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fd638cc146a in /home/chr/action-sdk/libs/libtorch-latest-gpu/libtorch-gpu/libtorch/lib/libc10.so)
frame #2: <unknown function> + 0x7b22d8 (0x7fd65adf72d8 in /home/chr/action-sdk/libs/libtorch-latest-gpu/libtorch-gpu/libtorch/lib/libcaffe2.so)
frame #3: <unknown function> + 0x7b29bc (0x7fd65adf79bc in /home/chr/action-sdk/libs/libtorch-latest-gpu/libtorch-gpu/libtorch/lib/libcaffe2.so)
frame #4: at::native::cat(c10::ArrayRef<at::Tensor>, long) + 0xa4 (0x7fd65ad09624 in /home/chr/action-sdk/libs/libtorch-latest-gpu/libtorch-gpu/libtorch/lib/libcaffe2.so)
frame #5: at::TypeDefault::cat(c10::ArrayRef<at::Tensor>, long) const + 0x4f (0x7fd65aed7cff in /home/chr/action-sdk/libs/libtorch-latest-gpu/libtorch-gpu/libtorch/lib/libcaffe2.so)
frame #6: torch::autograd::VariableType::cat(c10::ArrayRef<at::Tensor>, long) const + 0x1bc (0x7fd6699d5cdc in /home/chr/action-sdk/libs/libtorch-latest-gpu/libtorch-gpu/libtorch/lib/libtorch.so.1)
frame #7: <unknown function> + 0x52b2a8 (0x7fd669b022a8 in /home/chr/action-sdk/libs/libtorch-latest-gpu/libtorch-gpu/libtorch/lib/libtorch.so.1)
frame #8: torch::jit::ConstantPropagation(torch::jit::Node*, bool) + 0x450 (0x7fd669c06c30 in /home/chr/action-sdk/libs/libtorch-latest-gpu/libtorch-gpu/libtorch/lib/libtorch.so.1)
frame #9: torch::jit::ConstantPropagation(torch::jit::Block*, bool) + 0x44 (0x7fd669c07a14 in /home/chr/action-sdk/libs/libtorch-latest-gpu/libtorch-gpu/libtorch/lib/libtorch.so.1)
frame #10: torch::jit::ConstantPropagation(std::shared_ptr<torch::jit::Graph>&) + 0x18 (0x7fd669c07b18 in /home/chr/action-sdk/libs/libtorch-latest-gpu/libtorch-gpu/libtorch/lib/libtorch.so.1)
frame #11: <unknown function> + 0x5d39a0 (0x7fd669baa9a0 in /home/chr/action-sdk/libs/libtorch-latest-gpu/libtorch-gpu/libtorch/lib/libtorch.so.1)
frame #12: torch::jit::GraphExecutor::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) + 0x19d (0x7fd669bab2cd in /home/chr/action-sdk/libs/libtorch-latest-gpu/libtorch-gpu/libtorch/lib/libtorch.so.1)
frame #13: torch::jit::script::Method::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) + 0xb4 (0x456096 in ./action)
frame #14: torch::jit::script::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >) + 0x4a (0x4560f8 in ./action)
frame #15: torch::jit::script::Module::forward(std::vector<c10::IValue, std::allocator<c10::IValue> >) + 0x81 (0x456f8f in ./action)
frame #16: main + 0x550 (0x4538df in ./action)
frame #17: __libc_start_main + 0xf0 (0x7fd6380b2830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #18: _start + 0x29 (0x43e0f9 in ./action)

[1]    35218 abort (core dumped)

I have read the source code and found that module->forward() only accepts a vector of IValue. But the tensors in that vector fail the device check in the tensor library. Could you give me some advice on how to make the vector pass ATen's check? Thank you very much.

Environment

cindycia commented 5 years ago

I have the same issue. Have you solved this problem?

uvaidya commented 5 years ago

I am also seeing a similar issue in one of my experiments. Can you give some pointers on how you overcame it, @IcewineChen?

Regards

soumith commented 5 years ago

@uvaidya @IcewineChen can you check if this still reproduces on 1.0.0 stable, or on pytorch nightly build? I believe we fixed this now.

IcewineChen commented 5 years ago

> @uvaidya @IcewineChen can you check if this still reproduces on 1.0.0 stable, or on pytorch nightly build? I believe we fixed this now.

@soumith Sorry for interrupting you. I have tested, but the program still gets an error. I'm sure I use cuda() to move the tensor to the GPU device, and the module has been moved to the GPU as well. With both pytorch-nightly (dev 11.28) and pytorch 1.0 stable, the error looks like this:

terminate called after throwing an instance of 'c10::Error'
  what():  expected type CUDAFloatType but got CPUFloatType (compute_types at /pytorch/aten/src/ATen/native/TensorIterator.cpp:134)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7ff6e60f5d31 in /home/chr/action-sdk/libs/libtorch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7ff6e60f55fa in /home/chr/action-sdk/libs/libtorch/lib/libc10.so)
frame #2: at::TensorIterator::compute_types() + 0x3b5 (0x7ff7081aa055 in /home/chr/action-sdk/libs/libtorch/lib/libcaffe2.so)
frame #3: at::TensorIterator::Builder::build() + 0x46 (0x7ff7081abcc6 in /home/chr/action-sdk/libs/libtorch/lib/libcaffe2.so)
frame #4: at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&) + 0x2c4 (0x7ff7081ac634 in /home/chr/action-sdk/libs/libtorch/lib/libcaffe2.so)
frame #5: at::native::add_out(at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Scalar) + 0x77 (0x7ff708099717 in /home/chr/action-sdk/libs/libtorch/lib/libcaffe2.so)
frame #6: at::TypeDefault::add_(at::Tensor&, at::Tensor const&, c10::Scalar) const + 0x68 (0x7ff70839f198 in /home/chr/action-sdk/libs/libtorch/lib/libcaffe2.so)
frame #7: torch::autograd::VariableType::add_(at::Tensor&, at::Tensor const&, c10::Scalar) const + 0x1d6 (0x7ff718e6b7d6 in /home/chr/action-sdk/libs/libtorch/lib/libtorch.so.1)
frame #8: <unknown function> + 0x635867 (0x7ff719053867 in /home/chr/action-sdk/libs/libtorch/lib/libtorch.so.1)
frame #9: <unknown function> + 0x6839f6 (0x7ff7190a19f6 in /home/chr/action-sdk/libs/libtorch/lib/libtorch.so.1)
frame #10: torch::jit::InterpreterState::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) + 0x22 (0x7ff71909cad2 in /home/chr/action-sdk/libs/libtorch/lib/libtorch.so.1)
frame #11: <unknown function> + 0x658a5c (0x7ff719076a5c in /home/chr/action-sdk/libs/libtorch/lib/libtorch.so.1)
frame #12: torch::jit::script::Method::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) + 0xb4 (0x4689aa in ./action)
frame #13: torch::jit::script::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >) + 0x4a (0x468a0c in ./action)
frame #14: torch::jit::script::Module::forward(std::vector<c10::IValue, std::allocator<c10::IValue> >) + 0x81 (0x4698a7 in ./action)
frame #15: init::GpuInit(std::shared_ptr) + 0x182 (0x4920ad in ./action)
frame #16: main + 0x4a4 (0x4931f3 in ./action)
frame #17: __libc_start_main + 0xf0 (0x7ff6e50d8830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #18: _start + 0x29 (0x463109 in ./action)

[1] 15839 abort (core dumped) ./action ~/experiment/video-classification-3d-cnn-pytorch/resnet34-ucf101.pt

And here is my trace code, written in Python:

example = torch.rand(size=(1, 3, 64, 112, 112))
traced_script_module = torch.jit.trace(model, example)
traced_script_module.save("resnet34-ucf101.pt")

Could you give me some advice?

Sierkinhane commented 5 years ago

I met the same problem. Have you solved it, @IcewineChen?

ygean commented 4 years ago

@Sierkinhane @IcewineChen Have you solved it?