
Error: DeviceGuardImpl for cpu is not available (static linking PyTorch) #14367

Open jainshobhit opened 5 years ago

jainshobhit commented 5 years ago

I am getting this:

Error: unhandled exception: p ASSERT FAILED at /data/Storage/Development/nimtorch/aten/include/c10/impl/DeviceGuardImplInterface.h:130, please report a bug to PyTorch. DeviceGuardImpl for cpu is not available (getDeviceGuardImpl at /data/Storage/Development/nimtorch/aten/include/c10/impl/DeviceGuardImplInterface.h:130)

I wonder if there is some initialization code that is not getting called when linking statically. Is the static library build untested and broken?

Previous content

Hi, I followed the steps in the tutorial https://pytorch.org/tutorials/advanced/cpp_export.html to port my PyTorch code to C++ and have successfully been able to run the model in C++. I compiled my C++ code using the CMakeLists.txt file, as described in the tutorial. However, I now want to run my code on a different platform (with a similar configuration), so I want to compile the code with static linking of the libraries. I tried the following command for building the application:

cmake -D CMAKE_PREFIX_PATH=/home/pytorch/libtorch -D OpenCV_DIR=/home/opencv/build -D BUILD_SHARED_LIBS=OFF -DCMAKE_EXE_LINKER_FLAGS="-static" ..

which gave the following error:

/usr/bin/ld: attempted static link of dynamic object /home/pytorch/libtorch/lib/libtorch.so

Can anyone help me by explaining what should be the best way for doing static linking of the library?

ezyang commented 5 years ago

This sounds like our cmake static linking support is not working at the moment.

sinkingsugar commented 5 years ago

This is probably unrelated, since I'm building in "aten only" mode. Everything looks fine: I got all the static libraries, CUDA versions included, but when I try to run the executable (which linked successfully), I am getting this:

Error: unhandled exception: p ASSERT FAILED at /data/Storage/Development/nimtorch/aten/include/c10/impl/DeviceGuardImplInterface.h:130, please report a bug to PyTorch. DeviceGuardImpl for cpu is not available (getDeviceGuardImpl at /data/Storage/Development/nimtorch/aten/include/c10/impl/DeviceGuardImplInterface.h:130)

I wonder if there is some initialization code that is not getting called when linking statically. Is the static library build untested and broken?

Sadly, the project I'm working on really needs static linking, and I'm quite blocked at the moment.

sinkingsugar commented 5 years ago

As I was thinking, I needed to call C10_REGISTER_GUARD_IMPL(CPU, CPUGuardImpl); and so on. Static libraries and global variable initialization are a bit stochastic :)
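
For reference, a minimal sketch of what that registration looks like, using the macro and the header paths that appear elsewhere in this thread (include paths may differ between libtorch versions):

    #include <c10/impl/DeviceGuardImplInterface.h> // defines C10_REGISTER_GUARD_IMPL
    #include <ATen/detail/CPUGuardImpl.h>          // the CPU guard implementation

    // At namespace scope, so the registrar object's constructor runs during
    // static initialization of this translation unit and registers the guard.
    C10_REGISTER_GUARD_IMPL(CPU, at::detail::CPUGuardImpl);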

ezyang commented 5 years ago

@sinkingsugar Err, did you do that and did that make it work?

I think my best suggestion for you is to link with -Wl,--whole-archive which will prevent the CPU initialization code from getting pruned away. We should make this easier, but that's the immediate way to solve your post-linking runtime problem.
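
In a CMakeLists.txt, that might look like the following sketch (the library names here are illustrative; your build may link a different set of archives):

    # Archives between the two flags keep all of their object files, so their
    # static initializers survive the link; everything after is linked normally.
    target_link_libraries(example-app
        "-Wl,--whole-archive" torch c10 "-Wl,--no-whole-archive"
        pthread dl)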

sinkingsugar commented 5 years ago

That worked; I had to add a few more registrations, for CUDA and the CUDA hooks as well. It's kind of expected for C++ and globals. I wanted to avoid whole-archive because the binary would become huge :) but for my usage it is fine, mostly because it's limited in scope.

I did have a few issues with MKL LAPACK and CUDA LAPACK clashing when using MAGMA, though (sadly, it seems they cannot coexist when statically linked).

ezyang commented 5 years ago

I think it would be reasonable for us to provide some initialization functions for static library users, so they can say what bits they want and those initializers will be guaranteed to get run. We have all sorts of random static initialization registries running around, so you may find some other things accidentally got pruned; fortunately it will be easy to tell when that happened.

We have some notes about how to statically link CUDA in our cmake files, if that helps at all.

LearnedLately commented 5 years ago

I have a similar issue trying to link statically. I was able to create a static library using add_library(as_net STATIC myfile.cpp) in CMakeLists.txt, but when I link to it in another application (which doesn't use cmake) I get hundreds of undefined reference errors. Has anyone found a solution to this? Using -Wl,--whole-archive gave 'multiple definition' errors when linking.

sinkingsugar commented 5 years ago

You don't want to touch CMakeLists.txt; you just need to pass -DBUILD_SHARED_LIBS=OFF

for example/reference: https://github.com/fragcolor-xyz/nimtorch/blob/50fcdd08aa41fae3b527340703c48a9a96630ace/.travis.yml#L79

LearnedLately commented 5 years ago

@sinkingsugar Thanks, I tried that, but still ended up with undefined references. I didn't see a CMakeLists.txt in your repository, but you must have one to run cmake, right? I don't see how to create a static library without add_library(myproject STATIC myfile.cpp). The -DBUILD_SHARED_LIBS=OFF flag alone isn't enough.

sinkingsugar commented 5 years ago

Well, I also build with -DBUILD_ATEN_ONLY=ON; with it off I have no idea :smile: it might be more complex.
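
For reference, the configure step from the linked Travis file boils down to roughly this sketch (the paths are illustrative):

    cmake -DBUILD_SHARED_LIBS=OFF -DBUILD_ATEN_ONLY=ON /path/to/pytorch
    cmake --build . --target install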

chopwoodwater commented 5 years ago

@jainshobhit @ezyang Have you solved the problem yet?

ezyang commented 5 years ago

As the saga in #20742 suggests, no, this does not seem to be fixed.

chopwoodwater commented 5 years ago

That worked; I had to add a few more registrations, for CUDA and the CUDA hooks as well. It's kind of expected for C++ and globals. I wanted to avoid whole-archive because the binary would become huge :) but for my usage it is fine, mostly because it's limited in scope.

I did have a few issues with MKL LAPACK and CUDA LAPACK clashing when using MAGMA, though (sadly, it seems they cannot coexist when statically linked).

@sinkingsugar Could you please share your registration code here? I am facing the same issue of the CPU initialization code getting pruned away, and I cannot use the --whole-archive option because of constraints in my setup. Thanks.

@ezyang @soumith Below is what I did to register the device to fix the error "DeviceGuardImpl for cpu is not available". The executable builds successfully; however, it fails to run correctly.

Below are the two lines I added to register the device.

#include <torch/script.h> // One-stop header.
#include <ATen/detail/CPUGuardImpl.h>  //ADDED first line

#include <iostream>
#include <memory>

int main(int argc, const char* argv[]) {
  if (argc != 2) {
    std::cerr << "usage: example-app <path-to-exported-script-module>\n";
    return -1;
  }

  C10_REGISTER_GUARD_IMPL(CPU, at::detail::CPUGuardImpl); //ADDED second line

  // Deserialize the ScriptModule from a file using torch::jit::load().
  std::shared_ptr<torch::jit::script::Module> module = torch::jit::load(argv[1]);

  assert(module != nullptr);
  std::cout << "ok\n";
}

Below is the error message:

terminate called after throwing an instance of 'torch::jit::script::ErrorReport'
  what():  
unknown builtin op: aten::mul
Could not find any similar ops to aten::mul. This op may not exist or may not be currently supported in TorchScript
:

def mul(a : float, b : Tensor) -> Tensor:
  return b * a
         ~~~~~ <--- HERE
def add(a : float, b : Tensor) -> Tensor:
  return b + a
def ne(a : float, b : Tensor) -> Tensor:
  return b != a
def eq(a : float, b : Tensor) -> Tensor:
  return b == a
def lt(a : float, b : Tensor) -> Tensor:
  return b > a
def le(a : float, b : Tensor) -> Tensor:
Aborted

ezyang commented 5 years ago

I am actually quite curious how @sinkingsugar got it to work, because I have been trying to get this to work myself and there is SOOOO much broken stuff lol.

ezyang commented 5 years ago

@LearnedLately @marchss I have gotten the static linking to successfully work using -Wl,--whole-archive. I'll put up complete instructions in #21737

RuABraun commented 5 years ago

Same error as @marchss

Looking at the suggestion in #21737 now.

gemfield commented 3 years ago

If you get "PyTorch is not linked with support for cuda devices" and you are using the static libtorch libraries, that is because the CUDA registration (C10_REGISTER_GUARD_IMPL(CUDA, CUDAGuardImpl)) lives in the compile unit CUDAGuardImpl.cpp.o, which is archived into libc10_cuda.a. Make sure to use -Wl,--whole-archive on libc10_cuda.a during the link phase.
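
In CMake, that can be scoped to just that one archive, so the rest of the link stays pruned (the ${LIBTORCH_PATH} variable is a placeholder for your libtorch location):

    # Only libc10_cuda.a is wrapped, so only its static initializers
    # (including the CUDA guard registration) are force-kept.
    target_link_libraries(example-app
        "-Wl,--whole-archive" "${LIBTORCH_PATH}/lib/libc10_cuda.a" "-Wl,--no-whole-archive"
        torch c10)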

gemfield commented 3 years ago

We have provided a project called libdeepvac, which can work with either the shared or the static libtorch library. Check https://github.com/DeepVAC/libdeepvac/blob/master/CMakeLists.txt if you are interested in building against static PyTorch.

VariantXYZ commented 2 years ago

@ezyang ,

I know this is a 2-year-old issue, but is there any update on this? Passing whole-archive is a bit of an 'everything and the kitchen sink' solution that isn't really feasible in a lot of cases (just try enabling LTO and see how long it takes to build with whole-archive, if it even works in all cases).

Having a list of things to call to make sure the initializers are referenced would be perfect; people could even reduce that list based on what they specifically need.

At any rate, any solution that isn't just whole-archive would be great. I've been experimenting with ThinLTO, and the build times are horrendous.

Edit:

Specifically, the issues I'm running into are related to the schemas not being initialized (aten::empty.memory_format can't be found by Dispatcher, and so it throws).

ezyang commented 2 years ago

There's no update. I think I'd be amenable to a patchset that maintains the list of initializers. We'd need one per object file that has a static initializer in it, but there aren't that many, so I think it would be feasible to maintain by hand. Do you think you'd be able to roll up your sleeves and help? I'd be able to advise on the most obvious spots that have to get loaded.

VariantXYZ commented 2 years ago

Sure, what exactly did you have in mind? Just a raw list?

To start, I see all the TORCH_LIBRARY_IMPL stuff, but I’ve been trying to pin down what exactly sets up the dispatcher table for running scripted models.

ezyang commented 2 years ago

I was thinking we'd stick special "keep alive" symbols in each file with a regular naming convention, and then have a single mega function that calls all of them for the library.
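
Something like this sketch of the convention (every name in it is hypothetical; no such symbols exist in libtorch today):

    // In each object file that has registration side effects, e.g. RegisterCPU.cpp:
    namespace torch { namespace detail {
    // An empty function: calling it from outside pins this object file,
    // and with it the file's static initializers.
    void keep_alive_RegisterCPU() {}
    }} // namespace torch::detail

    // In one library-wide translation unit:
    namespace torch {
    void run_all_static_initializers() {
        detail::keep_alive_RegisterCPU();
        // ... one call per object file with static initializers ...
    }
    } // namespace torch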

For TS registry, you probably want the torch/csrc/jit/runtime/register_ files

VariantXYZ commented 2 years ago

@ezyang ,

Sorry for the delay on this, I've started working on it and there's a few things I've noticed.

In a simple application that loads TorchScript, there is a need to reload the schemas, since literally nothing seems to survive when linking with dead-stripping enabled. To that end, I've had to create a separate object, marked as --whole-archive, that calls TORCH_LIBRARY for 'aten' to do the registration (based on the generated RegisterSchema.cpp). I'm not sure if there's a better way to do this, but it is still preferable, since it effectively includes only the static constructors that are necessary instead of all of libtorch.

However, doing this also means I need to redeclare implementations with TORCH_LIBRARY_IMPL, as nothing in BackendSelect seems to stick, and there does not seem to be a convenient way to do this. I've been copying implementation details from wrappers like RegisterCPU or pulling them directly from at::native.

After all this is done, it will still complain about not having aten::mul or aten::add, etc., even if they are defined; this is because it is necessary to call ensure_c10_registerer_defined(). That part is easy enough.
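
That call looks roughly like this sketch; I'm assuming no public header declares the function, so it is forward-declared here to match the definition in torch/csrc/jit/runtime/register_c10_ops.cpp:

    // Forward declaration (assumption: matches the definition in
    // torch/csrc/jit/runtime/register_c10_ops.cpp).
    namespace torch { namespace jit {
    void ensure_c10_registerer_defined();
    }}

    int main() {
        // Referencing the function pins register_c10_ops.cpp.o, so the c10 op
        // registerer's static initializer is not stripped from the binary.
        torch::jit::ensure_c10_registerer_defined();
        // ... load and run the TorchScript module as usual ...
    }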

However, the next hurdle I hit is that it can't seem to find 'aten::len' which is requested in the builtin jit functions. I'm not sure how to fix this one. I can't seem to find any implementations of it, and more importantly it's operating on two lists.

Edit:

aten::len is part of register_prim_ops.cpp; I missed this... but it's defined in an anonymous namespace, and I'm not really sure how we're "expected" to access it.

VariantXYZ commented 2 years ago

I was thinking we'd stick special "keep alive" symbols in each file with a regular naming convention, and then have a single mega function that calls all of them for the library.

Maybe something even as simple as letting the user take a pre-generated source file and include it in their own project as an object that doesn't get stripped? It would also have the benefit of letting them define a custom set of operators for their own tasks.

VariantXYZ commented 2 years ago

As a note, it should also just be possible to reference the TORCH_LIBRARY_impl… functions directly, but this depends on the namespace (aten is fine). The uid on TORCH_LIBRARY_IMPL also hampers this…

ezyang commented 2 years ago

Yeah, I think the idea is to get rid of the uids, so that the names can be referenced.

codesniffer13 commented 2 years ago

Specifically, the issues I'm running into are related to the schemas not being initialized (aten::empty.memory_format can't be found by Dispatcher, and so it throws).

I'm running into this same issue. Have you gotten it working, @VariantXYZ? Any chance there's a simple solution?

VariantXYZ commented 2 years ago

@codesniffer13 sorry, I have been busy and haven't had much of a chance to look into doing this cleanly…

I ended up solving these problems by creating a small sub-library, built without stripping (using --whole-archive), that did its own dispatcher registration, referencing the functions manually from the PyTorch headers (one by one, registering functions as my program complained they were missing). It was tedious and not very portable.

I based a lot of it on what’s in this file:

https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/templates/RegisterBackendSelect.cpp

(This is just the template file, look for it in your output build directory)

Unfortunately, I haven't had much time to look into a cleaner solution (the c10 ops just have a neat little register function; maybe doing this per platform backend is the way to go?)

codesniffer13 commented 2 years ago

Thanks for the quick reply @VariantXYZ . Is your solution sharable?

I was thinking about modifying the ops files (e.g. register_c10_ops.cpp), changing the anonymous namespace to be unique, and duplicating the code that triggers the registration (i.e. C10_UNUSED Registerer& dummy = registerer();) into a source file in my project (maybe needing an extern, etc.). Thoughts?

VariantXYZ commented 2 years ago

@codesniffer13 ,

Is your solution sharable?

Currently no; I'm replying from devices that don't have access to the code. It's not too hard to explain, though.

I was thinking about modifying the ops files (e.g. register_c10_ops.cpp), changing the anonymous namespace to be unique, and duplicating the code that triggers the registration (i.e. C10_UNUSED Registerer& dummy = registerer();) into a source file in my project (maybe needing an extern, etc.). Thoughts?

For c10 specifically, you should just be able to call https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/register_c10_ops.cpp#L61 to make sure the toolchain doesn't get rid of the c10 ops registerer.

However, I had to do what you mentioned for prim:

https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/register_prim_ops.cpp
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/mobile/prim_ops_registery.cpp

I didn't need to change the namespace though, but it might change based on how your project is organized.

Depending on your use-case, you might just be able to copy:

https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/templates/RegisterBackendSelect.cpp (from the outputs folder).

You can theoretically slim this down to only the operators you need as well, and do direct calls if you know your exact backend (e.g. if it's CPU, you can find the CPU native functions directly).
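
A sketch of that slimming for a CPU-only build; whether at::native::empty_cpu's signature matches this schema depends on your libtorch version, so treat the kernel name as an assumption:

    #include <ATen/NativeFunctions.h>
    #include <torch/library.h>

    // Register only the one op you need, directly against the CPU backend,
    // instead of pulling in the whole BackendSelect file.
    TORCH_LIBRARY_IMPL(aten, CPU, m) {
        m.impl("aten::empty.memory_format", TORCH_FN(at::native::empty_cpu));
    }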

codesniffer13 commented 2 years ago

Thanks @VariantXYZ this is helpful.

For c10 specifically, you should just be able to call https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/register_c10_ops.cpp#L61 to make sure the toolchain doesn't get rid of the c10 ops registerer.

Yep, that's exactly what I ended up doing.

Depending on your use-case, you might just be able to copy: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/templates/RegisterBackendSelect.cpp (from the outputs folder). You can theoretically slim this down to only the operators you need as well, and do direct calls if you know your exact backend (e.g. if it's CPU, you can find the CPU native functions directly).

This file is helpful. I'd rather not register everything unnecessarily, so I'm trying to copy only the pieces I need (I'm CPU only for example).

I copied over empty.memory_format:

    C10_ALWAYS_INLINE
    at::Tensor empty_memory_format( <snip>

    TORCH_LIBRARY_IMPL(aten, BackendSelect, m) {
        m.impl("aten::empty.memory_format", TORCH_FN(empty_memory_format));
    };

This gets me one step further; it seems I have one more hurdle before I can rinse and repeat:

    terminate called after throwing an instance of 'c10::Error'
    what():  Could not find schema for aten::empty.memory_format but we found an implementation; did you forget to def() the operator?
    Exception raised from findSchemaOrThrow at ../aten/src/ATen/core/dispatch/Dispatcher.cpp:84 (most recent call first):

Looks like the op got registered but it has no schema. I'm trying to figure out how that part is handled. Do you recall?

VariantXYZ commented 2 years ago

RegisterSchema.cpp, off the top of my head.
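
For anyone landing here later: a sketch of what the generated RegisterSchema.cpp does for this op. The schema string below is reproduced from memory of native_functions.yaml, so copy the exact one from the RegisterSchema.cpp in your build's output directory:

    #include <torch/library.h>

    // def() registers the schema the dispatcher complained about; the impl()
    // registered earlier can then be matched against it.
    TORCH_LIBRARY(aten, m) {
        m.def("empty.memory_format(int[] size, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor");
    }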