pytorch / torchdistx

Torch Distributed Experimental
BSD 3-Clause "New" or "Revised" License

Unable to build torchdistx for PT 2.0 #73

Open Vatshank opened 1 year ago

Vatshank commented 1 year ago

Hi!

Describe the bug: I am trying to build torchdistx from source by following the instructions in the README. I am running:

pip install --upgrade -r requirements.txt -r use-cpu.txt

cmake -DTORCHDIST_INSTALL_STANDALONE=ON -B build
cmake --build build # <- This errors out

Running cmake --build build fails with the following error:

[ 12%] Building CXX object src/cc/torchdistx/CMakeFiles/torchdistx.dir/deferred_init.cc.o
[ 25%] Building CXX object src/cc/torchdistx/CMakeFiles/torchdistx.dir/fake.cc.o
[ 37%] Building CXX object src/cc/torchdistx/CMakeFiles/torchdistx.dir/stack_utils.cc.o
[ 50%] Linking CXX shared library libtorchdistx.so
[ 50%] Built target torchdistx
[ 62%] Building CXX object src/python/torchdistx/_C/CMakeFiles/torchdistx-py.dir/deferred_init.cc.o
/home/ubuntu/repos/torchdistx/src/python/torchdistx/_C/deferred_init.cc:24:14: error: ‘torch::TypeError’ has not been declared
 using torch::TypeError;
              ^~~~~~~~~
/home/ubuntu/repos/torchdistx/src/python/torchdistx/_C/deferred_init.cc: In function ‘pybind11::object torchdistx::python::{anonymous}::materializeVariable(const pybind11::object&)’:
/home/ubuntu/repos/torchdistx/src/python/torchdistx/_C/deferred_init.cc:64:11: error: ‘TypeError’ was not declared in this scope
     throw TypeError{"`var` has to be a `Variable`, but got `%s`.", Py_TYPE(naked_var)->tp_name};
           ^~~~~~~~~
/home/ubuntu/repos/torchdistx/src/python/torchdistx/_C/deferred_init.cc:64:11: note: suggested alternatives:
In file included from /opt/conda/envs/alpa/lib/python3.9/site-packages/torch/include/c10/core/Device.h:5:0,
                 from /opt/conda/envs/alpa/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:11,
                 from /opt/conda/envs/alpa/lib/python3.9/site-packages/torch/include/ATen/core/Tensor.h:3,
                 from /opt/conda/envs/alpa/lib/python3.9/site-packages/torch/include/ATen/Tensor.h:3,
                 from /home/ubuntu/repos/torchdistx/src/python/torchdistx/_C/deferred_init.cc:9:
/opt/conda/envs/alpa/lib/python3.9/site-packages/torch/include/c10/util/Exception.h:246:15: note:   ‘c10::TypeError’
 class C10_API TypeError : public Error {
               ^~~~~~~~~
/opt/conda/envs/alpa/lib/python3.9/site-packages/torch/include/c10/util/Exception.h:246:15: note:   ‘c10::TypeError’
/opt/conda/envs/alpa/lib/python3.9/site-packages/torch/include/c10/util/Exception.h:246:15: note:   ‘c10::TypeError’
make[2]: *** [src/python/torchdistx/_C/CMakeFiles/torchdistx-py.dir/build.make:76: src/python/torchdistx/_C/CMakeFiles/torchdistx-py.dir/deferred_init.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:914: src/python/torchdistx/_C/CMakeFiles/torchdistx-py.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

Please let me know if I am doing something silly here, or if torchdistx is not meant to support newer versions of PT. (If it is not, is there another way to use the deferred_init or fake_tensor APIs in PyTorch?)

Describe how to reproduce:

pip install --upgrade -r requirements.txt -r use-cpu.txt

cmake -DTORCHDIST_INSTALL_STANDALONE=ON -B build
cmake --build build # <- This errors out

Environment:

Additional context: The build works with PT 1.12 and PT 1.13 but not with PT 2.0. I am trying to get Alpa to work with PT 2.0, and Alpa uses torchdistx. Right now, Alpa works with PT 1.12 and PT 1.13 (the latter with a minor change) but not with PT 2.0.
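
Until the build is fixed, a script that builds torchdistx could fail early on PyTorch versions that have not been reported to work. The following is only a minimal sketch, not part of torchdistx: the names KNOWN_GOOD_SERIES, release_series, and is_reported_to_build are hypothetical, and the version list is an assumption taken solely from the report above (1.12 and 1.13 build, 2.0 does not).

```python
import re
from typing import Tuple

# Assumption: the only release series this report names as building successfully.
KNOWN_GOOD_SERIES = {(1, 12), (1, 13)}


def release_series(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '2.0.1+cu117'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_reported_to_build(version: str) -> bool:
    """True if the version falls in a series the report says builds cleanly."""
    return release_series(version) in KNOWN_GOOD_SERIES
```

With PyTorch installed, this could be called as is_reported_to_build(torch.__version__) before invoking cmake. It only encodes what this thread reports; it does not detect the underlying torch::TypeError change.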

prajdabre commented 1 year ago

How did you get torchdistx to work with PT 1.13? That would be really helpful. Thanks.