Open cloudhan opened 3 years ago
Simply enabling CUPTI causes torch_cpu.dll to reference cudart symbols:
> ninja.exe .\bin\torch_cpu.dll
[1/1] Linking CXX shared library bin\torch_cpu.dll
FAILED: bin/torch_cpu.dll lib/torch_cpu.lib
cmd.exe /C "cd . && "C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe" -E vs_link_dll --intdir=caffe2\CMakeFiles\torch_cpu.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100190~1.0\x64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100190~1.0\x64\mt.exe --manifests -- C:\PROGRA~2\MICROS~3\2019\ENTERP~1\VC\Tools\MSVC\1429~1.300\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\torch_cpu.rsp /out:bin\torch_cpu.dll /implib:lib\torch_cpu.lib /pdb:bin\torch_cpu.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /INCREMENTAL:NO -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/caffe2_protos.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/onnx.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx2.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx512.lib && cd ."
LINK: command "C:\PROGRA~2\MICROS~3\2019\ENTERP~1\VC\Tools\MSVC\1429~1.300\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\torch_cpu.rsp /out:bin\torch_cpu.dll /implib:lib\torch_cpu.lib /pdb:bin\torch_cpu.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /INCREMENTAL:NO -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/caffe2_protos.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/onnx.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx2.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx512.lib /MANIFEST /MANIFESTFILE:bin\torch_cpu.dll.manifest" failed (exit code 1120) with the following output:
Creating library lib\torch_cpu.lib and object lib\torch_cpu.exp
kineto.lib(CudaDeviceProperties.cpp.obj) : error LNK2019: unresolved external symbol cudaGetDeviceCount referenced in function "class std::vector<struct cudaDeviceProp,class std::allocator<struct cudaDeviceProp> > const __cdecl libkineto::createDeviceProps(void)" (?createDeviceProps@libkineto@@YA?BV?$vector@UcudaDeviceProp@@V?$allocator@UcudaDeviceProp@@@std@@@std@@XZ)
kineto.lib(CudaDeviceProperties.cpp.obj) : error LNK2019: unresolved external symbol cudaGetDeviceProperties referenced in function "class std::vector<struct cudaDeviceProp,class std::allocator<struct cudaDeviceProp> > const __cdecl libkineto::createDeviceProps(void)" (?createDeviceProps@libkineto@@YA?BV?$vector@UcudaDeviceProp@@V?$allocator@UcudaDeviceProp@@@std@@@std@@XZ)
bin\torch_cpu.dll : fatal error LNK1120: 2 unresolved externals
ninja: build stopped: subcommand failed.
Both symbols boil down to CudaDeviceProperties.cpp. It is possible to link cudart_static.lib into torch_cpu.dll statically and hide the symbols, but that might leave multiple instances of cudart objects living in a PyTorch application, which is a big no-no IMHO.
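For reference, the rejected option above could be sketched in CMake roughly as follows, assuming the CUDA::cudart_static imported target provided by FindCUDAToolkit (CMake 3.17+); this is an illustration, not the actual build change:

```cmake
# Hypothetical sketch: link the static CUDA runtime into torch_cpu with
# PRIVATE visibility so the cudart symbols are resolved inside the DLL
# and not re-exported. Rejected above because every DLL that does this
# carries its own copy of the cudart state.
find_package(CUDAToolkit REQUIRED)
target_link_libraries(torch_cpu PRIVATE CUDA::cudart_static)
```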
Simply moving kineto to Caffe2_CUDA_DEPENDENCY_LIBS instead of Caffe2_DEPENDENCY_LIBS causes the following linker errors:
[2/2] Linking CXX shared library bin\torch_cpu.dll
FAILED: bin/torch_cpu.dll lib/torch_cpu.lib
cmd.exe /C "cd . && "C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe" -E vs_link_dll --intdir=caffe2\CMakeFiles\torch_cpu.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100190~1.0\x64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100190~1.0\x64\mt.exe --manifests -- C:\PROGRA~2\MICROS~3\2019\ENTERP~1\VC\Tools\MSVC\1429~1.300\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\torch_cpu.rsp /out:bin\torch_cpu.dll /implib:lib\torch_cpu.lib /pdb:bin\torch_cpu.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /INCREMENTAL:NO -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/caffe2_protos.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/onnx.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx2.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx512.lib && cd ."
LINK: command "C:\PROGRA~2\MICROS~3\2019\ENTERP~1\VC\Tools\MSVC\1429~1.300\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\torch_cpu.rsp /out:bin\torch_cpu.dll /implib:lib\torch_cpu.lib /pdb:bin\torch_cpu.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /INCREMENTAL:NO -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/caffe2_protos.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/onnx.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx2.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx512.lib /MANIFEST /MANIFESTFILE:bin\torch_cpu.dll.manifest" failed (exit code 1120) with the following output:
Creating library lib\torch_cpu.lib and object lib\torch_cpu.exp
profiler_kineto.cpp.obj : error LNK2019: unresolved external symbol "int __cdecl libkineto::systemThreadId(void)" (?systemThreadId@libkineto@@YAHXZ) referenced in function "public: void __cdecl torch::autograd::profiler::`anonymous namespace'::KinetoThreadLocalState::reportClientActivity(struct at::RecordFunction const &,struct torch::autograd::profiler::KinetoObserverContext const *)" (?reportClientActivity@KinetoThreadLocalState@?A0x242587c1@profiler@autograd@torch@@QEAAXAEBURecordFunction@at@@PEBUKinetoObserverContext@345@@Z)
profiler_kineto.cpp.obj : error LNK2019: unresolved external symbol "int __cdecl libkineto::processId(void)" (?processId@libkineto@@YAHXZ) referenced in function "public: void __cdecl torch::autograd::profiler::`anonymous namespace'::KinetoThreadLocalState::reportClientActivity(struct at::RecordFunction const &,struct torch::autograd::profiler::KinetoObserverContext const *)" (?reportClientActivity@KinetoThreadLocalState@?A0x242587c1@profiler@autograd@torch@@QEAAXAEBURecordFunction@at@@PEBUKinetoObserverContext@345@@Z)
profiler_kineto.cpp.obj : error LNK2001: unresolved external symbol "public: virtual void __cdecl libkineto::GenericTraceActivity::log(class libkineto::ActivityLogger &)const " (?log@GenericTraceActivity@libkineto@@UEBAXAEAVActivityLogger@2@@Z)
profiler_kineto.cpp.obj : error LNK2019: unresolved external symbol "public: void __cdecl libkineto::GenericTraceActivity::addMetadata(class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > const &,class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > const &)" (?addMetadata@GenericTraceActivity@libkineto@@QEAAXAEBV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@0@Z) referenced in function "public: void __cdecl torch::autograd::profiler::`anonymous namespace'::KinetoThreadLocalState::finalizeCPUTrace(void)" (?finalizeCPUTrace@KinetoThreadLocalState@?A0x242587c1@profiler@autograd@torch@@QEAAXXZ)
profiler_kineto.cpp.obj : error LNK2019: unresolved external symbol suppressLibkinetoLogMessages referenced in function "void __cdecl torch::autograd::profiler::prepareProfiler(struct torch::autograd::profiler::ProfilerConfig const &,class std::set<enum torch::autograd::profiler::ActivityType,struct std::less<enum torch::autograd::profiler::ActivityType>,class std::allocator<enum torch::autograd::profiler::ActivityType> > const &)" (?prepareProfiler@profiler@autograd@torch@@YAXAEBUProfilerConfig@123@AEBV?$set@W4ActivityType@profiler@autograd@torch@@U?$less@W4ActivityType@profiler@autograd@torch@@@std@@V?$allocator@W4ActivityType@profiler@autograd@torch@@@6@@std@@@Z)
profiler_kineto.cpp.obj : error LNK2019: unresolved external symbol libkineto_init referenced in function "void __cdecl torch::autograd::profiler::prepareProfiler(struct torch::autograd::profiler::ProfilerConfig const &,class std::set<enum torch::autograd::profiler::ActivityType,struct std::less<enum torch::autograd::profiler::ActivityType>,class std::allocator<enum torch::autograd::profiler::ActivityType> > const &)" (?prepareProfiler@profiler@autograd@torch@@YAXAEBUProfilerConfig@123@AEBV?$set@W4ActivityType@profiler@autograd@torch@@U?$less@W4ActivityType@profiler@autograd@torch@@@std@@V?$allocator@W4ActivityType@profiler@autograd@torch@@@6@@std@@@Z)
profiler_kineto.cpp.obj : error LNK2019: unresolved external symbol "class libkineto::LibkinetoApi & __cdecl libkineto::api(void)" (?api@libkineto@@YAAEAVLibkinetoApi@1@XZ) referenced in function "public: class std::unique_ptr<struct at::ObserverContext,struct std::default_delete<struct at::ObserverContext> > __cdecl <lambda_2e598b199b8755931067487f4fea2be6>::operator()(struct at::RecordFunction const &)const " (??R<lambda_2e598b199b8755931067487f4fea2be6>@@QEBA?AV?$unique_ptr@UObserverContext@at@@U?$default_delete@UObserverContext@at@@@std@@@std@@AEBURecordFunction@at@@@Z)
bin\torch_cpu.dll : fatal error LNK1120: 7 unresolved externals
ninja: build stopped: subcommand failed.
I think this indicates a library dependency issue.
For my local build on Linux:
$ ldd build/lib/libtorch_cpu.so | grep cuda
libcudart.so.11.0 => /usr/local/cuda-11.1/lib64/libcudart.so.11.0 (0x00007fa9456cb000)
For official release 1.9.0:
$ ldd /home/guangyunhan/miniconda3/envs/py37/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so | grep cuda
libcudart-6d56b25a.so.11.0 => /home/guangyunhan/miniconda3/envs/py37/lib/python3.7/site-packages/torch/lib/libcudart-6d56b25a.so.11.0 (0x00007fc951e34000)
🤔
$ grep cudart -r build/third_party/tensorpipe
build/third_party/tensorpipe/CMakeFiles/Export/share/cmake/Tensorpipe/TensorpipeTargets.cmake: INTERFACE_LINK_LIBRARIES "tensorpipe;/usr/local/cuda-11.1/lib64/libcudart.so"
I am pretty sure libcudart is coming from tensorpipe. I suspect that if we disable tensorpipe while keeping kineto's CUPTI enabled, we'd hit a similar linker error on Linux.
@gdankel Any comments? If you don't mind, I'll make cudart a PUBLIC interface library of libkineto; that would be much easier.
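That proposal could look roughly like this in kineto's CMake, assuming the CUDA::cudart imported target from FindCUDAToolkit; a sketch of the idea, not the actual patch:

```cmake
# Hypothetical sketch: declare cudart as a PUBLIC dependency of kineto,
# so that every consumer of kineto (torch_cpu.dll included) automatically
# links against the shared CUDA runtime instead of leaving the symbols
# unresolved.
find_package(CUDAToolkit REQUIRED)
target_link_libraries(kineto PUBLIC CUDA::cudart)
```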
Abandoned.
@cloudhan Can you please clarify? Do you mean you don't plan to add this functionality?
It is in a ready state.
But I am simply fed up with the fragile CI infrastructure. To reproduce a simple issue, I'd waste half a day piecing the floating parts together to verify a five-line change, even when I am confident in the change, since there is no public test for it; the tests can only be triggered after import. If some tests fail again, even in the downstream repo, an immediate revert leaves no easy way to verify a potential fix. You again need to piece more floating parts together in an isolated environment, wasting another half a day. Maybe that is because I am an outsider, so I am not trusted.
Anyway, I volunteered, and I choose to opt out, to live a happier day :)
> $ grep cudart -r build/third_party/tensorpipe
> build/third_party/tensorpipe/CMakeFiles/Export/share/cmake/Tensorpipe/TensorpipeTargets.cmake: INTERFACE_LINK_LIBRARIES "tensorpipe;/usr/local/cuda-11.1/lib64/libcudart.so"
> I am pretty sure libcudart is coming from tensorpipe. I suspect that if we disable tensorpipe with kineto cupti enabled, we'd suffer from a similar linker error on linux. @gdankel Any comment, if you don't mind I make cudart a PUBLIC interface library of libkineto, it will be much easier.
I did have a sneaking suspicion that calling cuda APIs from within libkineto could get us in trouble. It may be that we can reimplement the functionality in CudaDeviceProperties.cpp without relying on cudart... would that resolve this issue?
You still depend on the CUDA driver library libcuda.so (and its counterpart on Windows), which in turn makes the library depend on the NVIDIA driver. I think you should implement dlopen/LoadLibrary-style dynamic library loading.
Ultimately, something like tensorflow/core/platform/default/load_library.cc and tensorflow/core/platform/windows/load_library.cc
@cloudhan I have an issue shared with multiple others here: https://discuss.pytorch.org/t/pytorch-profiler-not-profiling-gpu-on-windows/146685, and I think it's related to this issue. My _supported_activities in torch.autograd only shows ProfilerActivity.CPU, and I can read here that it is because CUPTI is not accessible: https://github.com/pytorch/pytorch/blob/7cef7195f616f75bf25a48cf5692f704d35ac4b2/torch/profiler/profiler.py#L33
Did you ever find a workaround for Windows users? (I have CUDA installed, CUDA_PATH is set correctly, and regular PyTorch uses CUDA just fine.)
Hi, @cloudhan @aaronenyeshi @SorenJ89 @cowwoc
Is kineto fully supported on Windows now? From the PR https://github.com/pytorch/pytorch/pull/62175/files#diff-12e8125164bbfc7556b1781a8ed516e333cc0bf058acb7197f7415be44606c72R1904, kineto on Windows was enabled, right?
However, Nikita reverted it (we do not know why), and we found the code below in the torch CMake: when the platform is Windows, CUPTI is disabled in kineto anyway. https://github.com/pytorch/pytorch/blob/663e7600652f042dffaa1061709f867d86b3a58e/cmake/Dependencies.cmake#L1584-L1586
We wonder what the next plan is for enabling kineto/CUPTI on Windows. From messages in the community, CUPTI itself is not well supported on Windows?
Thank you.
Some chats from Slack:
In summary: on Windows, cupti64*.dll, see https://github.com/pytorch/builder/pull/815