Enable Kineto CUPTI on Windows

cloudhan commented 3 years ago

Some chats from slack

me:

@Gisle Dankel Do you have the context of why static linking is required on windows for libkineto?

Gisle:

It’s not - in fact I don’t think static linking is an option on windows? We prefer to link statically because libcupti is not always installed on the system

me:

I am starting to see the reason behind it after exploring my PyTorch installation in C:\Users\guangyunhan\Miniconda3\envs\py37\lib\site-packages\torch. On Windows, all CUDA-related DLLs are installed into <...>\site-packages\torch\lib\ , which means those libraries are packaged in the whl file. The result is a 3GB whl for the user, but it is convenient.

This explains why

libcupti is not always installed on the system

And by

We prefer to link statically

If I understand correctly, to be as convenient as the above, we need to provide the cupti library to the user. Statically linking cupti into the main pytorch/libkineto DLL is one option.

Unfortunately, NV only provides DLLs under C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\extras\CUPTI\lib64 , so the only viable option is to package the cupti library into the whl.
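A quick way to check whether a given installation actually ships the CUPTI DLLs is to scan the torch lib directory for files matching cupti64*.dll. This is only a minimal sketch; the helper name find_cupti_dlls is made up for illustration and is not part of PyTorch or kineto.

```python
from pathlib import Path
from typing import List


def find_cupti_dlls(lib_dir: str) -> List[str]:
    """Return the sorted file names in lib_dir that match cupti64*.dll."""
    directory = Path(lib_dir)
    if not directory.is_dir():
        return []
    # Compare in lower case because Windows file names are case-insensitive.
    return sorted(
        p.name
        for p in directory.iterdir()
        if p.is_file()
        and p.name.lower().startswith("cupti64")
        and p.name.lower().endswith(".dll")
    )
```

On a real installation the argument would be the lib folder next to torch/__init__.py, for example os.path.join(os.path.dirname(torch.__file__), "lib"). An empty list means no CUPTI DLL is packaged there.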

In summary: on windows

[x] also package cupti64*.dll, see https://github.com/pytorch/builder/pull/815
- Because PyTorch packages all CUDA DLLs in the whl
[x] Fix cupti activity timestamp semantics on Windows
- manually convert time since boot --> time since epoch
[ ] Enable CUPTI on Windows, see https://github.com/pytorch/pytorch/pull/62175
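The boot-to-epoch conversion in the second item can be sketched as follows. This is not the actual kineto fix, only an illustration of the idea: sample the difference between the wall clock and a monotonic clock once, then add that offset to every timestamp. It assumes the activity timestamps share their origin with the monotonic clock used here, which may not hold for CUPTI. The function names are hypothetical.

```python
import time


def boot_to_epoch_offset_ns() -> int:
    """Estimate (epoch time - monotonic time) in nanoseconds, sampled once."""
    # Assumption: the timestamps to convert use the same origin as
    # time.monotonic_ns(); the real CUPTI clock may differ.
    return time.time_ns() - time.monotonic_ns()


def to_epoch_ns(ts_since_boot_ns: int, offset_ns: int) -> int:
    """Shift a time-since-boot timestamp onto the Unix epoch timeline."""
    return ts_since_boot_ns + offset_ns
```

Sampling the offset once keeps all converted timestamps consistent with each other, at the cost of ignoring any later wall-clock adjustment.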

cloudhan commented 3 years ago

Simply enabling CUPTI causes torch_cpu.dll to reference cudart symbols:

> ninja.exe .\bin\torch_cpu.dll
[1/1] Linking CXX shared library bin\torch_cpu.dll
FAILED: bin/torch_cpu.dll lib/torch_cpu.lib
cmd.exe /C "cd . && "C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe" -E vs_link_dll --intdir=caffe2\CMakeFiles\torch_cpu.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100190~1.0\x64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100190~1.0\x64\mt.exe --manifests  -- C:\PROGRA~2\MICROS~3\2019\ENTERP~1\VC\Tools\MSVC\1429~1.300\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\torch_cpu.rsp  /out:bin\torch_cpu.dll /implib:lib\torch_cpu.lib /pdb:bin\torch_cpu.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /INCREMENTAL:NO  -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/caffe2_protos.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/onnx.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx2.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx512.lib  && cd ."
LINK: command "C:\PROGRA~2\MICROS~3\2019\ENTERP~1\VC\Tools\MSVC\1429~1.300\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\torch_cpu.rsp /out:bin\torch_cpu.dll /implib:lib\torch_cpu.lib /pdb:bin\torch_cpu.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /INCREMENTAL:NO -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/caffe2_protos.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/onnx.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx2.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx512.lib /MANIFEST /MANIFESTFILE:bin\torch_cpu.dll.manifest" failed (exit code 1120) with the following output:
   Creating library lib\torch_cpu.lib and object lib\torch_cpu.exp
kineto.lib(CudaDeviceProperties.cpp.obj) : error LNK2019: unresolved external symbol cudaGetDeviceCount referenced in function "class std::vector<struct cudaDeviceProp,class std::allocator<struct cudaDeviceProp> > const __cdecl libkineto::createDeviceProps(void)" (?createDeviceProps@libkineto@@YA?BV?$vector@UcudaDeviceProp@@V?$allocator@UcudaDeviceProp@@@std@@@std@@XZ)
kineto.lib(CudaDeviceProperties.cpp.obj) : error LNK2019: unresolved external symbol cudaGetDeviceProperties referenced in function "class std::vector<struct cudaDeviceProp,class std::allocator<struct cudaDeviceProp> > const __cdecl libkineto::createDeviceProps(void)" (?createDeviceProps@libkineto@@YA?BV?$vector@UcudaDeviceProp@@V?$allocator@UcudaDeviceProp@@@std@@@std@@XZ)
bin\torch_cpu.dll : fatal error LNK1120: 2 unresolved externals
ninja: build stopped: subcommand failed.

Both symbols come from CudaDeviceProperties.cpp. It is possible to link cudart_static.lib into torch_cpu.dll statically and hide the symbols, but that might leave multiple instances of cudart objects living in a PyTorch application, which is a big no-no IMHO.

Simply moving kineto to Caffe2_CUDA_DEPENDENCY_LIBS instead of Caffe2_DEPENDENCY_LIBS causes the following linker errors:

[2/2] Linking CXX shared library bin\torch_cpu.dll
FAILED: bin/torch_cpu.dll lib/torch_cpu.lib
cmd.exe /C "cd . && "C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe" -E vs_link_dll --intdir=caffe2\CMakeFiles\torch_cpu.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100190~1.0\x64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100190~1.0\x64\mt.exe --manifests  -- C:\PROGRA~2\MICROS~3\2019\ENTERP~1\VC\Tools\MSVC\1429~1.300\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\torch_cpu.rsp  /out:bin\torch_cpu.dll /implib:lib\torch_cpu.lib /pdb:bin\torch_cpu.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /INCREMENTAL:NO  -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/caffe2_protos.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/onnx.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx2.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx512.lib  && cd ."
LINK: command "C:\PROGRA~2\MICROS~3\2019\ENTERP~1\VC\Tools\MSVC\1429~1.300\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\torch_cpu.rsp /out:bin\torch_cpu.dll /implib:lib\torch_cpu.lib /pdb:bin\torch_cpu.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /INCREMENTAL:NO -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/caffe2_protos.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/onnx.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx2.lib -WHOLEARCHIVE:C:/Users/guangyunhan/workspaces/pytorch/build/lib/Caffe2_perfkernels_avx512.lib /MANIFEST /MANIFESTFILE:bin\torch_cpu.dll.manifest" failed (exit code 1120) with the following output:
   Creating library lib\torch_cpu.lib and object lib\torch_cpu.exp
profiler_kineto.cpp.obj : error LNK2019: unresolved external symbol "int __cdecl libkineto::systemThreadId(void)" (?systemThreadId@libkineto@@YAHXZ) referenced in function "public: void __cdecl torch::autograd::profiler::`anonymous namespace'::KinetoThreadLocalState::reportClientActivity(struct at::RecordFunction const &,struct torch::autograd::profiler::KinetoObserverContext const *)" (?reportClientActivity@KinetoThreadLocalState@?A0x242587c1@profiler@autograd@torch@@QEAAXAEBURecordFunction@at@@PEBUKinetoObserverContext@345@@Z)
profiler_kineto.cpp.obj : error LNK2019: unresolved external symbol "int __cdecl libkineto::processId(void)" (?processId@libkineto@@YAHXZ) referenced in function "public: void __cdecl torch::autograd::profiler::`anonymous namespace'::KinetoThreadLocalState::reportClientActivity(struct at::RecordFunction const &,struct torch::autograd::profiler::KinetoObserverContext const *)" (?reportClientActivity@KinetoThreadLocalState@?A0x242587c1@profiler@autograd@torch@@QEAAXAEBURecordFunction@at@@PEBUKinetoObserverContext@345@@Z)
profiler_kineto.cpp.obj : error LNK2001: unresolved external symbol "public: virtual void __cdecl libkineto::GenericTraceActivity::log(class libkineto::ActivityLogger &)const " (?log@GenericTraceActivity@libkineto@@UEBAXAEAVActivityLogger@2@@Z)
profiler_kineto.cpp.obj : error LNK2019: unresolved external symbol "public: void __cdecl libkineto::GenericTraceActivity::addMetadata(class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > const &,class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > const &)" (?addMetadata@GenericTraceActivity@libkineto@@QEAAXAEBV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@0@Z) referenced in function "public: void __cdecl torch::autograd::profiler::`anonymous namespace'::KinetoThreadLocalState::finalizeCPUTrace(void)" (?finalizeCPUTrace@KinetoThreadLocalState@?A0x242587c1@profiler@autograd@torch@@QEAAXXZ)
profiler_kineto.cpp.obj : error LNK2019: unresolved external symbol suppressLibkinetoLogMessages referenced in function "void __cdecl torch::autograd::profiler::prepareProfiler(struct torch::autograd::profiler::ProfilerConfig const &,class std::set<enum torch::autograd::profiler::ActivityType,struct std::less<enum torch::autograd::profiler::ActivityType>,class std::allocator<enum torch::autograd::profiler::ActivityType> > const &)" (?prepareProfiler@profiler@autograd@torch@@YAXAEBUProfilerConfig@123@AEBV?$set@W4ActivityType@profiler@autograd@torch@@U?$less@W4ActivityType@profiler@autograd@torch@@@std@@V?$allocator@W4ActivityType@profiler@autograd@torch@@@6@@std@@@Z)
profiler_kineto.cpp.obj : error LNK2019: unresolved external symbol libkineto_init referenced in function "void __cdecl torch::autograd::profiler::prepareProfiler(struct torch::autograd::profiler::ProfilerConfig const &,class std::set<enum torch::autograd::profiler::ActivityType,struct std::less<enum torch::autograd::profiler::ActivityType>,class std::allocator<enum torch::autograd::profiler::ActivityType> > const &)" (?prepareProfiler@profiler@autograd@torch@@YAXAEBUProfilerConfig@123@AEBV?$set@W4ActivityType@profiler@autograd@torch@@U?$less@W4ActivityType@profiler@autograd@torch@@@std@@V?$allocator@W4ActivityType@profiler@autograd@torch@@@6@@std@@@Z)
profiler_kineto.cpp.obj : error LNK2019: unresolved external symbol "class libkineto::LibkinetoApi & __cdecl libkineto::api(void)" (?api@libkineto@@YAAEAVLibkinetoApi@1@XZ) referenced in function "public: class std::unique_ptr<struct at::ObserverContext,struct std::default_delete<struct at::ObserverContext> > __cdecl <lambda_2e598b199b8755931067487f4fea2be6>::operator()(struct at::RecordFunction const &)const " (??R<lambda_2e598b199b8755931067487f4fea2be6>@@QEBA?AV?$unique_ptr@UObserverContext@at@@U?$default_delete@UObserverContext@at@@@std@@@std@@AEBURecordFunction@at@@@Z)
bin\torch_cpu.dll : fatal error LNK1120: 7 unresolved externals
ninja: build stopped: subcommand failed.

I think this indicates some library issue.

cloudhan commented 3 years ago

For my local build on linux:

$ ldd build/lib/libtorch_cpu.so | grep cuda 
        libcudart.so.11.0 => /usr/local/cuda-11.1/lib64/libcudart.so.11.0 (0x00007fa9456cb000)

For official release 1.9.0:

$ ldd /home/guangyunhan/miniconda3/envs/py37/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so | grep cuda
        libcudart-6d56b25a.so.11.0 => /home/guangyunhan/miniconda3/envs/py37/lib/python3.7/site-packages/torch/lib/libcudart-6d56b25a.so.11.0 (0x00007fc951e34000)

🤔
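To compare the two builds without eyeballing the output, the ldd text can be reduced to a mapping from soname to resolved path. A small sketch, keeping the same lines that grep cuda would keep; cuda_deps is a hypothetical helper name.

```python
from typing import Dict


def cuda_deps(ldd_output: str) -> Dict[str, str]:
    """Map each soname on a 'cuda' line of ldd output to its resolved path."""
    deps = {}
    for line in ldd_output.splitlines():
        # Keep only resolved entries that `grep cuda` would also keep.
        if "=>" not in line or "cuda" not in line:
            continue
        soname, _, rest = line.partition("=>")
        # Drop the trailing load address, e.g. "(0x00007fa9456cb000)".
        deps[soname.strip()] = rest.split("(")[0].strip()
    return deps
```

Feeding it the two outputs above would give libcudart.so.11.0 for the local build and libcudart-6d56b25a.so.11.0 for the official release, each with its resolved path.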

cloudhan commented 3 years ago

$ grep cudart -r build/third_party/tensorpipe 
build/third_party/tensorpipe/CMakeFiles/Export/share/cmake/Tensorpipe/TensorpipeTargets.cmake:  INTERFACE_LINK_LIBRARIES "tensorpipe;/usr/local/cuda-11.1/lib64/libcudart.so"

I am pretty sure libcudart is coming from tensorpipe. I suspect that if we disable tensorpipe with kineto cupti enabled, we'd suffer from similar linker error on linux.

@gdankel Any comments? If you don't mind, I could make cudart a PUBLIC interface library of libkineto, which would be much easier.

cloudhan commented 3 years ago

Abandoned.

cowwoc commented 3 years ago

@cloudhan Can you please clarify? Do you mean you don't plan to add this functionality?

cloudhan commented 3 years ago

It is in ready state.

But I am simply fed up with the fragile CI infrastructure. To reproduce a simple issue, I'd waste half a day piecing the floating parts together to verify 5 lines of change, even when I am confident in the change, since there is no public test for it; the test can only be triggered after import. If some tests fail again, even in a downstream repo, an immediate revert leaves no easy way to verify a potential fix. You again need to piece more floating parts together in an isolated environment, wasting another half day. Maybe that is because I am an outsider and so am not trusted.

Anyway, I volunteered, and I choose to opt out, to live a happier day :)

gdankel commented 2 years ago

$ grep cudart -r build/third_party/tensorpipe 
build/third_party/tensorpipe/CMakeFiles/Export/share/cmake/Tensorpipe/TensorpipeTargets.cmake:  INTERFACE_LINK_LIBRARIES "tensorpipe;/usr/local/cuda-11.1/lib64/libcudart.so"
I am pretty sure libcudart is coming from tensorpipe. I suspect that if we disable tensorpipe with kineto cupti enabled, we'd suffer from similar linker error on linux.

@gdankel Any comment, if you don't mind I make cudart a PUBLIC interface library of libkineto, it will be much easier.

I did have a sneaking suspicion that calling cuda APIs from within libkineto could get us in trouble. It may be that we can reimplement the functionality in CudaDeviceProperties.cpp without relying on cudart... would that resolve this issue?

cloudhan commented 2 years ago

You would still depend on the CUDA driver library, libcuda.so on Linux and its counterpart on Windows, which in turn makes the library depend on the NVIDIA driver. I think you should implement dlopen/LoadLibrary-style dynamic library loading, ultimately something like tensorflow/core/platform/default/load_library.cc and tensorflow/core/platform/windows/load_library.cc.
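The idea of loading the driver at runtime instead of linking against it can be illustrated with Python's ctypes, which wraps dlopen/LoadLibrary. This is only a sketch of the pattern, not how kineto or TensorFlow implement it. try_load and cuda_device_count are hypothetical names; cuInit and cuDeviceGetCount are CUDA driver API entry points, and nvcuda.dll / libcuda.so.1 are the usual driver library names.

```python
import ctypes
import sys
from typing import Optional, Sequence


def try_load(names: Sequence[str]) -> Optional[ctypes.CDLL]:
    """Return the first library that loads, or None if none is available."""
    for name in names:
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue  # Not installed or not loadable; try the next name.
    return None


# Usual driver library names (64-bit builds assumed).
DRIVER_NAMES = (
    ["nvcuda.dll"] if sys.platform == "win32" else ["libcuda.so.1", "libcuda.so"]
)


def cuda_device_count() -> Optional[int]:
    """Ask the CUDA driver for the device count, or None if unavailable."""
    driver = try_load(DRIVER_NAMES)
    if driver is None:
        return None  # No driver: degrade gracefully instead of failing to load.
    if driver.cuInit(0) != 0:  # 0 is CUDA_SUCCESS
        return None
    count = ctypes.c_int(0)
    if driver.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return None
    return count.value
```

With this pattern a missing driver shows up as a None result at call time rather than as a load failure of the whole library.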

SorenJ89 commented 1 year ago

@cloudhan I have an issue that several others share here: https://discuss.pytorch.org/t/pytorch-profiler-not-profiling-gpu-on-windows/146685 I think it's related to this issue. My _supported_activities in torch.autograd only shows ProfilerActivity.CPU, and I can read here that it is because cupti is not accessible: https://github.com/pytorch/pytorch/blob/7cef7195f616f75bf25a48cf5692f704d35ac4b2/torch/profiler/profiler.py#L33

Did you ever find a workaround for Windows users? (I have CUDA installed, CUDA_PATH is set correctly, and regular PyTorch uses CUDA just fine.)
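For anyone wanting to reproduce this check, a minimal sketch using the public torch.profiler.supported_activities() helper is below. The wrapper name cuda_profiling_supported is made up for illustration, and it returns None when torch is not installed.

```python
from typing import Optional


def cuda_profiling_supported() -> Optional[bool]:
    """True/False if torch reports CUDA profiling support; None without torch."""
    try:
        from torch.profiler import ProfilerActivity, supported_activities
    except ImportError:
        return None
    # Only ProfilerActivity.CPU is listed when CUPTI is not available.
    return ProfilerActivity.CUDA in supported_activities()


if __name__ == "__main__":
    print("CUDA profiling supported:", cuda_profiling_supported())
```

A False result on a machine where regular CUDA work runs fine matches the symptom described above.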

zejun-chen commented 6 days ago

Hi, @cloudhan @aaronenyeshi @SorenJ89 @cowwoc

Is kineto fully supported on the Windows platform now? Judging from the PR https://github.com/pytorch/pytorch/pull/62175/files#diff-12e8125164bbfc7556b1781a8ed516e333cc0bf058acb7197f7415be44606c72R1904, kineto on Windows is enabled, right?

However, nikita reverted it (we do not know why), and we found the code below in the torch cmake. When the platform is Windows, CUPTI is disabled in kineto regardless. https://github.com/pytorch/pytorch/blob/663e7600652f042dffaa1061709f867d86b3a58e/cmake/Dependencies.cmake#L1584-L1586

We wonder what the next plan is for enabling kineto/CUPTI on Windows. From messages in the community, is CUPTI itself not well supported on Windows?

Thank you.

pytorch / kineto

Enable Kineto CUPTI on Windows #356