Closed. mc-nv closed this issue 3 weeks ago.
cc: @pranavsharma
@tianleiwu, have you tried building the CUDA EP with CUDA 12.6? I wonder if cutlass needs to be updated to fit CUDA 12.6.
@wangyems, it seems there is a build error in the MOE GEMM code with CUDA 12.6. Please help take a look:
tmpxft_000010f0_00000000-7_moe_gemm_kernels_fp16_fp16.cudafe1.cpp
D:\git\onnxruntime\build\cuda12\Release\_deps\cutlass-src\include\cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp(136): error C2061: syntax error: identifier 'SharedStorage' [D:\git\onnxruntime\build\cuda12\Release\onnxruntime_providers_cuda_obj.vcxproj]
D:\git\onnxruntime\build\cuda12\Release\_deps\cutlass-src\include\cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp(136): note: the template instantiation context (the oldest one first) is
D:\git\onnxruntime\build\cuda12\Release\_deps\cutlass-src\include\cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp(60): note: while compiling class template partial specialization 'cutlass::gemm::kernel::GemmUniversal<ProblemShape,CollectiveMainloop,CollectiveEpilogue,TileScheduler_,enable_if<std::is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong,CollectiveMainloop::DispatchPolicy::Schedule>,void>::type>'
D:\git\onnxruntime\build\cuda12\Release\_deps\cutlass-src\include\cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp(124): note: while compiling class 'cutlass::gemm::kernel::GemmUniversal<ProblemShape,CollectiveMainloop,CollectiveEpilogue,TileScheduler_,enable_if<std::is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong,CollectiveMainloop::DispatchPolicy::Schedule>,void>::type>::SharedStorage'
D:\git\onnxruntime\build\cuda12\Release\_deps\cutlass-src\include\cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp(133): note: while compiling class 'cutlass::gemm::kernel::GemmUniversal<ProblemShape,CollectiveMainloop,CollectiveEpilogue,TileScheduler_,enable_if<std::is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong,CollectiveMainloop::DispatchPolicy::Schedule>,void>::type>::SharedStorage::PipelineStorage'
D:\git\onnxruntime\build\cuda12\Release\_deps\cutlass-src\include\cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp(140): error C3646: 'math_wg_order': unknown override specifier [D:\git\onnxruntime\build\cuda12\Release\onnxruntime_providers_cuda_obj.vcxproj]
tmpxft_0000160c_00000000-7_image_scaler_impl.cudafe1.cpp
C:\Program Files\Microsoft Visual Studio\2022\Enterprise\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.5.targets(799,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvcc.exe" --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.40.33807\bin\HostX64\x64" -x cu -I"D:\git\onnxruntime\build\cuda12\Release\_deps\utf8_range-src" -ID:\git\onnxruntime\include\onnxruntime -ID:\git\onnxruntime\include\onnxruntime\core\session -I"D:\git\onnxruntime\build\cuda12\Release\_deps\pytorch_cpuinfo-src\include" -ID:\git\onnxruntime\build\cuda12\Release -ID:\git\onnxruntime\onnxruntime -I"D:\git\onnxruntime\build\cuda12\Release\_deps\abseil_cpp-src" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\safeint-src" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\gsl-src\include" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\date-src\include" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\onnx-src" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\onnx-build" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\protobuf-src\src" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\flatbuffers-src\include" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\cutlass-src\include" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\cutlass-src\examples" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\cutlass-src\tools\util\include" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\eigen-src" -I"C:\nvidia\TensorRT-10.0.1.6.Windows10.win10.cuda-12.4\TensorRT-10.0.1.6\include" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\cudnn_frontend-src\include" -I"D:\git\onnxruntime\build\cuda12\Release\_deps\mp11-src\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" -I"C:\nvidia\cudnn-windows-x86_64-9.1.1.17_cuda12-archive\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" --keep-dir onnxrunt.7C32413E\x64\Release -maxrregcount=0 --machine 64 --compile -cudart shared -allow-unsupported-compiler --expt-relaxed-constexpr --Werror default-stream-launch -Xcudafe --diag_suppress=bad_friend_decl -Xcudafe --diag_suppress=unsigned_compare_with_zero -Xcudafe --diag_suppress=expr_has_no_effect -include algorithm -std=c++17 --generate-code=arch=compute_89,code=[compute_89,sm_89] -Xcudafe --diag_suppress=conversion_function_not_usable --threads 1 -Werror all-warnings -Xcompiler="/MP4 /guard:cf /Qspectre /Ob2 /EHsc -Ob2 -Zi /utf-8 /sdl /experimental:external /external:W0 /external:templates- /external:ID:/git/onnxruntime/cmake /external:ID:/git/onnxruntime/build/cuda12/Release /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4834 /wd4127 /Zc:__cplusplus" -DWIN32 -D_WINDOWS -D_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR -DWINAPI_FAMILY=100 -DWINVER=0x0A00 -D_WIN32_WINNT=0x0A00 -DNTDDI_VERSION=0x0A000000 -DONNXRUNTIME_ENABLE_INTEL_METEOR_LAKE_MOBILE_PLATFORM_PERF_PATCH -DNDEBUG -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_FLASH_ATTENTION=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN_HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D"CMAKE_INTDIR=\"Release\"" -D_MBCS -DWIN32 -D_WINDOWS -D_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR -DWINAPI_FAMILY=100 -DWINVER=0x0A00 -D_WIN32_WINNT=0x0A00 -DNTDDI_VERSION=0x0A000000 -DONNXRUNTIME_ENABLE_INTEL_METEOR_LAKE_MOBILE_PLATFORM_PERF_PATCH -DNDEBUG -DEIGEN_HAS_C99_MATH -DCPUINFO_SUPPORTED -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_FLASH_ATTENTION=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN_HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D"CMAKE_INTDIR=\"Release\"" -Xcompiler "/EHsc /W4 /nologo /O2 /FS /MD /GR" -Xcompiler "/Fdonnxruntime_providers_cuda_obj.dir\Release\onnxruntime_providers_cuda_obj.pdb" -o onnxruntime_providers_cuda_obj.dir\Release\moe_gemm_kernels_fp16_fp16.obj "D:\git\onnxruntime\onnxruntime\contrib_ops\cuda\moe\ft_moe\moe_gemm_kernels_fp16_fp16.cu"" exited with code 2. [D:\git\onnxruntime\build\cuda12\Release\onnxruntime_providers_cuda_obj.vcxproj]
I've tried to downgrade CUDA to 12.5.1 and compile against the rel-1.19.0 branch on Windows. Every time the build fails with an out-of-memory message from cutlass:
(CustomBuild target) ->
C:\workspace\build\Release\_deps\cutlass-src\include\cute/int_tuple.hpp(51): catastrophic error : out of memory [C:\workspace\build\Release\onnxruntime_providers_cuda.vcxproj] [C:\tmp\tritonbuild\onnxruntime\build\ort_target.vcxproj]
@mc-nv, for a machine with 32 GB of memory, try limiting parallelism (for example, --parallel 4 --nvcc_threads 1), as in the following, to avoid OOM:
build.bat --cmake_generator "Visual Studio 17 2022" --config Release ^
--build_wheel --build_shared_lib ^
--use_cuda --cuda_version "12.5" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5" ^
--cudnn_home "C:\nvidia\CuDNN\9.1.1.17_cuda12" ^
--use_tensorrt --tensorrt_home "C:\nvidia\TensorRT\10.0.1.6.cuda-12.4" ^
--parallel 4 --nvcc_threads 1 ^
--skip_tests ^
--use_binskim_compliant_compile_flags ^
--cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=89
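To see why these two flags matter together: the number of simultaneous compiler processes is roughly the product of --parallel and --nvcc_threads, and each cutlass-heavy translation unit can peak at several GB. A minimal back-of-envelope sketch of that budget (the ~7 GB per-job figure is an assumption for illustration, not a measured constant):

```python
def max_concurrent_jobs(parallel: int, nvcc_threads: int) -> int:
    """Worst-case number of simultaneous device compilations: each of the
    `parallel` build jobs may fan out into `nvcc_threads` nvcc threads."""
    return parallel * nvcc_threads

def fits_in_ram(parallel: int, nvcc_threads: int,
                gb_per_job: float, ram_gb: float) -> bool:
    # Hypothetical check: gb_per_job is an assumed peak per translation
    # unit, not a value reported by the build system.
    return max_concurrent_jobs(parallel, nvcc_threads) * gb_per_job <= ram_gb

# --parallel 4 --nvcc_threads 1 on a 32 GB machine, assuming ~7 GB/job:
print(fits_in_ram(4, 1, 7.0, 32.0))   # True: ~28 GB worst case
print(fits_in_ram(8, 2, 7.0, 32.0))   # False: 16 jobs would need ~112 GB
```

The default build settings oversubscribe a 32 GB machine, which is why lowering both flags avoids the "catastrophic error : out of memory" failures.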
I had tried the change proposed above in a docker image configured with:

tool | version
---|---
BUILDTOOLS_VERSION | 17.9.34622.214
CUDA_VERSION | 12.5.1
CUDNN_VERSION | 9.3.0.75
PYTHON_VERSION | 3.10.11
TENSORRT_VERSION | 10.3.0.26
VCPKG_VERSION | 2024.03.19
Running the command below:
RUN build.bat --cmake_generator "Visual Studio 17 2022" --config Release --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=60;61;70;75;80;86;90" --skip_submodule_sync --parallel 4 --nvcc_threads 1 --build_shared_lib --compile_no_warning_as_error --skip_tests --update --build --build_dir /workspace/build --use_cuda --cuda_version "12.5" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5" --cudnn_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5" --use_tensorrt --tensorrt_home "/tensorrt"
produces the following "catastrophic error : out of memory" statements:
18>C:\workspace\build\Release\_deps\cutlass-src\include\cute/layout_composed.hpp(478): catastrophic error : out of memory [C:\workspace\build\Release\onnxruntime_providers_cuda.vcxproj] [C:\tmp\tritonbuild\onnxruntime\build\ort_target.vcxproj]
return composition(a.layout_a(), a.offset(), zipped_divide(a.layout_b(), b));
^
detected during:
instantiation of "auto cute::zipped_divide(const cute::ComposedLayout<A, O, B> &, const Tiler &) [with A=cute::Swizzle<3, 3, 3>, O=cute::_0, B=cute::Layout<cute::tuple<cute::_128, cute::tuple<cute::_64, cute::_2>>, cute::tuple<cute::_64, cute::tuple<cute::C<1>, cute::C<8192>>>>, Tiler=cute::tuple<cute::C<64>, cute::C<16>>]" at line 179 of C:\workspace\build\Release\_deps\cutlass-src\include\cute/atom/copy_atom.hpp
instantiation of "auto cute::TiledCopy<Copy_Atom, LayoutCopy_TV, ShapeTiler_MN>::tidfrg_S(STensor &&) [with Copy_Atom=cute::Copy_Atom<cute::SM75_U32x4_LDSM_N, cutlass::half_t>, LayoutCopy_TV=cute::Layout<cute::tuple<cute::tuple<cute::_4, cute::_8, cute::_4>, cute::tuple<cute::tuple<cute::_2, cute::_2, cute::_2>, cute::tuple<cute::_1, cute::_1>>>, cute::tuple<cute::tuple<cute::_128, cute::_1, cute::_16>, cute::tuple<cute::tuple<cute::_64, cute::_8, cute::_512>, cute::tuple<cute::_0, cute::_0>>>>, ShapeTiler_MN=cute::tuple<cute::C<64>, cute::C<16>>, STensor=cute::ComposedLayout<cute::Swizzle<3, 3, 3>, cute::_0, cute::Layout<cute::tuple<cute::_128, cute::tuple<cute::_64, cute::_2>>, cute::tuple<cute::_64, cute::tuple<cute::C<1>, cute::C<8192>>>>>]" at line 354 of C:\workspace\build\Release\_deps\cutlass-src\include\cute/atom/copy_atom.hpp
instantiation of "auto cute::ThrCopy<TiledCopy, ThrIdx>::partition_S(STensor &&) const [with TiledCopy=cute::TiledCopy<cute::Copy_Atom<cute::SM75_U32x4_LDSM_N, cutlass::half_t>, cute::Layout<cute::tuple<cute::tuple<cute::_4, cute::_8, cute::_4>, cute::tuple<cute::tuple<cute::_2, cute::_2, cute::_2>, cute::tuple<cute::_1, cute::_1>>>, cute::tuple<cute::tuple<cute::_128, cute::_1, cute::_16>, cute::tuple<cute::tuple<cute::_64, cute::_8, cute::_512>, cute::tuple<cute::_0, cute::_0>>>>, cute::tuple<cute::C<64>, cute::C<16>>>, ThrIdx=int, STensor=cute::Tensor<cute::ViewEngine<cute::smem_ptr<cutlass::half_t *>>, cute::ComposedLayout<cute::Swizzle<3, 3, 3>, cute::_0, cute::Layout<cute::tuple<cute::_128, cute::tuple<cute::_64, cute::_2>>, cute::tuple<cute::_64, cute::tuple<cute::C<1>, cute::C<8192>>>>>> &]" at line 168 of C:\workspace\onnxruntime\onnxruntime\contrib_ops/cuda/bert/flash_attention/flash_fwd_kernel.h
instantiation of "void onnxruntime::flash::compute_attn_1rowblock<Kernel_traits,Is_causal,Is_local,Has_alibi,Is_even_MN,Is_even_K,Return_softmax,Params>(const Params &, int, int, int) [with Kernel_traits=onnxruntime::flash::Flash_fwd_kernel_traits<128, 128, 64, 4, false, false, cutlass::half_t, onnxruntime::flash::Flash_kernel_traits<128, 128, 64, 4, cutlass::half_t>>, Is_causal=true, Is_local=false, Has_alibi=false, Is_even_MN=false, Is_even_K=true, Return_softmax=false, Params=onnxruntime::flash::Flash_fwd_params]" at line 998 of C:\workspace\onnxruntime\onnxruntime\contrib_ops/cuda/bert/flash_attention/flash_fwd_kernel.h
instantiation of "void onnxruntime::flash::compute_attn<Kernel_traits,Is_causal,Is_local,Has_alibi,Is_even_MN,Is_even_K,Return_softmax,Params>(const Params &) [with Kernel_traits=onnxruntime::flash::Flash_fwd_kernel_traits<128, 128, 64, 4, false, false, cutlass::half_t, onnxruntime::flash::Flash_kernel_traits<128, 128, 64, 4, cutlass::half_t>>, Is_causal=true, Is_local=false, Has_alibi=false, Is_even_MN=false, Is_even_K=true, Return_softmax=false, Params=onnxruntime::flash::Flash_fwd_params]" at line 32 of C:\workspace\onnxruntime\onnxruntime\contrib_ops/cuda/bert/flash_attention/flash_fwd_launch_template.h
instantiation of "void onnxruntime::flash::flash_fwd_kernel<Kernel_traits,Is_causal,Is_local,Has_alibi,Is_even_MN,Is_even_K,Return_softmax>(onnxruntime::flash::Flash_fwd_params) [with Kernel_traits=onnxruntime::flash::Flash_fwd_kernel_traits<128, 128, 64, 4, false, false, cutlass::half_t, onnxruntime::flash::Flash_kernel_traits<128, 128, 64, 4, cutlass::half_t>>, Is_causal=true, Is_local=false, Has_alibi=false, Is_even_MN=false, Is_even_K=true, Return_softmax=false]" at line 63 of C:\workspace\onnxruntime\onnxruntime\contrib_ops/cuda/bert/flash_attention/flash_fwd_launch_template.h
instantiation of "void onnxruntime::flash::run_flash_fwd<Kernel_traits,Is_causal>(onnxruntime::flash::Flash_fwd_params &, cudaStream_t) [with Kernel_traits=onnxruntime::flash::Flash_fwd_kernel_traits<128, 128, 64, 4, false, false, cutlass::half_t, onnxruntime::flash::Flash_kernel_traits<128, 128, 64, 4, cutlass::half_t>>, Is_causal=true]" at line 210 of C:\workspace\onnxruntime\onnxruntime\contrib_ops/cuda/bert/flash_attention/flash_fwd_launch_template.h
instantiation of "void onnxruntime::flash::run_mha_fwd_hdim128<T>(onnxruntime::flash::Flash_fwd_params &, cudaStream_t) [with T=cutlass::half_t]" at line 13 of C:\workspace\onnxruntime\onnxruntime\contrib_ops\cuda\bert\flash_attention\flash_fwd_hdim128_fp16_sm80.cu
1 catastrophic error detected in the compilation of "C:/workspace/onnxruntime/onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_fwd_hdim128_fp16_sm80.cu".
Compilation terminated.
flash_fwd_hdim128_fp16_sm80.cu
18>C:\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.5.targets(799,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\bin\nvcc.exe" --use-local-env -ccbin "C:\BuildTools\VC\Tools\MSVC\14.39.33519\bin\HostX64\x64" -x cu -I"C:\workspace\build\Release\_deps\utf8_range-src" -IC:\workspace\onnxruntime\include\onnxruntime -IC:\workspace\onnxruntime\include\onnxruntime\core\session -I"C:\workspace\build\Release\_deps\pytorch_cpuinfo-src\include" -IC:\workspace\build\Release -IC:\workspace\onnxruntime\onnxruntime -I"C:\workspace\build\Release\_deps\abseil_cpp-src" -I"C:\workspace\build\Release\_deps\safeint-src" -I"C:\workspace\build\Release\_deps\gsl-src\include" -I"C:\workspace\build\Release\_deps\date-src\include" -I"C:\workspace\build\Release\_deps\onnx-src" -I"C:\workspace\build\Release\_deps\onnx-build" -I"C:\workspace\build\Release\_deps\protobuf-src\src" -I"C:\workspace\build\Release\_deps\flatbuffers-src\include" -I"C:\workspace\build\Release\_deps\cutlass-src\include" -I"C:\workspace\build\Release\_deps\cutlass-src\examples" -I"C:\workspace\build\Release\_deps\cutlass-src\tools\util\include" -I"C:\workspace\build\Release\_deps\eigen-src" -I\TensorRT\include -I"C:\workspace\build\Release\_deps\mp11-src\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\include" --keep-dir onnxrunt.4B28B068\x64\Release -maxrregcount=0 --machine 64 --compile -cudart shared -allow-unsupported-compiler --expt-relaxed-constexpr --Werror default-stream-launch -Xcudafe --diag_suppress=bad_friend_decl -Xcudafe --diag_suppress=unsigned_compare_with_zero -Xcudafe --diag_suppress=expr_has_no_effect -include algorithm -std=c++17 --generate-code=arch=compute_60,code=[compute_60,sm_60] --generate-code=arch=compute_61,code=[compute_61,sm_61] --generate-code=arch=compute_70,code=[compute_70,sm_70] --generate-code=arch=compute_75,code=[compute_75,sm_75] 
--generate-code=arch=compute_80,code=[compute_80,sm_80] --generate-code=arch=compute_86,code=[compute_86,sm_86] --generate-code=arch=compute_90,code=[compute_90,sm_90] -Xcudafe --diag_suppress=conversion_function_not_usable --threads 1 -Xcompiler="/EHsc -Ob2 -Zi /utf-8 /sdl /experimental:external /external:W0 /external:templates- /external:IC:/workspace/onnxruntime/cmake /external:IC:/workspace/build/Release /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4834 /wd4127 /Zc:__cplusplus" -D_WINDOWS -DNDEBUG -DVER_MAJOR=1 -DVER_MINOR=19 -DVER_BUILD=0 -DVER_PRIVATE=0 -D"VER_STRING=\"ORT_VERSION\"" -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_FLASH_ATTENTION=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN_HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D"CMAKE_INTDIR=\"Release\"" -Donnxruntime_providers_cuda_EXPORTS -D_WINDLL -D_MBCS -DEIGEN_HAS_C99_MATH -DCPUINFO_SUPPORTED -DNDEBUG -DVER_MAJOR=1 -DVER_MINOR=19 -DVER_BUILD=0 -DVER_PRIVATE=0 -D"VER_STRING=\"ORT_VERSION\"" -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_FLASH_ATTENTION=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN_HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline 
-D"CMAKE_INTDIR=\"Release\"" -Donnxruntime_providers_cuda_EXPORTS -Xcompiler "/EHsc /W4 /nologo /O2 /FS /MD /GR" -Xcompiler "/Fdonnxruntime_providers_cuda.dir\Release\vc143.pdb" -o onnxruntime_providers_cuda.dir\Release\flash_fwd_hdim128_fp16_sm80.obj "C:\workspace\onnxruntime\onnxruntime\contrib_ops\cuda\bert\flash_attention\flash_fwd_hdim128_fp16_sm80.cu"" exited with code 1. [C:\workspace\build\Release\onnxruntime_providers_cuda.vcxproj] [C:\tmp\tritonbuild\onnxruntime\build\ort_target.vcxproj]
Compiling CUDA source file ..\..\onnxruntime\onnxruntime\contrib_ops\cuda\bert\flash_attention\flash_fwd_hdim192_bf16_sm80.cu...
@mc-nv,
Could you try upgrading Visual Studio to the latest version?
I tried Visual Studio Enterprise 2022 version 17.11.0 with the latest MSVC v143 build tools, and there is no problem on my machine.
tool | version
---|---
BUILDTOOLS_VERSION | Visual Studio Enterprise 17.11.0 and MSVC v143 (latest)
CUDA_VERSION | 12.5.1
CUDNN_VERSION | 9.3.0.75
PYTHON_VERSION | 3.10.13 (from Anaconda)
TENSORRT_VERSION | 10.3.0.26
Select all the build tools marked as latest in the Visual Studio Installer.
My build script:
pip install cmake numpy --upgrade
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" amd64
build.bat --cmake_generator "Visual Studio 17 2022" --config Release --build_dir build\cuda12 --build_wheel ^
--cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=60;61;70;75;80;86;90" --parallel 4 --nvcc_threads 1 ^
--build_shared_lib --skip_tests --compile_no_warning_as_error ^
--use_cuda --cuda_version "12.5" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5" ^
--cudnn_home "C:\nvidia\cudnn\9.3.0.75_cuda12" --use_tensorrt --tensorrt_home "C:\nvidia\tensorrt\10.3.0.26_cuda12.5"
The build output shows:
-- Selecting Windows SDK version 10.0.22621.0 to target Windows 10.0.22631.
-- The C compiler identification is MSVC 19.41.34120.0
-- The CXX compiler identification is MSVC 19.41.34120.0
-- The ASM compiler identification is MSVC
-- Found assembler: C:/Program Files/Microsoft Visual Studio/2022/Enterprise/VC/Tools/MSVC/14.41.34120/bin/Hostx64/x64/cl.exe
-- Found Python: C:\Users\.conda\envs\py310\python.exe (found suitable version "3.10.13", minimum required is "3.8")
MSBuild version 17.11.2+c078802d4 for .NET Framework
-- The CUDA compiler identification is NVIDIA 12.5.82
-- CMAKE_CUDA_COMPILER_VERSION: 12.5.82
Peak memory usage was about 31 GB during compilation. I used a machine with 32 GB of RAM plus an additional 16 GB of virtual memory (page file).
Related issue: https://github.com/NVIDIA/cutlass/issues/1732
I was not able to compile rel-1.19.0 with:
BUILDTOOLS_VERSION:17.10.35201.131
CMAKE_VERSION:3.30.1
CUDA_VERSION:12.5.1
CUDNN_VERSION:9.3.0.75
PYTHON_VERSION:3.10.11
TENSORRT_VERSION:10.3.0.26
VCPKG_VERSION:2024.03.19
But I was able to build rel-1.18.1 successfully, although I didn't use the suggested latest BuildTools 17.11 and chose 17.10 LTSC instead.
Team, we are facing the same issue with the latest rel-1.19.2 as well. Please suggest a resolution.
I confirm that the following settings successfully build 1.19.2 on Windows:
CUDA: 12.5.1
CUDNN: 9.4.0
Visual Studio 17 2022
TensorRT: 10.4.0.26
plus specifying --compile_no_warning_as_error.
I additionally installed protobuf and zlib and added their binaries to the PATH environment variable.
Same for me: we can build with 👍 CUDA 12.5 but not with 👎 CUDA 12.6.
Succeeded with the following configuration:
BUILDTOOLS_VERSION:17.12.35309.182
CMAKE_VERSION:3.30.1
CUDA_VERSION:12.5.1
CUDNN_VERSION:9.3.0.75
PYTHON_VERSION:3.10.11
TENSORRT_VERSION:10.3.0.26
VCPKG_VERSION:2024.03.19
Hi @snnn, is it possible to update the cutlass version to 3.5.1 in deps.txt, in response to https://github.com/NVIDIA/cutlass/issues/1732?
@mc-nv, https://github.com/microsoft/onnxruntime/pull/21939 has cutlass 3.5.1. In my test, the build is good with CUDA 12.6 Update 1 on Windows using 3.5.1. There is a performance regression of flash attention on H100 with 3.5.1, which is still under investigation.
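For reference, each entry in ONNX Runtime's cmake/deps.txt is a semicolon-separated name;download-URL;SHA1 triple, so a cutlass bump would be a one-line change roughly along these lines (the hash below is a placeholder, not the real checksum, and the exact archive URL should be taken from the PR):

```
cutlass;https://github.com/NVIDIA/cutlass/archive/refs/tags/v3.5.1.zip;<sha1-of-archive>
```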
This should have been resolved in #22316. If not, please reopen this issue.
Describe the issue
Unable to build ONNX Runtime from the release candidate branch on Windows against CUDA 12.6.
Urgency
This issue is vital if the release plans to support CUDA 12.6.
Target platform
Windows
Build script
Error / output
It could be dependency-related.
Visual Studio Version
BUILDTOOLS_VERSION:17.9.34622.214
GCC / Compiler Version
No response