FurkanGozukara opened this issue 1 month ago
It might be that the torchao version is too low ("torchao==0.1"); we introduced quantize_ in 0.4.0, I think: https://github.com/pytorch/ao/releases. In the meantime, our packages are only available on Linux and macOS right now, I think.
Can you try updating torchao? I don't think the top-level quantize_ API is available in 0.1.
But from what I understand, torch.compile() does not work on Windows because Triton lacks Windows support, and we use Triton to codegen our quantization kernels, so I wouldn't expect this to work on Windows.
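Since the wheel pip resolves on Windows predates the quantize_ API, a quick version gate avoids a confusing ImportError. A minimal sketch (the helper name is illustrative, not a torchao API), assuming quantize_ first shipped in 0.4.0 per the release notes linked above:

```python
# Hedged sketch: compare a dotted version string against the minimum
# release that ships quantize_ (0.4.0, per the torchao release notes).
def at_least(version: str, minimum: str) -> bool:
    """Numeric, part-by-part comparison of dotted version strings."""
    parse = lambda s: tuple(int(p) for p in s.split("."))
    return parse(version) >= parse(minimum)

print(at_least("0.1", "0.4.0"))    # the old wheel pip finds on Windows
print(at_least("0.6.0", "0.4.0"))  # a source build new enough for quantize_
```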
I am about to test the latest version, @jerryzh168, thank you. I will write the results here.
The latest version pip finds is:
ERROR: Could not find a version that satisfies the requirement torchao==0.4.0 (from versions: 0.0.1, 0.0.3, 0.1)
How can I install the latest version on Windows? Python 3.10, Windows 10.
If you have a wheel link, I can install it directly.
We don't have a Windows build today, I think. cc @atalman, can you provide a pointer on supporting a Windows build as well?
Awesome, waiting to test. Thank you so much.
You can probably install torchao from source. If you don't need the CUDA extensions, you can do
USE_CPP=0 pip install git+https://github.com/pytorch/ao
But again, since torch.compile() doesn't work on Windows, it's not very useful.
Is there a way to make the quantizations work on Windows + NVIDIA GPU without torch.compile and the Inductor backend? I am mostly concerned about inference speedups.
I'm also in need of the wheel for torchao on Windows to get quantization working for Flux, CogVideoX, etc. in my app. I'm fine without compile, but the other features are really needed to optimize VRAM. I tried installing from GitHub and running setup.py install from a clone, but it gave me errors. Hoping we can run something newer than v0.1 soon. Thanks.
so true
By this logic ("use Linux, not Windows"), why do we even have Python on Windows? PyTorch on Windows? xFormers on Windows? If such things aren't necessary on Windows?
I don't get the logic of forcing people to use Linux. If we follow this mindset, why do we have all of these on Windows?
If you don't need the CUDA extensions (right now they are only for backing FPx and sparse marlin kernels I think), and you don't mind the lack of torch.compile() support, you can install torchao from source on Windows like I mentioned previously
set USE_CPP=0
pip install git+https://github.com/pytorch/ao
I don't have access to a Windows machine right now, so I just googled how to set environment variable on Windows here. You might need to adjust accordingly.
You are welcome to improve the torchao experience on Windows. In fact, there are past PRs by the community, including me, that helped build torchao successfully on Windows, including CUDA extension support.
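The env-var syntax differs between cmd (`set USE_CPP=0`) and PowerShell (`$env:USE_CPP = "0"`). A hedged, cross-platform sketch of the same install, driven from Python so the shell syntax stops being a stumbling block (the subprocess call is left commented out because it performs a real network install):

```python
# Hedged sketch: set USE_CPP=0 for the torchao source build without
# depending on cmd vs PowerShell env-var syntax.
import os
import subprocess
import sys

env = dict(os.environ, USE_CPP="0")  # skip building the C++/CUDA extensions
cmd = [sys.executable, "-m", "pip", "install",
       "git+https://github.com/pytorch/ao"]
# subprocess.run(cmd, env=env, check=True)  # uncomment to actually install
print(env["USE_CPP"])
```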
Thanks for the reply. I have a couple of clarifying questions: it seems that previously one was able to build torchao with CUDA extension support on Windows. What changed since then? Also, since torch.compile is not available on Windows, what kind of speedups (if any) on GPU can we expect for normal PyTorch models quantized by torchao?
@abhi-vandit Since there is no Windows CI, there is no guarantee that new CUDA extensions in torchao can be built correctly on Windows. However, most of the errors usually come from Unix-specific features, so the fix is usually simple, e.g. #951 #396. I think torchao welcomes small fixes like these.
I mentioned not building CUDA extensions previously since it's usually quite involved to set up C++ and CUDA compilers on Windows. So if you don't need the CUDA extensions, it's not really worth the effort.
what kind of speedups (if any) on GPU can we expect for normal PyTorch models quantized by torchao
I think most likely you will only see a slowdown. Perhaps you can still get some memory savings.
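The memory savings alone can be worth it. A hedged back-of-envelope sketch (the parameter count is illustrative, not a measured figure) of why weight-only quantization roughly halves weight VRAM even when it does not speed anything up:

```python
# Hedged back-of-envelope: memory to store the weights alone under
# different bit widths; activations and KV caches are ignored.
def weight_gib(n_params: int, bits_per_weight: int) -> float:
    """Weight storage in GiB for a given per-weight bit width."""
    return n_params * bits_per_weight / 8 / 2**30

n = 12_000_000_000  # roughly Flux-sized, for illustration only
print(f"fp16: {weight_gib(n, 16):.1f} GiB")
print(f"int8: {weight_gib(n, 8):.1f} GiB")
```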
@gau-nernst Thanks for the prompt reply. Hope this changes in the near future and we are able to utilize quantization for inference-time speedups on Windows as well.
This worked:
(venv) C:\Users\Furkan\Videos\a\venv\Scripts>pip freeze
filelock==3.13.1
fsspec==2024.2.0
Jinja2==3.1.3
MarkupSafe==2.1.5
mpmath==1.3.0
networkx==3.2.1
numpy==1.26.3
pillow==10.2.0
sympy==1.12
torch==2.4.1+cu124
torchao==0.6.0+git83d5b63
torchaudio==2.4.1+cu124
torchvision==0.19.1+cu124
typing_extensions==4.9.0
Just want to share that I've successfully installed Triton on Windows and called torch.compile: https://github.com/jakaline-dev/Triton_win/issues/2
Update: I've published Triton wheels in my fork, and torchao.quantization.autoquant just works after installing torchao 0.5.0 from source: https://github.com/woct0rdho/triton-windows
@woct0rdho Did you notice any performance regressions between Windows and Linux? Because if this works, this is very cool; we should consider making a broader announcement on pytorch.org if you're interested.
I did not do serious profiling yet. I don't dual-boot Windows and Linux on the same machine, so I can only test Windows vs WSL on the same machine, and profiling memory in WSL can be very tricky.
What I'm sure of is that autoquant indeed reduces memory usage for models like SDXL and Flux on Windows. For now I can also run these models without quantization, but I think it can be crucial for users with smaller GPUs.
So keep in mind that APIs like quantize_() will make your model smaller but will not necessarily accelerate it, since we rely heavily on later running torch.compile() to get competitive performance.
So one sanity check you can do is to make sure the Triton-generated kernels look reasonable by running TORCH_LOGS="output_code" python script.py
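The sanity check above can also be driven programmatically. A minimal sketch, assuming a hypothetical script.py that calls torch.compile (the subprocess call is commented out since script.py is not part of this thread):

```python
# Hedged sketch: launch a compiled script with Inductor's generated-code
# logging enabled, equivalent to TORCH_LOGS="output_code" python script.py
import os
import subprocess
import sys

env = dict(os.environ, TORCH_LOGS="output_code")
# subprocess.run([sys.executable, "script.py"], env=env)  # script.py is hypothetical
print(env["TORCH_LOGS"])
```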
Triton is already working. I've tried some simple test scripts, and I've seen users, including myself, get speedups when running large models like Flux and CogVideoX, but not in all cases.
Some reports are here: https://www.reddit.com/r/StableDiffusion/comments/1g45n6n/triton_3_wheels_published_for_windows_and_working/
Yeah, my sense is we can be a bit more principled about measuring performance. For example, running this on all of pytorch/benchmark and seeing if there are serious perf gaps between Windows and Linux. Because if the gap is small, or gets smaller over time, we could perhaps take a bigger dependency on your Triton fork and recommend people use it.
cc @xuzhao9 who maintains torchbench
It would be amazing if we could close the performance gap between Windows and Linux.
Yeah, thank you. I'll try to catch up with this in my spare time.
I installed Triton on Windows (python.exe -m pip install triton-3.1.0-cp310-cp310-win_amd64.whl), but I cannot install torchao from source because of this error:
"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\bin\HostX86\x64\link.exe" /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:C:\Users\Admin\Desktop\TorchAO\venv\lib\site-packages\torch\lib "/LIBPATH:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\lib\x64" /LIBPATH:C:\Users\Admin\Desktop\TorchAO\venv\libs "/LIBPATH:C:\Program Files\Python310\libs" "/LIBPATH:C:\Program Files\Python310" /LIBPATH:C:\Users\Admin\Desktop\TorchAO\venv\PCbuild\amd64 "/LIBPATH:C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\ATLMFC\lib\x64" "/LIBPATH:C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\lib\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\lib\um\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.22621.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\\lib\10.0.22621.0\\um\x64" c10.lib torch.lib torch_cpu.lib torch_python.lib cudart.lib c10_cuda.lib torch_cuda.lib /EXPORT:PyInit__C C:\Users\Admin\Desktop\TorchAO\ao\build\temp.win-amd64-cpython-310\Release\torchao\csrc\cuda\fp6_llm\fp6_linear.obj C:\Users\Admin\Desktop\TorchAO\ao\build\temp.win-amd64-cpython-310\Release\torchao\csrc\cuda\sparse_marlin\marlin_kernel_nm.obj C:\Users\Admin\Desktop\TorchAO\ao\build\temp.win-amd64-cpython-310\Release\torchao\csrc\cuda\tensor_core_tiled_layout\tensor_core_tiled_layout.obj C:\Users\Admin\Desktop\TorchAO\ao\build\temp.win-amd64-cpython-310\Release\torchao\csrc\init.obj /OUT:build\lib.win-amd64-cpython-310\torchao\_C.cp310-win_amd64.pyd /IMPLIB:C:\Users\Admin\Desktop\TorchAO\ao\build\temp.win-amd64-cpython-310\Release\torchao\csrc\cuda\fp6_llm\_C.cp310-win_amd64.lib
Creating library C:\Users\Admin\Desktop\TorchAO\ao\build\temp.win-amd64-cpython-310\Release\torchao\csrc\cuda\fp6_llm\_C.cp310-win_amd64.lib and object C:\Users\Admin\Desktop\TorchAO\ao\build\temp.win-amd64-cpython-310\Release\torchao\csrc\cuda\fp6_llm\_C.cp310-win_amd64.exp
fp6_linear.obj : error LNK2001: unresolved external symbol "void __cdecl SplitK_Reduction(struct __half *,float *,unsigned __int64,unsigned __int64,int)" (?SplitK_Reduction@@YAXPEAU__half@@PEAM_K2H@Z)
build\lib.win-amd64-cpython-310\torchao\_C.cp310-win_amd64.pyd : fatal error LNK1120: 1 unresolved externals
error: command 'C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.41.34120\\bin\\HostX86\\x64\\link.exe' failed with exit code 1120
[end of output]
What am I missing?
@blap It looks like an error when linking against CUDA.
I installed on Windows and it is failing on:
from torchao.quantization import quantize_
pip freeze