golemfactory/gamerhash-facade

Investigate problems with VLLM running on Windows #151

Closed nieznanysprawiciel closed 4 months ago

nieznanysprawiciel commented 4 months ago

Investigate potential problems and list possible approaches.

nieznanysprawiciel commented 4 months ago

Problems:

  1. The problem seems to boil down to the lack of a Windows build of the triton dependency; I don't see any indication of a fundamental blocker. The dev team closed the relevant PRs, claiming they have no capacity to maintain Windows builds (source). The code appears to have worked for the people engaged in that PR, at least a few months ago.
  2. There are additional checks in the code that need to be disabled (at least in vllm, but possibly in its dependencies as well). Example here, but there may be more; a sketch of the kind of guard meant follows this list.
  3. PyTorch is installed in its CPU-only version by default and must be manually replaced with the CUDA version (source); see the verification snippet below.
  4. It may turn out in the process that other dependencies need our attention as well. Other suspicious dependencies (claimed here):
  5. We can't be sure that we won't encounter runtime problems after creating custom builds.
  6. There seems to be a small performance penalty on WSL (source).
  7. Custom builds of triton can incur a performance penalty (source).
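
To illustrate point 2: the snippet below is a hypothetical sketch of the kind of platform guard that would have to be patched out or special-cased, not code taken from the actual vllm sources (the function name and message are invented).

```python
import sys

def check_platform_supported() -> None:
    # Hypothetical guard of the kind point 2 refers to: an explicit
    # refusal to run on Windows that a custom build would need to relax.
    if sys.platform == "win32":
        raise RuntimeError(
            "This package does not support Windows. Use Linux or WSL instead."
        )
```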

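For point 3, the installed build can be verified with PyTorch's public API; CPU-only wheels report a version suffix like +cpu, CUDA wheels one like +cu121:

```python
import torch

# CPU-only wheels report versions such as "2.3.1+cpu";
# CUDA wheels carry a suffix such as "+cu121".
print(torch.__version__)

# False on a CPU-only build, or when no NVIDIA GPU/driver is visible.
print(torch.cuda.is_available())
```
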
Options:

  1. Try one of the unofficial community builds of triton for Windows. There are a few options available:
  2. Maintain a fork of triton and build the packages ourselves. There were some successful attempts to do this:
  3. Delegate preparing vllm for Windows to an external company.
  4. Distribute vllm as an optional GamerHash package that requires WSL (the user will be asked and warned that the installation requires elevated privileges).
  5. Remove torch.compile from the vllm code. torch.compile appears to be a code optimization that makes models run faster on CUDA. It may be possible to omit this step and accept the resulting performance penalty; see the sketch after this list. (Note: this is not the same as running on CPU, just running unoptimized CUDA code.) @pwalski managed to run whisper (which depends on PyTorch) without any problems with triton or PyTorch. That could mean that triton is not really necessary for PyTorch.
  6. Give up integrating vllm :(
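
Regarding option 5, a minimal sketch (assuming vllm only reaches torch.compile through the public torch API, which would need to be verified) of neutralizing it without touching the vllm sources: monkeypatch it to an identity wrapper before vllm is imported, so models fall back to eager, uncompiled CUDA execution.

```python
import torch

def _identity_compile(model=None, **kwargs):
    # torch.compile is used both as a direct call and as a decorator
    # factory (torch.compile(model) vs @torch.compile(...)); cover both.
    if model is None:
        return lambda m: m
    return model

# Must run before any vllm import that triggers compilation.
torch.compile = _identity_compile
```

Recent PyTorch also exposes a disable=True keyword on torch.compile and, reportedly, a TORCH_COMPILE_DISABLE environment variable; the latter would be even less invasive if the installed version honors it.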