huggingface / optimum-quanto

A pytorch quantization backend for optimum
Apache License 2.0

unsupported Microsoft Visual Studio version! #288

Open MMundane opened 3 weeks ago

MMundane commented 3 weeks ago

Hope this is the right place for this but quanto is causing major headaches.

I specifically installed the VS Build Tools so I could get the cl.exe file required by quanto.

But it's an unsupported version: "unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported!"

This issue SHOULD be fixed by updating CUDA to 12.5. However, that version is unsupported by PyTorch, and it costs money to get an older version of Visual Studio.

It is impossible to fix this problem.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig
from huggingface_hub import login  # Ensure this is imported
from PyQt5.QtWidgets import QApplication, QWidget, QVBoxLayout, QPushButton, QTextEdit, QLabel, QSizePolicy, QLineEdit

class QuestionAnswer:
    def __init__(self, model_name='google/gemma-2-9b-it'):
        self._setup_authentication()
        quantization_config = QuantoConfig(weights="int4")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=quantization_config,
            device_map='cuda'
        )

    def _setup_authentication(self):
        # Read the token from the HF_TOKEN environment variable instead of hardcoding it
        import os
        token = os.environ.get("HF_TOKEN", "")
        login(token)

    def summarize_text(self, prompt, max_length=1024, min_length=124):
        enhanced_prompt = f"Please answer the following DnD question.\n\n{prompt}\n\nAnswer:"

        inputs = self.tokenizer(enhanced_prompt, return_tensors='pt').to(self.model.device)

        outputs = self.model.generate(
            **inputs,
            max_new_tokens=max_length,
            min_length=min_length,
            pad_token_id=self.tokenizer.eos_token_id
        )

        # Decode only the newly generated tokens, skipping the prompt tokens at the start of the output
        prompt_length = inputs["input_ids"].shape[-1]
        clean_tokens = outputs[0][prompt_length:]
        clean_text = self.tokenizer.decode(clean_tokens, skip_special_tokens=True).strip()

        answer = clean_text.split('\n', 1)[0]

        return answer

class AnswerApp(QWidget):
    def __init__(self):
        super().__init__()
        self.initUI()
        self.summarizer = QuestionAnswer()

    def initUI(self):
        self.setWindowTitle('DnD Question Answer')

        # Layout
        layout = QVBoxLayout()

        # Input text box
        self.inputTextBox = QTextEdit(self)
        self.inputTextBox.setPlaceholderText('Enter your DnD text here...')
        layout.addWidget(self.inputTextBox)

        # Max length input field
        self.maxLengthInput = QLineEdit(self)
        self.maxLengthInput.setPlaceholderText('Max length: (default 1024).')
        layout.addWidget(self.maxLengthInput)

        # Min length input field
        self.minLengthInput = QLineEdit(self)
        self.minLengthInput.setPlaceholderText('Min length: (default 124).')
        layout.addWidget(self.minLengthInput)

        # Summarize button
        self.summarizeButton = QPushButton('HIT IT', self)
        self.summarizeButton.clicked.connect(self.summarize_text)
        layout.addWidget(self.summarizeButton)

        # Output label
        self.outputLabel = QLabel('Answer will appear here...', self)
        self.outputLabel.setWordWrap(True)
        self.outputLabel.setSizePolicy(QSizePolicy.Expanding, QSizePolicy.Expanding)
        layout.addWidget(self.outputLabel)

        # Set layout
        self.setLayout(layout)

    def summarize_text(self):
        input_text = self.inputTextBox.toPlainText()
        max_length_text = self.maxLengthInput.text()
        min_length_text = self.minLengthInput.text()
        max_length = 1024  # Default value
        min_length = 124   # Default value

        if max_length_text.isdigit():
            max_length = int(max_length_text)

        if min_length_text.isdigit():
            min_length = int(min_length_text)

        if input_text:
            summary = self.summarizer.summarize_text(input_text, max_length, min_length)
            self.outputLabel.setText(summary)
        else:
            self.outputLabel.setText('Please enter some text to ask.')

# Run the application
if __name__ == "__main__":
    import sys
    app = QApplication(sys.argv)
    summarizerApp = AnswerApp()
    summarizerApp.show()
    sys.exit(app.exec_())
PS C:\Users\noah0\Documents\Python\Yune-Python> & c:/Users/noah0/Documents/Python/Yune-Python/.venv/Scripts/python.exe c:/Users/noah0/Documents/Python/Yune-Python/Projects/ASK/ASK.py
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to C:\Users\noah0\.cache\huggingface\token
Login successful
Could not find the bitsandbytes CUDA binary at WindowsPath('C:/Users/noah0/Documents/Python/Yune-Python/.venv/Lib/site-packages/bitsandbytes/libbitsandbytes_cuda124.dll')
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:18<00:00,  4.61s/it]
C:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\utils\cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
C:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\quanto\library\ops.py:66: UserWarning: An exception was raised while calling the optimized kernel for quanto::unpack: Error building extension 'quanto_cuda': [1/4] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcc --generate-dependencies-with-compile --dependency-output gemv_cuda.cuda.o.d -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4068 -Xcompiler /wd4067 -Xcompiler /wd4624 -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -DTORCH_EXTENSION_NAME=quanto_cuda -DTORCH_API_INCLUDE_EXTENSION_H -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include\TH -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include" -IC:\Users\noah0\AppData\Local\Programs\Python\Python312\Include -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++17 -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -c 
C:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\quanto\library\ext\cuda\awq\v2\gemv_cuda.cu -o gemv_cuda.cuda.o
FAILED: gemv_cuda.cuda.o
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcc --generate-dependencies-with-compile --dependency-output gemv_cuda.cuda.o.d -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4068 -Xcompiler /wd4067 -Xcompiler /wd4624 -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -DTORCH_EXTENSION_NAME=quanto_cuda -DTORCH_API_INCLUDE_EXTENSION_H -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include\TH -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include" -IC:\Users\noah0\AppData\Local\Programs\Python\Python312\Include -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++17 -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -c C:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\quanto\library\ext\cuda\awq\v2\gemv_cuda.cu -o gemv_cuda.cuda.o
gemv_cuda.cu
cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF_OPERATORS__' with '/U__CUDA_NO_HALF_OPERATORS__'
cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF_CONVERSIONS__' with '/U__CUDA_NO_HALF_CONVERSIONS__'
cl : Command line warning D9025 : overriding '/D__CUDA_NO_BFLOAT16_CONVERSIONS__' with '/U__CUDA_NO_BFLOAT16_CONVERSIONS__'
gemv_cuda.cu
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\crt/host_config.h(153): fatal error C1189: #error:  -- unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF_OPERATORS__' with '/U__CUDA_NO_HALF_OPERATORS__'
cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF_CONVERSIONS__' with '/U__CUDA_NO_HALF_CONVERSIONS__'
cl : Command line warning D9025 : overriding '/D__CUDA_NO_BFLOAT16_CONVERSIONS__' with '/U__CUDA_NO_BFLOAT16_CONVERSIONS__'
gemv_cuda.cu
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\crt/host_config.h(153): fatal error C1189: #error:  -- unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
[2/4] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcc --generate-dependencies-with-compile --dependency-output gemm_cuda.cuda.o.d -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4068 -Xcompiler /wd4067 -Xcompiler /wd4624 -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -DTORCH_EXTENSION_NAME=quanto_cuda -DTORCH_API_INCLUDE_EXTENSION_H -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include\TH -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include" -IC:\Users\noah0\AppData\Local\Programs\Python\Python312\Include -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++17 -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -c C:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\quanto\library\ext\cuda\awq\v2\gemm_cuda.cu -o gemm_cuda.cuda.o
FAILED: gemm_cuda.cuda.o
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcc --generate-dependencies-with-compile --dependency-output gemm_cuda.cuda.o.d -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4068 -Xcompiler /wd4067 -Xcompiler /wd4624 -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -DTORCH_EXTENSION_NAME=quanto_cuda -DTORCH_API_INCLUDE_EXTENSION_H -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include\TH -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include" -IC:\Users\noah0\AppData\Local\Programs\Python\Python312\Include -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++17 -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -c C:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\quanto\library\ext\cuda\awq\v2\gemm_cuda.cu -o gemm_cuda.cuda.o
gemm_cuda.cu
cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF_OPERATORS__' with '/U__CUDA_NO_HALF_OPERATORS__'
cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF_CONVERSIONS__' with '/U__CUDA_NO_HALF_CONVERSIONS__'
cl : Command line warning D9025 : overriding '/D__CUDA_NO_BFLOAT16_CONVERSIONS__' with '/U__CUDA_NO_BFLOAT16_CONVERSIONS__'
gemm_cuda.cu
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\crt/host_config.h(153): fatal error C1189: #error:  -- unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF_OPERATORS__' with '/U__CUDA_NO_HALF_OPERATORS__'
cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF_CONVERSIONS__' with '/U__CUDA_NO_HALF_CONVERSIONS__'
cl : Command line warning D9025 : overriding '/D__CUDA_NO_BFLOAT16_CONVERSIONS__' with '/U__CUDA_NO_BFLOAT16_CONVERSIONS__'
gemm_cuda.cu
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\crt/host_config.h(153): fatal error C1189: #error:  -- unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
[3/4] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcc --generate-dependencies-with-compile --dependency-output unpack.cuda.o.d -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4068 -Xcompiler /wd4067 -Xcompiler /wd4624 -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -DTORCH_EXTENSION_NAME=quanto_cuda -DTORCH_API_INCLUDE_EXTENSION_H -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include\TH -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include" -IC:\Users\noah0\AppData\Local\Programs\Python\Python312\Include -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++17 -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -c C:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\quanto\library\ext\cuda\unpack.cu -o unpack.cuda.o
FAILED: unpack.cuda.o
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcc --generate-dependencies-with-compile --dependency-output unpack.cuda.o.d -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4068 -Xcompiler /wd4067 -Xcompiler /wd4624 -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -DTORCH_EXTENSION_NAME=quanto_cuda -DTORCH_API_INCLUDE_EXTENSION_H -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include\TH -IC:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include" -IC:\Users\noah0\AppData\Local\Programs\Python\Python312\Include -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++17 -O3 -std=c++17 -DENABLE_BF16 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads=8 -c C:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\quanto\library\ext\cuda\unpack.cu -o unpack.cuda.o
unpack.cu
cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF_OPERATORS__' with '/U__CUDA_NO_HALF_OPERATORS__'
cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF_CONVERSIONS__' with '/U__CUDA_NO_HALF_CONVERSIONS__'
cl : Command line warning D9025 : overriding '/D__CUDA_NO_BFLOAT16_CONVERSIONS__' with '/U__CUDA_NO_BFLOAT16_CONVERSIONS__'
unpack.cu
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\crt/host_config.h(153): fatal error C1189: #error:  -- unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF_OPERATORS__' with '/U__CUDA_NO_HALF_OPERATORS__'
cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF_CONVERSIONS__' with '/U__CUDA_NO_HALF_CONVERSIONS__'
cl : Command line warning D9025 : overriding '/D__CUDA_NO_BFLOAT16_CONVERSIONS__' with '/U__CUDA_NO_BFLOAT16_CONVERSIONS__'
unpack.cu
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include\crt/host_config.h(153): fatal error C1189: #error:  -- unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
ninja: build stopped: subcommand failed.
 Falling back to default implementation.
  warnings.warn(message + " Falling back to default implementation.")
C:\Users\noah0\Documents\Python\Yune-Python\.venv\Lib\site-packages\quanto\library\ops.py:66: UserWarning: An exception was raised while calling the optimized kernel for quanto::unpack: DLL load failed while importing quanto_cuda: The specified module could not be found. Falling back to default implementation.
  warnings.warn(message + " Falling back to default implementation.")
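
The error text in the log above names the nvcc flag `-allow-unsupported-compiler`, and the earlier warning mentions `TORCH_CUDA_ARCH_LIST`. A possible workaround, sketched here under assumptions and not tested against this setup, is to set both through environment variables before quanto triggers the extension build. The helper name `build_env_with_nvcc_override` is hypothetical. It assumes that nvcc reads the `NVCC_APPEND_FLAGS` environment variable and that the build subprocess inherits the Python process environment. The default `"8.6"` is taken from the `compute_86` entries in the log. As the error message itself warns, an unsupported host compiler may cause compilation failure or incorrect run time execution.

```python
def build_env_with_nvcc_override(base_env, arch_list="8.6"):
    """Return a copy of base_env that asks nvcc to skip its host-compiler version check.

    Assumption: nvcc appends the contents of NVCC_APPEND_FLAGS to its command line.
    """
    env = dict(base_env)
    flag = "-allow-unsupported-compiler"

    # Preserve any flags that are already present and avoid adding the flag twice.
    flags = env.get("NVCC_APPEND_FLAGS", "").split()
    if flag not in flags:
        flags.append(flag)
    env["NVCC_APPEND_FLAGS"] = " ".join(flags)

    # Only set the architecture list if the caller has not chosen one already.
    env.setdefault("TORCH_CUDA_ARCH_LIST", arch_list)
    return env
```

To try it, call `os.environ.update(build_env_with_nvcc_override(os.environ))` at the very top of the script, before the model is loaded, so that the variables are in place when the kernels are compiled.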
dacorvo commented 3 weeks ago

I see you are using the legacy quanto package that was integrated into transformers a few months ago. The package has since been migrated to optimum, as optimum-quanto. Maybe, as a first step in debugging your issue, you could use optimum-quanto directly? cc @SunMarc
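
A minimal sketch of that first step, assuming the legacy distribution is published as `quanto` and the migrated one as `optimum-quanto`: the first helper reports which of the two is installed in the active environment, and the second quantizes an already loaded model through `optimum.quanto` instead of `QuantoConfig`. The function names `installed_versions` and `quantize_int4_weights` are hypothetical, and the second function is untested against the model from this report.

```python
from importlib import metadata


def installed_versions(names=("quanto", "optimum-quanto")):
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def quantize_int4_weights(model):
    """Quantize the weights of a loaded torch model to int4 with optimum-quanto."""
    # Imported lazily so that installed_versions() works without the package.
    from optimum.quanto import freeze, qint4, quantize

    quantize(model, weights=qint4)  # replace supported modules with quantized ones
    freeze(model)                   # convert the float weights to quantized weights
    return model
```

If `installed_versions()` shows a version for `quanto`, the build log paths under `site-packages\quanto` are coming from the legacy package, and installing `optimum-quanto` in its place would be the thing to try.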