mlc-ai / xgrammar

Efficient, Flexible and Portable Structured Generation
https://xgrammar.mlc.ai/
Apache License 2.0

`RuntimeError: CUDA kernel compilation failure` on `KernelStore.compile(logits.device.index)` #99

Open AlbertoCastelo opened 2 days ago

AlbertoCastelo commented 2 days ago

hey folks!

Running the example fails at `model.generate(...)` with the following stack trace:

sourceCode.cu(1): catastrophic error: cannot open source file "cuda_fp16.h"
  #include <cuda_fp16.h>
                        ^

1 catastrophic error detected in the compilation of "sourceCode.cu".
Compilation terminated.

CUDA error code=6(b'NVRTC_ERROR_COMPILATION')

RuntimeError                              Traceback (most recent call last)
File /usr/local/lib/python3.10/dist-packages/xgrammar/kernels/apply_token_bitmask_inplace_cuda.py:170, in KernelStore.compile(cls, device_id)
    163     opts = [
    164         b"--fmad=true",
    165         arch_arg,
   (...)
    168         b"-default-device",
    169     ]
--> 170     checkCudaErrors(nvrtc.nvrtcCompileProgram(prog, len(opts), opts))
    171 except RuntimeError as err:

File /usr/local/lib/python3.10/dist-packages/xgrammar/kernels/apply_token_bitmask_inplace_cuda.py:88, in checkCudaErrors(result)
     87 if result[0].value:
---> 88     raise RuntimeError(
     89         "CUDA error code={}({})".format(result[0].value, _cudaGetErrorEnum(result[0]))
     90     )
     91 if len(result) == 1:

RuntimeError: CUDA error code=6(b'NVRTC_ERROR_COMPILATION')

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
Cell In[10], line 3
      1 xgr_logits_processor = xgr.contrib.hf.LogitsProcessor(compiled_grammar)
----> 3 generated_ids = model.generate(
      4     **model_inputs, max_new_tokens=512, logits_processor=[xgr_logits_processor]
      5 )
      6 generated_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
      7 print(tokenizer.decode(generated_ids, skip_special_tokens=True))

File /usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:2215, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   2207     input_ids, model_kwargs = self._expand_inputs_for_generation(
   2208         input_ids=input_ids,
   2209         expand_size=generation_config.num_return_sequences,
   2210         is_encoder_decoder=self.config.is_encoder_decoder,
   2211         **model_kwargs,
   2212     )
   2214     # 12. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
-> 2215     result = self._sample(
   2216         input_ids,
   2217         logits_processor=prepared_logits_processor,
   2218         stopping_criteria=prepared_stopping_criteria,
   2219         generation_config=generation_config,
   2220         synced_gpus=synced_gpus,
   2221         streamer=streamer,
   2222         **model_kwargs,
   2223     )
   2225 elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
   2226     # 11. prepare beam search scorer
   2227     beam_scorer = BeamSearchScorer(
   2228         batch_size=batch_size,
   2229         num_beams=generation_config.num_beams,
   (...)
   2234         max_length=generation_config.max_length,
   2235     )

File /usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:3223, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, **model_kwargs)
   3220 next_token_logits = next_token_logits.to(input_ids.device)
   3222 # pre-process distribution
-> 3223 next_token_scores = logits_processor(input_ids, next_token_logits)
   3225 # Store scores, attentions and hidden_states when required
   3226 if return_dict_in_generate:

File /usr/local/lib/python3.10/dist-packages/transformers/generation/logits_process.py:104, in LogitsProcessorList.__call__(self, input_ids, scores, **kwargs)
    102         scores = processor(input_ids, scores, **kwargs)
    103     else:
--> 104         scores = processor(input_ids, scores)
    106 return scores

File /usr/local/lib/python3.10/dist-packages/xgrammar/contrib/hf.py:95, in LogitsProcessor.__call__(self, input_ids, scores)
     93 if device_type != "cuda":
     94     scores = scores.to("cpu")
---> 95 xgr.apply_token_bitmask_inplace(scores, self.token_bitmask.to(scores.device))
     96 if device_type != "cuda":
     97     scores = scores.to(device_type)

File /usr/local/lib/python3.10/dist-packages/xgrammar/matcher.py:110, in apply_token_bitmask_inplace(logits, bitmask, indices)
    104     raise ValueError(
    105         "logits and bitmask should be on the same device. "
    106         + f"But got logits.device: {logits.device}, bitmask.device: {bitmask.device}"
    107     )
    109 if logits.device.type == "cuda":
--> 110     apply_token_bitmask_inplace_cuda(logits, bitmask, indices)
    111 elif logits.device.type == "cpu":
    112     apply_token_bitmask_inplace_cpu(logits, bitmask, indices)

File /usr/local/lib/python3.10/dist-packages/xgrammar/kernels/apply_token_bitmask_inplace_cuda.py:235, in apply_token_bitmask_inplace_cuda(logits, bitmask, indices)
    232     indices = torch.tensor(indices, dtype=torch.int32, device=logits.device)
    234 # Compile the kernel.
--> 235 kernel = KernelStore.compile(logits.device.index)
    237 # Setup kernel launching arguments.
    238 bitmask_size = math.ceil(vocab_size / BITS_PER_BLOCK)

File /usr/local/lib/python3.10/dist-packages/xgrammar/kernels/apply_token_bitmask_inplace_cuda.py:178, in KernelStore.compile(cls, device_id)
    176     print(log.decode())
    177     print(err)
--> 178     raise RuntimeError("CUDA kernel compilation failure")
    180 dataSize = checkCudaErrors(nvrtc.nvrtcGetPTXSize(prog))
    181 data = b" " * dataSize

RuntimeError: CUDA kernel compilation failure

Environment

CUDA

NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2

After some initial errors, I also installed the following packages:

python3 -m pip install cuda-python nvidia-cuda-nvrtc-cu12

pip freeze

accelerate==1.1.1
aiohappyeyeballs==2.4.3
aiohttp==3.11.7
aiosignal==1.3.1
airportsdata==20241001
annotated-types==0.7.0
anyio==4.6.2.post1
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
async-lru==2.0.4
async-timeout==5.0.1
attrs==24.2.0
babel==2.16.0
beautifulsoup4==4.12.3
bitsandbytes==0.44.1
bleach==6.2.0
blobfile==3.0.0
cachetools==5.5.0
certifi==2024.8.30
cffi==1.17.1
charset-normalizer==3.4.0
click==8.1.7
cloudpickle==3.1.0
comm==0.2.2
cuda-python==12.6.2.post1
datasets==3.1.0
dbus-python==1.2.18
debugpy==1.8.9
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.8
diskcache==5.6.3
distro==1.7.0
docker-pycreds==0.4.0
docstring_parser==0.16
exceptiongroup==1.2.2
executing==2.1.0
fastjsonschema==2.20.0
filelock==3.16.1
fire==0.7.0
fqdn==1.5.1
frozenlist==1.5.0
fsspec==2024.9.0
gitdb==4.0.11
GitPython==3.1.43
google-api-core==2.23.0
google-auth==2.36.0
google-cloud-aiplatform==1.73.0
google-cloud-bigquery==3.27.0
google-cloud-core==2.4.1
google-cloud-resource-manager==1.13.1
google-cloud-storage==2.18.2
google-crc32c==1.6.0
google-resumable-media==2.7.2
googleapis-common-protos==1.66.0
grpc-google-iam-v1==0.13.1
grpcio==1.68.0
grpcio-status==1.68.0
h11==0.14.0
hf_transfer==0.1.8
httpcore==1.0.7
httpx==0.27.2
huggingface-hub==0.26.2
idna==3.10
iniconfig==2.0.0
inquirerpy==0.3.4
interegular==0.3.3
ipykernel==6.29.5
ipython==8.29.0
isoduration==20.11.0
jedi==0.19.2
Jinja2==3.1.4
json5==0.9.28
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
jupyter-events==0.10.0
jupyter-lsp==2.2.5
jupyter_client==8.6.3
jupyter_core==5.7.2
jupyter_server==2.14.2
jupyter_server_terminals==0.5.3
jupyterlab==4.3.1
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
kagglehub==0.3.4
lark==1.2.2
lxml==5.3.0
MarkupSafe==2.1.5
matplotlib-inline==0.1.7
mistune==3.0.2
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.4.2
notebook_shim==0.2.4
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-cusparselt-cu12==0.6.2
nvidia-ml-py==12.535.161
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
nvitop==1.3.2
omegaconf==2.4.0.dev3
outlines==0.1.5
outlines_core==0.1.17
overrides==7.7.0
packaging==24.2
pandas==2.2.3
pandocfilters==1.5.1
parso==0.8.4
petname==2.6
pexpect==4.9.0
pfzy==0.3.4
pillow==11.0.0
platformdirs==4.3.6
pluggy==1.5.0
prometheus_client==0.21.0
prompt_toolkit==3.0.48
propcache==0.2.0
proto-plus==1.25.0
protobuf==5.28.3
psutil==6.1.0
ptyprocess==0.7.0
pure_eval==0.2.3
pyarrow==18.0.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pybind11==2.13.6
pycountry==24.6.1
pycparser==2.22
pycryptodomex==3.21.0
pydantic==2.10.1
pydantic_core==2.27.1
Pygments==2.18.0
PyGObject==3.42.1
pytest==8.3.3
python-dateutil==2.9.0.post0
python-json-logger==2.0.7
pytorch-triton==3.1.0+cf34004b8a
pytz==2024.2
PyYAML==6.0.2
pyzmq==26.2.0
referencing==0.35.1
regex==2024.11.6
requests==2.32.3
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rpds-py==0.21.0
rsa==4.9
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.12
safetensors==0.4.6.dev0
Send2Trash==1.8.3
sentencepiece==0.2.0
sentry-sdk==2.19.0
setproctitle==1.3.4
shapely==2.0.6
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
soupsieve==2.6
ssh-import-id==5.11
stack-data==0.6.3
sympy==1.13.1
termcolor==2.5.0
terminado==0.18.1
tiktoken==0.8.0
tinycss2==1.4.0
tokenizers==0.20.3
tomli==2.1.0
torch==2.6.0.dev20241121+cu124
torchao==0.7.0.dev20241121+cu124
torchtune==0.5.0.dev20241121+cu124
torchvision==0.20.0.dev20241121+cu124
tornado==6.4.2
tqdm==4.67.0
traitlets==5.14.3
transformers==4.46.3
types-python-dateutil==2.9.0.20241003
typing_extensions==4.12.2
tzdata==2024.2
uri-template==1.3.0
urllib3==2.2.3
wandb==0.18.7
wcwidth==0.2.13
webcolors==24.11.1
webencodings==0.5.1
websocket-client==1.8.0
xgrammar==0.1.4
xxhash==3.5.0
yarl==1.18.0

Ubospica commented 1 day ago

Hi @AlbertoCastelo, thanks for your bug report! We confirmed that this is because CUDA 12.2 does not have `cuda_fp16.h`. We will fix it soon.
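
For anyone who wants to check whether `cuda_fp16.h` is present on their machine, here is a minimal sketch. It only looks in a few conventional locations (the `CUDA_HOME` and `CUDA_PATH` environment variables and `/usr/local/cuda/include`, which are assumptions about a typical setup), and it does not reproduce how NVRTC itself resolves include paths. The helper names `find_cuda_header` and `candidate_include_dirs` are made up for illustration.

```python
import os
from pathlib import Path


def find_cuda_header(header_name, include_dirs):
    """Return the first existing path to header_name in include_dirs, or None."""
    for directory in include_dirs:
        candidate = Path(directory) / header_name
        if candidate.is_file():
            return candidate
    return None


def candidate_include_dirs():
    """Collect include directories to search (assumed locations, adjust as needed)."""
    dirs = []
    # CUDA_HOME / CUDA_PATH are conventional environment variables, if they are set.
    for var in ("CUDA_HOME", "CUDA_PATH"):
        root = os.environ.get(var)
        if root:
            dirs.append(Path(root) / "include")
    # Default toolkit location, matching the /usr/local/cuda paths seen in this thread.
    dirs.append(Path("/usr/local/cuda/include"))
    return dirs


if __name__ == "__main__":
    found = find_cuda_header("cuda_fp16.h", candidate_include_dirs())
    print(found if found else "cuda_fp16.h not found in the searched directories")
```

If the header is reported as missing, that is consistent with the `cannot open source file "cuda_fp16.h"` error above, but it is not proof of the cause on its own.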

roG0d commented 1 day ago

@AlbertoCastelo I was facing a similar error on the same lines of code:

/usr/local/cuda/include/cuda/std/detail/libcxx/include/limits(344): error: floating constant is out of range
      _LIBCUDACXX_INLINE_VISIBILITY static _LIBCUDACXX_CONSTEXPR type denorm_min() _NOEXCEPT {return __FLT_DENORM_MIN__;}
                                                                                                     ^

/usr/local/cuda/include/cuda/std/detail/libcxx/include/limits(396): error: floating constant is out of range
      _LIBCUDACXX_INLINE_VISIBILITY static _LIBCUDACXX_CONSTEXPR type denorm_min() _NOEXCEPT {return __DBL_DENORM_MIN__;}
                                                                                                     ^

2 errors detected in the compilation of "sourceCode.cu".

CUDA error code=6(b'NVRTC_ERROR_COMPILATION')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/xgrammar/kernels/apply_token_bitmask_inplace_cuda.py", line 170, in compile
    checkCudaErrors(nvrtc.nvrtcCompileProgram(prog, len(opts), opts))
  File "/usr/local/lib/python3.10/dist-packages/xgrammar/kernels/apply_token_bitmask_inplace_cuda.py", line 88, in checkCudaErrors
    raise RuntimeError(
RuntimeError: CUDA error code=6(b'NVRTC_ERROR_COMPILATION')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/ebnf_xgrammar.py", line 33, in <module>
    generated_ids = model.generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2215, in generate
    result = self._sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 3223, in _sample
    next_token_scores = logits_processor(input_ids, next_token_logits)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/logits_process.py", line 104, in __call__
    scores = processor(input_ids, scores)
  File "/usr/local/lib/python3.10/dist-packages/xgrammar/contrib/hf.py", line 95, in __call__
    xgr.apply_token_bitmask_inplace(scores, self.token_bitmask.to(scores.device))
  File "/usr/local/lib/python3.10/dist-packages/xgrammar/matcher.py", line 110, in apply_token_bitmask_inplace
    apply_token_bitmask_inplace_cuda(logits, bitmask, indices)
  File "/usr/local/lib/python3.10/dist-packages/xgrammar/kernels/apply_token_bitmask_inplace_cuda.py", line 235, in apply_token_bitmask_inplace_cuda
    kernel = KernelStore.compile(logits.device.index)
  File "/usr/local/lib/python3.10/dist-packages/xgrammar/kernels/apply_token_bitmask_inplace_cuda.py", line 178, in compile
    raise RuntimeError("CUDA kernel compilation failure")
RuntimeError: CUDA kernel compilation failure

The solution was to install NVIDIA CUDA Toolkit 12.4 together with a driver that supports it (550.90.12 in my case).

I hope it helps!
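
A quick way to compare the CUDA version shown in the `nvidia-smi` header against 12.4 is sketched below. It only parses the header text; the function names `parse_cuda_version` and `meets_minimum` are hypothetical, and the 12.4 threshold is taken from this thread rather than from any official compatibility table.

```python
import re


def parse_cuda_version(smi_header):
    """Extract 'CUDA Version: X.Y' from nvidia-smi header text as an (X, Y) tuple."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_header)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version, minimum=(12, 4)):
    """True if a parsed version tuple is at least the given minimum."""
    return version is not None and version >= minimum


# Header line as reported in the original post of this issue.
header = "NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2"
version = parse_cuda_version(header)
print(version, meets_minimum(version))  # prints: (12, 2) False
```

Note that the version in the `nvidia-smi` header is the one reported by the driver, which can differ from the toolkit version actually installed.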

Ubospica commented 23 hours ago

Hi @roG0d, you are right: this problem is caused by the cuda-python compilation workflow being incompatible with CUDA versions prior to 12.4.

We just released a new version in which we switched to a Triton kernel implementation and no longer use cuda-python. This should be compatible with all CUDA versions. Please try:

pip install xgrammar==v0.1.5.rc1

and the problem should be fixed.

We will release v0.1.5.rc1 first, and if no problems occur, we will release v0.1.5 later today.

@AlbertoCastelo, this should fix your problem as well!
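
To confirm which xgrammar version actually ended up in the environment after upgrading, something like the following could be used. This is only a sketch built on the standard library's `importlib.metadata`; `installed_version` and `release_tuple` are illustrative helper names, and the comparison looks only at the numeric release segment, so `0.1.5rc1` and `0.1.5` are treated the same.

```python
import re
from importlib import metadata


def installed_version(package):
    """Return the installed version string of package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def release_tuple(version):
    """Leading numeric release segment as a tuple, e.g. 'v0.1.5.rc1' -> (0, 1, 5)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", version)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


if __name__ == "__main__":
    version = installed_version("xgrammar")
    if version is None:
        print("xgrammar is not installed in this environment")
    else:
        release = release_tuple(version)
        has_fix = release is not None and release >= (0, 1, 5)
        print(f"xgrammar {version}: release >= 0.1.5 is {has_fix}")
```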

zcasanova commented 20 hours ago

I got the following error with xgrammar==v0.1.5.rc1:

/tmp/tmpamzpmugc/main.c:5:10: fatal error: Python.h: No such file or directory
    5 | #include <Python.h>
      |          ^~~~~~~~~~
compilation terminated.

I just had to install the Python header files, and everything is working now :-)

sudo apt update
sudo apt install python3.11-dev

Ubospica commented 19 hours ago

Hi @zcasanova, thanks for your feedback. The Python headers are a requirement for building XGrammar, but they are not necessary to run XGrammar on the Python side.
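
To see whether the Python development headers mentioned above are present for the interpreter in use, a small check like this may help. It relies on the standard library's `sysconfig` to locate the interpreter's include directory; the helper names are made up for this sketch, and on some systems the headers may live somewhere this lookup does not cover.

```python
import sysconfig
from pathlib import Path


def python_header_path():
    """Path where the running interpreter expects Python.h to be."""
    include_dir = sysconfig.get_paths()["include"]
    return Path(include_dir) / "Python.h"


def has_python_headers():
    """True if Python.h exists at the expected location."""
    return python_header_path().is_file()


if __name__ == "__main__":
    path = python_header_path()
    status = "found" if has_python_headers() else "missing"
    print(f"{path}: {status}")
```

If the file is reported as missing, installing the `-dev` package that matches the interpreter version (as in the `apt` commands above) is the usual fix on Debian-based systems.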