state-spaces / mamba

Mamba SSM architecture
Apache License 2.0

/usr/bin/ld: cannot find -lcuda #123

Closed: jcrangel closed this issue 8 months ago

jcrangel commented 8 months ago

When running

python benchmarks/benchmark_generation_mamba_simple.py --model-name "state-spaces/mamba-2.8b" --prompt "My cat wrote all this CUDA code for a new language model and" --topp 0.9 --temperature 0.7 --repetition-penalty 1.2

Full Error:

Number of parameters: 2768345600
/usr/bin/ld: cannot find -lcuda
collect2: error: ld returned 1 exit status
Traceback (most recent call last):
  File "/home/julio/repos/ssm/mamba/benchmarks/benchmark_generation_mamba_simple.py", line 80, in <module>
    out = fn()
  File "/home/julio/repos/ssm/mamba/benchmarks/benchmark_generation_mamba_simple.py", line 55, in <lambda>
    fn = lambda: model.generate(
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/mamba_ssm/utils/generation.py", line 244, in generate
    output = decode(
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/mamba_ssm/utils/generation.py", line 145, in decode
    model._decoding_cache = update_graph_cache(
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/mamba_ssm/utils/generation.py", line 305, in update_graph_cache
    cache.callables[batch_size, decoding_seqlen] = capture_graph(
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/mamba_ssm/utils/generation.py", line 339, in capture_graph
    logits = model(
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/mamba_ssm/models/mixer_seq_simple.py", line 233, in forward
    hidden_states = self.backbone(input_ids, inference_params=inference_params)
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/mamba_ssm/models/mixer_seq_simple.py", line 155, in forward
    hidden_states, residual = layer(
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/mamba_ssm/modules/mamba_simple.py", line 340, in forward
    hidden_states, residual = fused_add_norm_fn(
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/mamba_ssm/ops/triton/layernorm.py", line 478, in rms_norm_fn
    return LayerNormFn.apply(x, weight, bias, residual, eps, prenorm, residual_in_fp32, True)
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/mamba_ssm/ops/triton/layernorm.py", line 411, in forward
    y, mean, rstd, residual_out = _layer_norm_fwd(
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/mamba_ssm/ops/triton/layernorm.py", line 155, in _layer_norm_fwd
    _layer_norm_fwd_1pass_kernel[(M,)](
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/triton/runtime/jit.py", line 161, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 144, in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 144, in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 123, in _bench
    return do_bench(kernel_call, warmup=self.num_warmups, rep=self.num_reps, quantiles=(0.5, 0.2, 0.8))
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/triton/testing.py", line 100, in do_bench
    fn()
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 110, in kernel_call
    self.fn.run(
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/triton/runtime/jit.py", line 342, in run
    device = driver.get_current_device()
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/triton/runtime/driver.py", line 22, in __getattr__
    self._initialize_obj()
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/triton/runtime/driver.py", line 19, in _initialize_obj
    self._obj = self._init_fn()
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/triton/runtime/driver.py", line 8, in _create_driver
    return actives[0]()
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/triton/backends/cuda/driver.py", line 378, in __init__
    self.utils = CudaUtils()  # TODO: make static
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/triton/backends/cuda/driver.py", line 47, in __init__
    mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/triton/backends/cuda/driver.py", line 24, in compile_module_from_src
    so = _build(name, src_path, tmpdir, library_dir, include_dir, libraries)
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/triton/runtime/build.py", line 48, in _build
    ret = subprocess.check_call(cc_cmd)
  File "/home/julio/anaconda3/envs/ssm/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpn74ad6q0/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmpn74ad6q0/cuda_utils.cpython-310-x86_64-linux-gnu.so', '-lcuda', '-L/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/triton/backends/cuda/lib', '-I/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/triton/backends/cuda/include', '-I/tmp/tmpn74ad6q0', '-I/home/julio/anaconda3/envs/ssm/include/python3.10']' returned non-zero exit status 1.
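
For context, the traceback shows Triton compiling a small C helper (driver.c) at runtime with gcc and linking it with -lcuda, so the failure is the system linker not finding libcuda.so rather than anything in mamba itself. A minimal sketch that reproduces just that link step outside of Python (the -L path is copied from the gcc command above; the test program is a throwaway placeholder):

# Throwaway C file, linked the same way Triton's runtime build step links its helper.
echo 'int main(void){return 0;}' > /tmp/lcuda_test.c
gcc /tmp/lcuda_test.c -o /tmp/lcuda_test -lcuda \
    -L/home/julio/anaconda3/envs/ssm/lib/python3.10/site-packages/triton/backends/cuda/lib

If this prints the same "cannot find -lcuda", the problem is purely the linker's search path for libcuda.so.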

I have updated to Python 3.10, PyTorch 2.0.1 (CUDA 11.8), causal_conv1d 1.1.1, mamba-ssm 1.1.1, and triton 2.1.0. This doesn't change the error. I'm on Ubuntu 18.04. I also checked that my CUDA libraries are there:

$>echo $LD_LIBRARY_PATH
/usr/local/cuda-11.8/lib64:/usr/local/cuda/lib64::
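
Note that LD_LIBRARY_PATH only affects library loading at runtime; the "cannot find -lcuda" message comes from the link step, which searches the -L directories, LIBRARY_PATH, and the linker's built-in default paths instead. libcuda.so ships with the NVIDIA driver (with a link-time stub under the CUDA toolkit's lib64/stubs), not next to libcudart in lib64, so it helps to see where it actually is. A quick sketch, assuming a standard Ubuntu/toolkit layout:

# Look for the driver library and the toolkit stub in the usual places.
find /usr/lib /usr/lib64 /usr/local/cuda-11.8 -name 'libcuda.so*' 2>/dev/null
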
tridao commented 8 months ago

It's a triton error, try ldconfig?
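
(A quick way to check what the dynamic linker already knows about libcuda, and to rebuild its cache, assuming a standard driver install:)

# List cached entries for libcuda, then rebuild the cache.
ldconfig -p | grep libcuda
sudo ldconfig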

Eupham commented 8 months ago

This works in a Colab env, but yes, it's a Triton error:

!export LC_ALL="en_US.UTF-8"
!export LD_LIBRARY_PATH="/usr/lib64-nvidia"
!export LIBRARY_PATH="/usr/local/cuda/lib64/stubs"
!ldconfig /usr/lib64-nvidia

jcrangel commented 8 months ago

As suggested, it was a Triton error and was solved with:

$ export LC_ALL="en_US.UTF-8"
$ export LIBRARY_PATH="/usr/local/cuda-11.8/lib64/stubs/"
$ sudo ldconfig /usr/local/cuda-11.8/lib64
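
(To make this survive new shells, the same settings can go into ~/.bashrc or the conda env's activation script; the paths below are the ones from this thread, so adjust them to your install:)

# Hypothetical persistent version of the fix, e.g. appended to ~/.bashrc:
export LIBRARY_PATH="/usr/local/cuda-11.8/lib64/stubs/:$LIBRARY_PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH"
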
florinshen commented 1 month ago

A neater solution is to export only a Triton-specific environment variable, as indicated in https://github.com/triton-lang/triton/blob/fd691c67ac20958a67693358186d877790f5f48f/third_party/nvidia/backend/driver.py#L20:

export TRITON_LIBCUDA_PATH=/usr/local/cuda/lib64
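
For example, the variable can be set just for the benchmark invocation (paths as in this thread; point it at whichever directory actually contains libcuda.so on your system):

TRITON_LIBCUDA_PATH=/usr/local/cuda/lib64 python benchmarks/benchmark_generation_mamba_simple.py \
    --model-name "state-spaces/mamba-2.8b" \
    --prompt "My cat wrote all this CUDA code for a new language model and" \
    --topp 0.9 --temperature 0.7 --repetition-penalty 1.2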