turingmotors / heron

Apache License 2.0

CUDA 11.8 failed #18

Open Topology1225 opened 1 year ago

Topology1225 commented 1 year ago

I'm reaching out to share a potential issue. While I've managed to resolve it on my end, others following the README's setup instructions might run into it.

Here's my setup:

After setting up via poetry as outlined in the README and running ./script/run.sh, I ran into the following error:

Traceback (most recent call last):
  File "~/heron/.venv/bin/deepspeed", line 3, in <module>
    from deepspeed.launcher.runner import main
  File "~/heron/.venv/lib/python3.10/site-packages/deepspeed/__init__.py", line 10, in <module>
    import torch
  File "~/heron/.venv/lib/python3.10/site-packages/torch/__init__.py", line 229, in <module>
    from torch._C import *  # noqa: F403
ImportError: libcudnn.so.8: cannot open shared object file: No such file or directory
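
The ImportError above means the dynamic loader could not find libcudnn.so.8 when torch was imported. A minimal sketch for checking this without importing torch is shown below; the helper name can_load_shared_library is hypothetical and not part of heron, and the check only reports whether the loader can open the library from the current environment.

```python
import ctypes


def can_load_shared_library(name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library is missing or cannot be loaded.
        return False
    return True


if __name__ == "__main__":
    # libcudnn.so.8 is the library named in the traceback.
    lib = "libcudnn.so.8"
    status = "found" if can_load_shared_library(lib) else "NOT found"
    print(f"{lib}: {status}")
```

If this reports "NOT found", the installed torch wheel likely expects a cuDNN that is neither bundled with it nor visible on the system library path.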

I noticed the README lists CUDA 11.7 as the expected version, which suggests the default install may not work on CUDA 11.8. Given this, I reinstalled PyTorch from the CUDA 11.8 wheel index, which I registered as a Poetry source with:

poetry source add torch_cu118 --priority=explicit https://download.pytorch.org/whl/cu118

This fixed the issue and ./script/run.sh ran without any hitches. I've documented this to help anyone who might face this in the future.
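
After a reinstall like the one above, one way to confirm which CUDA build of PyTorch ended up in the virtual environment is sketched below. The function name describe_torch_build is hypothetical. It only reads attributes that torch itself exposes, and it returns the error text instead of raising if the import still fails.

```python
import importlib


def describe_torch_build() -> dict:
    """Summarize the installed torch build, or report why it cannot be imported."""
    try:
        torch = importlib.import_module("torch")
    except (ImportError, OSError) as exc:
        # A missing shared library such as libcudnn.so.8 surfaces here.
        return {"import_ok": False, "error": str(exc)}
    return {
        "import_ok": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built against (None for CPU-only builds).
        "built_for_cuda": torch.version.cuda,
        # cuDNN version torch can see (None if cuDNN is unavailable).
        "cudnn_version": torch.backends.cudnn.version(),
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

Run with the project's interpreter (for example via poetry run python), a cu118 wheel would be expected to show 11.8 under built_for_cuda.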

If it helps, I'm happy to submit a pull request updating the pyproject.toml. If this isn't the right place for such feedback, please feel free to close this issue.

Thank you.

Ino-Ichan commented 1 year ago

@Topology1225 Thank you for conducting the operational check and providing a detailed report! I truly appreciate you sharing such valuable insights. It would be wonderful if you could submit a pull request.