The build logs show that the wheels were built with CUDA, so I'm not sure what is going on there.
Which wheel/command did you use to install?
Just did some testing and there appears to be a bug in the webui's llama-cpp-python code causing it to always load the CPU-only version. Use the installation command from this repo's README to replace the CPU-only version with the CUDA version until it is fixed.
Command that I used:
python -m pip install llama-cpp-python --prefer-binary --upgrade --extra-index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX2/cu121
May need to tweak it for your system.
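For context, the webui picks between the two packages at runtime with an import-fallback pattern roughly like the one below (a simplified sketch, not the actual webui code; the bug mentioned above presumably lives in selection logic of this kind):

# Simplified sketch of a CPU/CUDA package selector (illustrative only).
try:
    import llama_cpp_cuda          # renamed CUDA build, if installed
except ImportError:
    llama_cpp_cuda = None

try:
    import llama_cpp               # standard CPU-only build
except ImportError:
    llama_cpp = None

def llama_cpp_lib(prefer_cpu: bool = False):
    # Return the CUDA build when it is available, otherwise fall back to CPU.
    if not prefer_cpu and llama_cpp_cuda is not None:
        return llama_cpp_cuda
    return llama_cpp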
IIRC it seemed like llama_cpp_cuda imported llama_cpp, because when I tried it manually in a terminal it would throw an exception unless both versions were installed. I used the two packages from the webui's requirements.txt and then adjusted them for other versions. I will test some more on Wednesday.
I have the same problem. 0.2.14 was busted and giving wrong output. I moved to 0.2.18 and it's not using the GPU; there's no mention of offloading like there usually would be. When I didn't install the CPU version, it failed as the other person mentions, which isn't the usual behavior.
It's definitely not using the GPU.
Perhaps it's these two, and other similar lines in llama-cpp-python that import by absolute name rather than by relative import. Python packaging always gives me nightmares, but it feels like these might not be compatible with renaming the package to llama_cpp_cuda.
Nah. That shouldn't cause any issues. Of all the headaches that arise from Python packaging, namespaces and variables are handled pretty well. If the import names were an issue, then the renamed package wouldn't have been installable alongside the main one at all.
What I mean is that if you rename the CUDA-compiled version to llama_cpp_cuda, the line import llama_cpp.llama_cpp as llama_cpp
will use the non-CUDA version that exists on the system under that name. It would only affect the renamed package when it is installed in parallel with the CPU-only version. If you use the CUDA-compiled wheel you linked earlier, which I believe keeps the original name, things ought to work, but I haven't had time to verify it myself yet. It's also consistent with llama_cpp_cuda throwing an exception if llama_cpp isn't also installed:
pip install 'https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.17+cu121-cp311-cp311-manylinux_2_31_x86_64.whl'
>>> import llama_cpp_cuda
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/johan/Code/PycharmProjects/llaminatrix/venv/lib64/python3.11/site-packages/llama_cpp_cuda/__init__.py", line 2, in <module>
from .llama import *
File "/home/johan/Code/PycharmProjects/llaminatrix/venv/lib64/python3.11/site-packages/llama_cpp_cuda/llama.py", line 25, in <module>
from .llama_grammar import LlamaGrammar
File "/home/johan/Code/PycharmProjects/llaminatrix/venv/lib64/python3.11/site-packages/llama_cpp_cuda/llama_grammar.py", line 22, in <module>
import llama_cpp.llama_cpp as llama_cpp
ModuleNotFoundError: No module named 'llama_cpp'
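The absolute import in the last frame is what drags in the separately installed package. As a minimal illustration of the difference (module names taken from the traceback above; this is a sketch, not the actual source):

# Inside llama_cpp_cuda/llama_grammar.py (illustrative):

# Before: absolute import, resolved against sys.path, so it picks up whatever
# package is installed under the name "llama_cpp" (the CPU-only build), or
# raises ModuleNotFoundError if no such package is installed.
#   import llama_cpp.llama_cpp as llama_cpp

# After: relative import, resolved against the containing package, so the
# renamed llama_cpp_cuda package keeps using its own CUDA-enabled modules.
from . import llama_cpp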
Ya, but IMO it's not normal practice to reference the package by its absolute name inside itself. The imports should be relative; then downstream things would work fine.
But since they did it that way, and may have no urge to change it, it has to be adjusted for.
I just submitted a PR to replace the absolute imports. It compiles (for CPU, with the original name) and their tests still run, so hopefully it doesn't break anything, but I can't easily compile the renamed CUDA version of the package to verify that it solves this issue.
I'll make a build for it to test real quick and replace the wheels in this repo if it works.
@JohanAR The fix from your PR worked, so I'm going to be rebuilding the llama-cpp-python-cuda packages that are affected with that fix.
Awesome, thanks!
@jllllll do you have any contact with abetlen? Perhaps you could ask him to merge the PR, if you have been able to confirm that it solves the issue, so you don't have to maintain workarounds for upstream issues.
I don't have any direct contact with abetlen, so I wouldn't be able to do much. It isn't much work to implement the workaround in my workflows. I already have a decent collection of such workarounds implemented for a variety of build issues I've encountered. I implement them on a per-version basis, so I can just leave it in the workflow without needing to worry about it.
It would certainly be nice for your PR to be merged. However, at the end of the day, the actual issue only arises because the package is being used and repackaged in ways that were never intended. It certainly wasn't abetlen's intent for people to have multiple copies of the package installed under a single Python installation.
A more 'proper' fix would be for abetlen to implement a way to include both CPU and GPU binaries in the main package. This is what ctransformers does, and it works quite well for them, albeit with a more complicated and tedious build process. Alternatively, the llama.cpp devs could add a proper toggle to fully disable GPU support so that CUDA builds can still run CPU-only. As it stands, the only reason the llama-cpp-python-cuda packages exist is that the CUDA builds can't load on a system without CUDA support, which necessitates separate installations for CPU-only and CUDA.
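For illustration, a single-package approach along those lines could bundle both binaries and pick one at import time, roughly like this (purely a sketch with made-up library names, not ctransformers' or llama-cpp-python's actual loader):

# Hypothetical loader that bundles both a CPU and a CUDA shared library in one
# package and chooses at import time (all names below are illustrative).
import ctypes
from pathlib import Path

def _cuda_runtime_present() -> bool:
    # Crude check: can the CUDA runtime library be loaded at all?
    for name in ("libcudart.so.12", "libcudart.so"):
        try:
            ctypes.CDLL(name)
            return True
        except OSError:
            continue
    return False

def load_backend(pkg_dir: Path) -> ctypes.CDLL:
    # Prefer the CUDA binary when a CUDA runtime is available, otherwise
    # fall back to the CPU-only binary bundled in the same package.
    candidates = ["libllama_cpu.so"]
    if _cuda_runtime_present():
        candidates.insert(0, "libllama_cuda.so")
    for name in candidates:
        lib_path = pkg_dir / name
        if lib_path.exists():
            try:
                return ctypes.CDLL(str(lib_path))
            except OSError:
                continue  # e.g. missing CUDA driver libraries; try the next one
    raise RuntimeError("no usable llama backend library found")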
I've already rebuilt all of the relevant llama-cpp-python-cuda packages with the fix. This includes 0.2.14-0.2.18. The first absolute import was in a commit in 0.2.14, but only affected a less-used portion of the code, so the issue wasn't as noticeable in that version.
Aight, no worries. Just thought there was a possibility of you hanging out on the same Discord servers or something :)
Personally I agree with Pseudotensor that it looks a bit strange to use absolute imports within a Python package. It might not cause any issues for the average user, but I don't see any reason not to clean it up.
Either way it's great that you managed to create a workaround! I guess we can close this issue again?
I'll go ahead and close it for now, but I do anticipate needing to reopen it at some point. I wouldn't be surprised if additional absolute imports are added eventually given that the current ones are spread across multiple commits over several weeks.
The workaround I used parses all of the Python scripts to search for the imports and change them as needed, so additional absolute imports won't necessarily be a problem in the future:
# Collect the Python scripts in the renamed package directory
$pyScripts = (Get-ChildItem $(Join-Path '.' 'llama_cpp_cuda' '*.py'))
# Rewrite each file in place, replacing the absolute imports with relative ones
$pyScripts.fullname.foreach({New-Item $_ -itemType File -value (Get-Content $_ -raw).replace('import llama_cpp.llama','from . import llama') -force})
It searches for import llama_cpp.llama and changes it to from . import llama.
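For anyone applying the same patch outside of PowerShell, a rough Python equivalent of that search-and-replace would be (a sketch, not part of the actual build workflow):

# Rewrite absolute llama_cpp imports as relative ones in the renamed package
# (rough Python equivalent of the PowerShell snippet above).
from pathlib import Path

for script in Path("llama_cpp_cuda").glob("*.py"):
    text = script.read_text()
    patched = text.replace("import llama_cpp.llama", "from . import llama")
    if patched != text:
        script.write_text(patched)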
When I tried upgrading the llama-cpp-python wheel in text-generation-webui to 0.2.17, I noticed that it no longer uses CUDA for inference. I also tested it in a smaller project and it behaved the same there. No errors are printed to the terminal. Downgrading to 0.2.14 or lower works.